Re: [HACKERS] path toward faster partition pruning

2017-11-09 Thread Kyotaro HORIGUCHI
Hello,

At Fri, 10 Nov 2017 14:44:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20171110.144455.117208639.horiguchi.kyot...@lab.ntt.co.jp>

> > Those two conditions are not orthogonal. Maybe something like
> > the following seems more understandable.
> > 
> > > if (!constfalse)
> > > {
> > >   /* No constraints on the keys, so, return *all* partitions. */
> > >   if (nkeys == 0)
> > > return bms_add_range(result, 0, partdesc->nparts - 1);
> > > 
> > >   result = get_partitions_for_keys(relation, &keys);
> > > }

So, the condition (!constfalse && nkeys == 0) cannot return
there. I'm badly confused by the variable name.

I couldn't find another reasonable structure using the current
classify_p_b_keys(), but could you add a comment like the
following as an example?

+ /*
+  * This function processes clauses other than OR expressions and
+  * returns the excluded OR expressions in or_clauses.
+  */
>   nkeys = classify_partition_bounding_keys(relation, clauses,
>&keys, &constfalse,
>&or_clauses);
>   /*
>* Only look up in the partition descriptor if the query provides
>* constraints on the keys at all.
>*/
>   if (!constfalse)
>   {
> if (nkeys > 0)
>   result = get_partitions_for_keys(relation, &keys);
> else
+   /* No constraints on the keys, so all partitions are passed. */
>   result = bms_add_range(result, 0, partdesc->nparts - 1);
>   }
> 
+   /*
+    * At this point we have a partition set for the clauses not
+    * returned in or_clauses. Conjoin it with the result of each
+    * OR clause.
+    */
>   foreach(lc, or_clauses)
>   {
> BoolExpr *or = (BoolExpr *) lfirst(lc);
> ListCell *lc1;
> Bitmapset *or_partset = NULL;
> 
+ Assert(or_clause(or));
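
To make the intent concrete, the loop would presumably continue
along these lines (a sketch only; get_partitions_from_clauses() is
assumed, from the patch's naming, to return a Bitmapset of partition
indexes for a clause list):

foreach(lc, or_clauses)
{
	BoolExpr   *or = (BoolExpr *) lfirst(lc);
	ListCell   *lc1;
	Bitmapset  *or_partset = NULL;

	Assert(or_clause((Node *) or));

	/* union the partitions matched by each arm of the OR */
	foreach(lc1, or->args)
	{
		List	   *arg_clauses = list_make1(lfirst(lc1));

		or_partset = bms_add_members(or_partset,
									 get_partitions_from_clauses(relation,
																 arg_clauses));
	}

	/* AND semantics: keep only partitions allowed by every OR clause */
	result = bms_int_members(result, or_partset);
}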

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] path toward faster partition pruning

2017-11-09 Thread Kyotaro HORIGUCHI
Ooops! The following comment is wrong. Please ignore it.

At Fri, 10 Nov 2017 14:38:11 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20171110.143811.97616847.horiguchi.kyot...@lab.ntt.co.jp>
> Those two conditions are not orthogonal. Maybe something like
> the following seems more understandable.
> 
> > if (!constfalse)
> > {
> >   /* No constraints on the keys, so, return *all* partitions. */
> >   if (nkeys == 0)
> > return bms_add_range(result, 0, partdesc->nparts - 1);
> > 
> >   result = get_partitions_for_keys(relation, &keys);
> > }
> 
> I'm not sure what is meant to be (formally) done here, but it
> seems to me that OrExpr is assumed to appear only at the top level
> of the clauses. So the following expression (just an example, but
> meaningful expressions in this shape must exist) is perhaps
> wrongly processed here.
> 
> CREATE TABLE p (a int) PARTITION BY RANGE (a);
> CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10);
> CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20);
> 
> SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5);
> 
> get_partitions_for_keys() returns both c1 and c2, and or_clauses
> here still holds (a = 15 OR a = 5), so the function tries to
> *add* partitions for a = 15 and a = 5 separately.

This is working fine. Sorry for the bogus comment.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] path toward faster partition pruning

2017-11-09 Thread Kyotaro HORIGUCHI
Hello, this is the second part of the review.

At Fri, 10 Nov 2017 12:30:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20171110.123000.151902771.horiguchi.kyot...@lab.ntt.co.jp>
> In 0002, bms_add_range has a bit naive-looking loop
> In 0003, 

In 0004,

Regarding the name get_partitions_from_clauses_guts(): it seems to
me that we usually use _internal for the recursive part of a
function. (I have the same comment as David about the comment for
get_partition_from_clause())

About the same function:

Couldn't we take a fast path out when clauses == NIL?

+/* No constraints on the keys, so, return *all* partitions. */
+result = bms_add_range(result, 0, partdesc->nparts - 1);

This allows us to return immediately here. And just above this,

+   if (nkeys > 0 && !constfalse)
+   result = get_partitions_for_keys(relation, &keys);
+   else if (!constfalse)

Those two conditions are not orthogonal. Maybe something like the
following seems more understandable.

> if (!constfalse)
> {
>   /* No constraints on the keys, so, return *all* partitions. */
>   if (nkeys == 0)
> return bms_add_range(result, 0, partdesc->nparts - 1);
> 
>   result = get_partitions_for_keys(relation, &keys);
> }
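
Combined with the NIL fast path suggested above, the head of the
function could then read as follows (a sketch using the names from
the patch):

/* Fast path: no quals at all, so all partitions match */
if (clauses == NIL)
	return bms_add_range(NULL, 0, partdesc->nparts - 1);

nkeys = classify_partition_bounding_keys(relation, clauses,
										 &keys, &constfalse,
										 &or_clauses);
if (!constfalse)
{
	/* No constraints on the keys, so return *all* partitions. */
	if (nkeys == 0)
		return bms_add_range(result, 0, partdesc->nparts - 1);

	result = get_partitions_for_keys(relation, &keys);
}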

I'm not sure what is meant to be (formally) done here, but it
seems to me that OrExpr is assumed to appear only at the top level
of the clauses. So the following expression (just an example, but
meaningful expressions in this shape must exist) is perhaps
wrongly processed here.

CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10);
CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20);

SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5);

get_partitions_for_keys() returns both c1 and c2, and or_clauses
here still holds (a = 15 OR a = 5), so the function tries to
*add* partitions for a = 15 and a = 5 separately.


I'd like to pause here.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] path toward faster partition pruning

2017-11-09 Thread Kyotaro HORIGUCHI
+ * It the operator happens to be '<>', which is never listed
If?


+if (!op_in_opfamily(expr_op, partopfamily))
+{
+  Oid        negator = get_negator(expr_op);
+
+  if (!OidIsValid(negator) ||
+!op_in_opfamily(negator, partopfamily))
+continue;

classify_partition_bounding_keys() checks the same thing by
checking whether the negator's strategy is
BTEqualStrategyNumber. (I'm not sure the operator is guaranteed
to be of btree, though..) Shouldn't they be done in a similar
way? (See the sketch below.)

# In the function, "partkey strategy" and "operator strategy" are
# confusing..
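
For illustration, the strategy-based variant could look like this
(a sketch; it carries the same caveat about whether the operator is
guaranteed to be of btree):

/* Sketch: treat the clause as usable only if the negator is the
 * equality member of the partitioning opfamily; the strategy lookup
 * returns 0 for non-members, so no separate membership test needed. */
Oid		negator = get_negator(expr_op);

if (!OidIsValid(negator) ||
	get_op_opfamily_strategy(negator, partopfamily) != BTEqualStrategyNumber)
	continue;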

+  AttrNumber  attno;

This declaration might be better in a narrower scope.


-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] Restricting maximum keep segments by repslots

2017-11-09 Thread Kyotaro HORIGUCHI
Oops! The previous patch forgot the default case and crashed.

At Wed, 08 Nov 2017 13:14:31 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20171108.131431.170534842.horiguchi.kyot...@lab.ntt.co.jp>
> > I don't think 'distance' is a good metric - that's going to continually
> > change. Why not store the LSN that's available and provide a function
> > that computes this? Or just rely on the lsn - lsn operator?
> 
> It seems reasonable. The 'secured minimum LSN' is common among
> all slots so showing it in the view may look a bit stupid, but I
> don't find another suitable place for it.  In the previous patch
> distance = 0 meant the state that the slot is living but insecure,
> and that information is lost by changing 'distance' to
> 'min_secure_lsn'.
> 
> Thus I changed the 'live' column to 'status' and show that status
> in text representation.
> 
> status: secured | insecured | broken
> 
> So this looks like the following (max_slot_wal_keep_size = 8MB,
> which is half of the default segment size).
> 
> -- slots that required WAL is surely available
> select restart_lsn, status, min_secure_lsn, pg_current_wal_lsn() from 
> pg_replication_slots;
> restart_lsn | status  | min_secure_lsn | pg_current_wal_lsn
> ------------+---------+----------------+--------------------
> 0/1A60      | secured | 0/1A00         | 0/1B42BC78
> 
> -- slots that required WAL is still available but insecured
> restart_lsn | status    | min_secure_lsn | pg_current_wal_lsn
> ------------+-----------+----------------+--------------------
> 0/1A60      | insecured | 0/1C00         | 0/1D76C948
> 
> -- slots that required WAL is lost
> # We should have seen the log 'Some replication slots have lost...'
> 
> restart_lsn | status | min_secure_lsn | pg_current_wal_lsn
> ------------+--------+----------------+--------------------
> 0/1A60      | broken | 0/1C00         | 0/1D76C9F0
> 
> 
> I noticed that I had been dropping the segment fragment of
> max_slot_wal_keep_size in the calculations in the routines. The
> current patch honors the fragment part of max_slot_wal_keep_size.

I changed IsLsnStillAvailable to return meaningful values
regardless of whether max_slot_wal_keep_size is set or not.

# I had been forgetting to bump the version for the last several
# patches. I give this version '4', as the next after the last
# numbered patch.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 109f056e257aba70dddc8d466767ed0a1db371e2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 28 Feb 2017 11:39:48 +0900
Subject: [PATCH 1/2] Add WAL relief vent for replication slots

Adds a capability to limit the number of segments kept by replication
slots by a GUC variable.
---
 src/backend/access/transam/xlog.c | 39 +++
 src/backend/utils/misc/guc.c  | 11 
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h |  1 +
 4 files changed, 52 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a1..cfdae39 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -105,6 +105,7 @@ int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
+int			max_slot_wal_keep_size_mb = 0;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -9432,9 +9433,47 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 	if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
 	{
 		XLogSegNo	slotSegNo;
+		int			slotlimitsegs;
+		uint64		recptroff;
+		uint64		slotlimitbytes;
+		uint64		slotlimitfragment;
+
+		recptroff = XLogSegmentOffset(recptr, wal_segment_size);
+		slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb;
+		slotlimitfragment =	XLogSegmentOffset(slotlimitbytes,
+			  wal_segment_size);
+
+		/* calculate segments to keep by max_slot_wal_keep_size_mb */
+		slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb,
+	   wal_segment_size);
+		/* honor the fragment */
+		if (recptroff < slotlimitfragment)
+			slotlimitsegs++;
 
 		XLByteToSeg(keep, slotSegNo, wal_segment_size);
 
+		/*
+		 * Ignore slots if too many WAL segments are kept.
+		 * max_slot_wal_keep_size is simply added on top of wal_keep_segments.
+		 */
+		if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno)
+		{
+			segno = segno - slotlimitsegs; /* must be positive */
+
+			/*
+			 * Warn only if the checkpoint flushes the required segment.
+			 * We assume here that *logSegNo is the calculated keep location.
+			 */
+			if (slotSeg

Re: [HACKERS] [PATCH] Improve geometric types

2017-11-09 Thread Kyotaro HORIGUCHI
Hello,

> I'd like to put comments on 0001 and 0004 only now:
...

I don't have a comment on 0002.

About 0003:

>  @@ -4487,21 +4486,21 @@ circle_in(PG_FUNCTION_ARGS)
> ...
>   circle->radius = single_decode(s, &s, "circle", str);
> - if (circle->radius < 0)
> + if (float8_lt(circle->radius, 0.0))
>   ereport(ERROR,
>   (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),

float8_lt and its family of functions are provided to unify the
sort order including NaN. NaN is not rejected by this use of
float8_lt, though that NaN ordering is what the function is meant
for. If we instead wanted to check that the value is positive, it
would unexpectedly throw an exception.  (I suppose that NaNs should
be silently ignored rather than stopping a query by throwing an
exception.)

In addition, I don't think it is proper to apply EPSILON(!)
there. The comparison should be strict, regardless of whether
EPSILON is considered elsewhere. (A strict check is sketched
below.)
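
A strict check could be as plain as the following sketch (the error
message text is only illustrative; note that a bare C comparison does
not reject NaN either, consistent with silently ignoring NaN):

/* Sketch: strict comparison, no EPSILON, NaN deliberately passes */
if (circle->radius < 0.0)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
			 errmsg("invalid input syntax for type %s: \"%s\"",
					"circle", str)));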


Similarly, in circle_overlap for example, float8_le is used to
compare the distance and the summed radius.

NaN causes a problem in another place.
  
> PG_RETURN_BOOL(FPle(point_dt(&circle1->center, &circle2->center),
> float8_pl(circle1->radius, circle2->radius)));

If the distance is NaN and the summed radius is not, the
function returns true. I think a reasonable behavior is that an
object containing NaN cannot have any meaningful relationship
with other objects, as floating-point numbers themselves behave
that way. (Any comparison with NaN other than != always returns
false.)

Just using another series of comparison functions that return
false for NaN-containing comparisons is not a solution, since the
meaning of the returned false differs by context, just the same
as in the first problem above. For example, consider the
fictitious functions below:

| bool circle_overlaps()
|   ret = FPle(distance, radius_sum);

This gives correct results, but

| bool circle_not_overlaps()
|   ret = FPgt(distance, radius_sum);

This gives a wrong result for NaN-containing objects.

Perhaps it is enough to explicitly define behaviors for NaN
before comparison.

circle_overlap()
> distance = point_dt();
> radius_sum = float8_pl(...);
>
> /* NaN-containing objects don't overlap any other objects */
> if (isnan(distance) || isnan(radius_sum))
>  PG_RETURN_BOOL(false);
>
> /* NaN ordering of FPle() doesn't get into mischief here */
> PG_RETURN_BOOL(FPle(distance, radius_sum));

(End of the comments on 0003)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
 





Re: [HACKERS] Restricting maximum keep segments by repslots

2017-11-07 Thread Kyotaro HORIGUCHI
Hello,

At Mon, 6 Nov 2017 05:20:50 -0800, Andres Freund  wrote in 
<20171106132050.6apzynxrqrzgh...@alap3.anarazel.de>
> Hi,
> 
> On 2017-10-31 18:43:10 +0900, Kyotaro HORIGUCHI wrote:
> >   - distance:
> > how many bytes LSN can advance before the margin defined by
> > max_slot_wal_keep_size (and wal_keep_segments) is exhasuted,
> > or how many bytes this slot have lost xlog from restart_lsn.
> 
> I don't think 'distance' is a good metric - that's going to continually
> change. Why not store the LSN that's available and provide a function
> that computes this? Or just rely on the lsn - lsn operator?

It seems reasonable. The 'secured minimum LSN' is common among
all slots so showing it in the view may look a bit stupid, but I
don't find another suitable place for it.  In the previous patch
distance = 0 meant the state that the slot is living but insecure,
and that information is lost by changing 'distance' to
'min_secure_lsn'.

Thus I changed the 'live' column to 'status' and show that status
in text representation.

status: secured | insecured | broken

So this looks like the following (max_slot_wal_keep_size = 8MB,
which is half of the default segment size).

-- slots that required WAL is surely available
select restart_lsn, status, min_secure_lsn, pg_current_wal_lsn() from 
pg_replication_slots;
restart_lsn | status  | min_secure_lsn | pg_current_wal_lsn
------------+---------+----------------+--------------------
0/1A60      | secured | 0/1A00         | 0/1B42BC78

-- slots that required WAL is still available but insecured
restart_lsn | status    | min_secure_lsn | pg_current_wal_lsn
------------+-----------+----------------+--------------------
0/1A60      | insecured | 0/1C00         | 0/1D76C948

-- slots that required WAL is lost
# We should have seen the log 'Some replication slots have lost...'

restart_lsn | status | min_secure_lsn | pg_current_wal_lsn
------------+--------+----------------+--------------------
0/1A60      | broken | 0/1C00         | 0/1D76C9F0


I noticed that I had been dropping the segment fragment of
max_slot_wal_keep_size in the calculations in the routines. The
current patch honors the fragment part of max_slot_wal_keep_size.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 109f056e257aba70dddc8d466767ed0a1db371e2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 28 Feb 2017 11:39:48 +0900
Subject: [PATCH 1/2] Add WAL relief vent for replication slots

Adds a capability to limit the number of segments kept by replication
slots by a GUC variable.
---
 src/backend/access/transam/xlog.c | 39 +++
 src/backend/utils/misc/guc.c  | 11 
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h |  1 +
 4 files changed, 52 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a1..cfdae39 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -105,6 +105,7 @@ int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
+int			max_slot_wal_keep_size_mb = 0;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -9432,9 +9433,47 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 	if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
 	{
 		XLogSegNo	slotSegNo;
+		int			slotlimitsegs;
+		uint64		recptroff;
+		uint64		slotlimitbytes;
+		uint64		slotlimitfragment;
+
+		recptroff = XLogSegmentOffset(recptr, wal_segment_size);
+		slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb;
+		slotlimitfragment =	XLogSegmentOffset(slotlimitbytes,
+			  wal_segment_size);
+
+		/* calculate segments to keep by max_slot_wal_keep_size_mb */
+		slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb,
+	   wal_segment_size);
+		/* honor the fragment */
+		if (recptroff < slotlimitfragment)
+			slotlimitsegs++;
 
 		XLByteToSeg(keep, slotSegNo, wal_segment_size);
 
+		/*
+		 * Ignore slots if too many WAL segments are kept.
+		 * max_slot_wal_keep_size is simply added on top of wal_keep_segments.
+		 */
+		if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno)
+		{
+			segno = segno - slotlimitsegs; /* must be positive */
+
+			/*
+			 * Warn only if the checkpoint flushes the required segment.
+			 * We assume here that *logSegNo is the calculated keep location.
+			 */
+			if (slotSegNo < *logSegNo)
+ereport(WARNING,
+	(errmsg ("restart LSN of replication slots is ignored by checkpoint"),
+	 errdetail("Some replication slots have lost required WAL segments to continue by up to %ld segments.",

Re: [HACKERS] [PATCH] Improve geometric types

2017-11-07 Thread Kyotaro HORIGUCHI
Hello, thanks for the new patch.

0004 failed to apply on top of the underlying patches.

At Sun, 5 Nov 2017 15:54:19 +0100, Emre Hasegeli  wrote in 

> > I am not sure how useful NaNs are in geometric types context, but we
> > allow them, so inconsistent hypot() would be a problem.  I will change
> > my patches to keep pg_hypot().
> 
> New versions of the patches are attached with 2 additional ones.  The
> new versions leave pg_hypot() in place.  One of the new patches
> improves the test coverage.  The line coverage of geo_ops.c increases
> from 55% to 81%.  The other one fixes -0 values to 0 on float
> operators.  I am not sure about performance implication of this, so
> kept it separate.  It may be a better idea to check this only on the
> platforms that has tendency to produce -0.
> 
> While working on the tests, I found some unreachable code and removed
> it.  I also found that lseg ## lseg operator returning wrong results.
> It is defined as "closest point to first segment on the second
> segment", but:
> 
> > # select '[(1,2),(3,4)]'::lseg ## '[(0,0),(6,6)]'::lseg;
> >  ?column?
> > --
> >  (1,2)
> > (1 row)
> 
> I appended the fix to the patches.  This is also effecting lseg ## box 
> operator.

Mmm.. It returns (1.5, 1.5) with the 0004 patch. It is surely a
point on the second operand, but I'm not sure it's right that the
operator returns one specific point for two parallel segments.

> I also changed the recently band-aided point ## lseg operator to
> return the point instead of NULL when it cannot find the correct
> result, to avoid crashing the operators that depend on this one.

I'd like to put comments on 0001 and 0004 only now:

 - Adding [LR]DELIM_L looks good but they should be described in
   the comment just above.

 - Renaming float8_slope to slope seems no problem.

 - I'm not sure the change of box_construct is good, but currently
   all callers fit the new interface (giving two points, not
   four coordinates).

 - close_lseg seems to be forgetting the case where the two given
   segments cross (and the parallel case). For example,

   select '(-8,-8),(1,1)'::lseg ## '(-10,0),(2,0)'::lseg;

   is expected to return (0,0), which is the crossing point of
   the two segments, but it returns (1,0) (and returned (1,1)
   before the patch), which is the point on l2 closest to the
   end of l1 that is nearer to l2.

   As mentioned above, it is difficult to decide what the proper
   result for parallel segments is. I suppose that the following
   operation should return an invalid value as a point.

   select '(-1,0),(1,0)'::lseg ## '(-1,1),(1,1)'::lseg;

   close_lseg does the following operations in the else case of
   'if (float8_...)'. If I'm not missing something, the result of
   the following operation is always the closer end of l2. In
   other words it seems a waste of cycles (see the sketch after
   this list).
   
   | point = DirectFunctionCall2(close_ps,
   | PointPGetDatum(&l2->p[closer_end2]),
   | LsegPGetDatum(l1));
   | return DirectFunctionCall2(close_ps, point, LsegPGetDatum(l2));


- make_bound_box operates directly on poly->boundbox. I'm
  afraid that such coding hinders compilers from using registers.

  This is a bit apart from this patch, but it would be better if we
  could eliminate internal calls made via DirectFunctionCall.
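
Here is the sketch referred to above for the close_lseg else
branch. It assumes the observation that the double projection
always yields l2's closer end; the palloc'd copy is because we must
not return a pointer into an argument.

else
{
	/* Sketch only: collapse the two close_ps calls */
	Point	   *result = (Point *) palloc(sizeof(Point));

	*result = l2->p[closer_end2];
	PG_RETURN_POINT_P(result);
}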

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2017-11-06 Thread Kyotaro HORIGUCHI
Thank you for the new patch.

- The latest patch is missing xpath_parser.h, at least since
  ns-3. The one from the first (not-numbered) version was still
  usable.

- c29c578 conflicts on doc/src/sgml/func.sgml


At Sun, 15 Oct 2017 12:06:11 +0200, Pavel Stehule  
wrote in 
> 2017-10-02 12:22 GMT+02:00 Kyotaro HORIGUCHI <
> horiguchi.kyot...@lab.ntt.co.jp>:
> 
> > Hi, thanks for the new patch.
> >
> > # The patch is missing xpath_parser.h. That of the first patch was usable.
> >
> > At Thu, 28 Sep 2017 07:59:41 +0200, Pavel Stehule 
> > wrote in  > mail.gmail.com>
> > > Hi
> > >
> > > now xpath and xpath_exists supports default namespace too
> >
> > At Wed, 27 Sep 2017 22:41:52 +0200, Pavel Stehule 
> > wrote in  > gmail.com>
> > > > 1. Uniformity among simliar features
> > > >
> > > >   As mentioned in the proposal, but it is lack of uniformity that
> > > >   the xpath transformer is applied only to xmltable and not for
> > > >   other xpath related functions.
> > > >
> > >
> > > I have to fix the XPath function. The SQL/XML function Xmlexists doesn't
> > > support namespaces.
> >
> > Sorry, I forgot to care about that. (And the definition of
> > namespace array is of course fabricated by me). I'd like to leave
> > this to committers.  Anyway it is working but the syntax (or
> > whether it is acceptable) is still arguable.
> >
> > SELECT xpath('/a/text()', '<a xmlns="http://example.com">test</a>',
> >              ARRAY[ARRAY['', 'http://example.com']]);
> > |  xpath
> > | --------
> > |  {test}
> > | (1 row)
> >
> >
> > The internal name is properly rejected, but the current internal
> > name (pgdefnamespace.pgsqlxml.internal) seems a bit too long. We
> > are reserving some short names and rejecting them as
> > user-defined. Doesn't just 'pgsqlxml' work?
> >
> >
> > The default namespace now correctly gets applied to bare
> > attribute names.
> >
> > > updated doc,
> > > fixed all variants of expected result test file
> >
> > Sorry for one by one comment but I found another misbehavior.
> >
> > create table t1 (id int, doc xml);
> > insert into t1
> >    values
> >    (5, '<rows xmlns="http://x.y"><row><a hoge="haha">50</a></row></rows>');
> > select x.* from t1, xmltable(XMLNAMESPACES('http://x.y' AS x),
> > '/x:rows/x:row' passing t1.doc columns data int PATH
> > 'child::x:a[1][attribute::hoge="haha"]') as x;
> > |  data
> > | ------
> > |    50
> >
> > but the following fails.
> >
> > select x.* from t1, xmltable(XMLNAMESPACES(DEFAULT 'http://x.y'),
> > '/rows/row' passing t1.doc columns data int PATH
> > 'child::a[1][attribute::hoge="haha"]') as x;
> > |  data
> > | ------
> > |
> > | (1 row)
> >
> > Perhaps child::a is not prefixed by the transformation.
> >
> 
> the problem was in unwanted attribute modification. The parser didn't
> detect "attribute::hoge" as an attribute. The updated parser does. I
> reduced duplicated code there as well.

It worked as expected. But the comparison against "attribute" is
missing a length check (t1.length == 9), so the following expression
wrongly passes:

child::a[1][attributeabcdefg::hoge="haha"]

It is confusing that is_qual_name becomes true when t2 is not a
"qual name", and the way the code treats a double colon is hard to
understand.

What it essentially does is insert the default namespace before an
unqualified non-attribute name. I believe we can easily look ahead
to detect a double colon, which would make things simpler. Could
you consider something like the attached patch? (It applies on top
of the ns-4 patch.)
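
To sketch the look-ahead idea (the helper name is hypothetical and
the attached patch may do it differently): after scanning an
unqualified name, peek past any whitespace; if "::" follows, the
token is an axis name such as "attribute" and the default namespace
must not be inserted.

static bool
name_is_axis(const char *str, int len)
{
	const char *p = str + len;

	/* skip whitespace between the name and a possible "::" */
	while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
		p++;
	return (p[0] == ':' && p[1] == ':');
}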

> > XPath might be complex enough that it's worth switching to a
> > yacc/lex-based transformer that is formally verifiable and won't
> > need a bunch of cryptic tests that ultimately cannot prove
> > completeness. synchronous_standby_names is far simpler than XPath
> > but uses a yacc/lex parser.
> >
> >
> > Anyway the following is nitpicking of the current xpath_parser.c.
> >
> > - NODENAME_FIRSTCHAR allows '-' as the first char but it is
> >   excluded from NameStartChar
> >   (https://www.w3.org/TR/REC-xml/#NT-NameStartChar).
> >   I think characters with the high bit set are okay.
> >   Also IS_NODENAME_CHAR should be changed.
> >
> 
> fixed
> 
> 
> > - NODENAME_FIRSTCHAR and IS_NO

Re: [HACKERS] More stats about skipped vacuums

2017-10-31 Thread Kyotaro HORIGUCHI
This is just a repost as a (true) new thread.

At Mon, 30 Oct 2017 20:57:50 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20171030.205750.246076862.horiguchi.kyot...@lab.ntt.co.jp>
> At Fri, 20 Oct 2017 19:15:16 +0900, Masahiko Sawada  
> wrote in 
> > >   n_mod_since_analyze  | 2
> > > + vacuum_required  | true
> > > + last_vacuum_oldest_xid   | 8023
> > > + last_vacuum_left_to_truncate | 5123
> > > + last_vacuum_truncated| 387
> > >   last_vacuum  | 2017-10-10 17:21:54.380805+09
> > >   last_autovacuum  | 2017-10-10 17:21:54.380805+09
> > > + last_autovacuum_status   | Killed by lock conflict
> > > ...
> > >   autovacuum_count | 128
> > > + incomplete_autovacuum_count  | 53
> > >
> > > # The last one might be needless..
> > 
> > I'm not sure that the above informations will help for users or DBA
> > but personally I sometimes want to have the number of index scans of
> > the last autovacuum in the pg_stat_user_tables view. That value
> > indicates how efficiently vacuums performed and would be a signal to
> > increase the setting of autovacuum_work_mem for user.
> 
> Btree and all existing index AMs (except brin) seem to visit all
> pages in every index scan, so it would be valuable. Instead, the
> number of index pages visited during a table scan might be
> usable. It is more relevant to performance than the number of
> scans; on the other hand, it is a bit difficult to get something
> meaningful from the number at a glance. I'll show the number of
> scans in the first cut.
> 
> > > Where the "Killed by lock conflict" is one of the followings.
> > >
> > >- Completed
> > >- Truncation skipped
> > >- Partially truncated
> > >- Skipped
> > >- Killed by lock conflict
> > >
> > > This seems enough to find the cause of a table bloat. The same
> > > discussion could be applied to analyze but it might be the
> > > another issue.
> > >
> > > There may be a better way to indicate the vacuum soundness. Any
> > > opinions and suggestions are welcome.
> > >
> > > I'm going to make a patch to do the 'formal' one for the time
> > > being.

Done with small modifications. In the attached patch
pg_stat_all_tables has the following new columns. Documentation
is not provided at this stage.

-
  n_mod_since_analyze | 0
+ vacuum_required | not required
  last_vacuum | 
  last_autovacuum | 2017-10-30 18:51:32.060551+09
  last_analyze| 
  last_autoanalyze| 2017-10-30 18:48:33.414711+09
  vacuum_count| 0
+ last_vacuum_truncated   | 0
+ last_vacuum_untruncated | 0
+ last_vacuum_index_scans | 0
+ last_vacuum_oldest_xmin | 2134
+ last_vacuum_status  | aggressive vacuum completed
+ autovacuum_fail_count   | 0
  autovacuum_count| 5
  analyze_count   | 0
  autoanalyze_count   | 1
-
Where each column shows the following information.

+ vacuum_required | not required

 VACUUM requirement status. Takes the following values.

  - partial
A partial (or normal) vacuum will be performed by the next
autovacuum. The word "partial" is taken from the comment for
vacuum_set_xid_limits.

  - aggressive
Aggressive scan will be performed by the next autovacuum.

  - required
Any type of autovacuum will be performed. The type of scan is
unknown because the view failed to take the required lock on
the table. (AutoVacuumrequirement())

  - not required
Next autovacuum won't perform scan on this relation.

  - not required (lock not acquired)

Autovacuum should be disabled, and the distance to the
freeze limit is not known because the required lock is not
available.

  - close to freeze-limit xid
Shown while autovacuum is disabled. The table is in the
manual vacuum window to avoid anti-wraparound autovacuum.

+ last_vacuum_truncated | 0

  The number of truncated pages in the last completed
  (auto)vacuum.

+ last_vacuum_untruncated | 0
  The number of pages the last completed (auto)vacuum tried to
  truncate but could not for some reason.

+ last_vacuum_index_scans | 0
  The number of index scans performed in the last completed
  (auto)vacuum.

+ last_vacuum_oldest_xmin | 2134
  The oldest xmin used in the last completed (auto)vacuum.

+ last_vacuum_status  | aggressive vacuum completed

  The finish status of the last vacuum. Takes the following
  values. (pg_stat_get_last_vacuum_status())

   - completed
 The last partial (auto)vacuum is completed.

   - vacuum full completed
 The last VACUUM FULL is completed.

Re: [HACKERS] show "aggressive" or not in autovacuum logs

2017-10-31 Thread Kyotaro HORIGUCHI
At Thu, 26 Oct 2017 12:42:23 +0200, Robert Haas  wrote 
in 
> On Thu, Oct 26, 2017 at 10:18 AM, Kyotaro HORIGUCHI
>  wrote:
> > Thank you. I forgot that point. Changed them so that the messages
> > are detected as msgids.
> 
> Committed, changing "aggressive" to "aggressively" in one place for
> correct English.

Thank you for committing!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Protect syscache from bloating with negative cache entries

2017-10-31 Thread Kyotaro HORIGUCHI
This is a rebased version of the patch.

At Fri, 17 Mar 2017 14:23:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170317.142313.232290068.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 7 Mar 2017 19:23:14 -0800, David Steele  wrote 
> in <3b7b7f90-db46-8c37-c4f7-443330c3a...@pgmasters.net>
> > On 3/3/17 4:54 PM, David Steele wrote:
> > 
> > > On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote:
> > >> Hello, thank you for moving this to the next CF.
> > >>
> > >> At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier
> > >>  wrote in
> > >> 
> > >>> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
> > >>>  wrote:
> > >>>> Six new syscaches in 665d1fa was conflicted and 3-way merge
> > >>>> worked correctly. The new syscaches don't seem to be targets of
> > >>>> this patch.
> > >>> To be honest, I am not completely sure what to think about this patch.
> > >>> Moved to next CF as there is a new version, and no new reviews to make
> > >>> the discussion perhaps move on.
> > >> I'm thinking the following is the status of this topic.
> > >>
> > >> - The patch still applies without conflicts.
> > >>
> > >> - This is not a holistic measure for memory leaks but surely
> > >>   saves some existing cases.
> > >>
> > >> - Shared catcache is another discussion (and won't really
> > >>proposed in a short time due to the issue on locking.)
> > >>
> > >> - As I mentioned, a patch that caps the number of negative
> > >>   entries is available (in a first-created, first-deleted
> > >>   manner), but it has the loose end of how to determine the
> > >>   limit.
> > > While preventing bloat in the syscache is a worthwhile goal, it
> > > appears
> > > there are a number of loose ends here and a new patch has not been
> > > provided.
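
For readers new to the thread: the gist of patch 0001 below is to
purge negative catcache entries by a partial key when the owning
relation is dropped. A hypothetical call site could look like the
following sketch (SysCache[] is actually private to syscache.c, so
the real patch routes this through a wrapper there):

/* On dropping relation 'relid', purge pg_statistic's negative
 * entries that match it on the first key column only. */
ScanKeyData skey;

ScanKeyInit(&skey, Anum_pg_statistic_starelid, BTEqualStrategyNumber,
			F_OIDEQ, ObjectIdGetDatum(relid));
CleanupCatCacheNegEntries(SysCache[STATRELATTINH], &skey);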

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 9f2c81dbc9bc344cafd6995dfc5969d55a8457d9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 28 Aug 2017 11:36:21 +0900
Subject: [PATCH 1/2] Cleanup negative cache of pg_statistic when dropping a
 relation.

Accessing columns that don't have statistics leaves negative entries
in the catcache for pg_statistic, but there's no chance to remove
them. Especially when repeatedly creating and then dropping temporary
tables, this bloats the catcache so much that memory pressure becomes
significant. This patch removes negative entries in STATRELATTINH,
ATTNAME and ATTNUM when the corresponding relation is dropped.
---
 src/backend/utils/cache/catcache.c |  58 ++-
 src/backend/utils/cache/syscache.c | 302 +++--
 src/include/utils/catcache.h   |   3 +
 src/include/utils/syscache.h   |   2 +
 4 files changed, 282 insertions(+), 83 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 95a0742..bd303f3 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -423,10 +423,11 @@ CatCachePrintStats(int code, Datum arg)
 
 		if (cache->cc_ntup == 0 && cache->cc_searches == 0)
 			continue;			/* don't print unused caches */
-		elog(DEBUG2, "catcache %s/%u: %d tup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
+		elog(DEBUG2, "catcache %s/%u: %d tup, %d negtup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
 			 cache->cc_relname,
 			 cache->cc_indexoid,
 			 cache->cc_ntup,
+			 cache->cc_nnegtup,
 			 cache->cc_searches,
 			 cache->cc_hits,
 			 cache->cc_neg_hits,
@@ -495,8 +496,11 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 	 * point into tuple, allocated together with the CatCTup.
 	 */
 	if (ct->negative)
+	{
 		CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
 		 cache->cc_keyno, ct->keys);
+		--cache->cc_nnegtup;
+	}
 
 	pfree(ct);
 
@@ -697,6 +701,51 @@ ResetCatalogCache(CatCache *cache)
 }
 
 /*
+ *		CleanupCatCacheNegEntries
+ *
+ *	Remove negative cache tuples matching a partial key.
+ *
+ */
+void
+CleanupCatCacheNegEntries(CatCache *cache, ScanKeyData *skey)
+{
+	int i;
+
+	/* If this cache has no negative entries, nothing to do */
+	if (cache->cc_nnegtup == 0)
+		return;
+
+	/* searching with a partial key means scanning the whole cache */
+	for (i = 0; i < cache->cc_nbuckets; i++)
+	{
+		dlist_head *bucket = &cache->cc_bucket[i];
+		dlist_mutable_iter iter;
+
+		dlist_foreach_modify(iter, bucket)
+		{
+			const CCFastEqualFN *cc_fastequal = cache->cc_fastequal;

Re: [HACKERS] Restricting maximum keep segments by repslots

2017-10-31 Thread Kyotaro HORIGUCHI
Hello, this is a rebased version.

Along with the rebasing, the meaning of the monitoring values has
changed.

In the previous version, the "live" column mysteriously predicted
whether the necessary segments would be kept or lost by the next
checkpoint, and the "distance" column offered a still more
mysterious value.

In this version the meaning of the two columns has become clear
and informative.

pg_replication_slots
  - live:
true if the slot has not lost necessary segments.

  - distance:
how many bytes the LSN can advance before the margin defined by
max_slot_wal_keep_size (and wal_keep_segments) is exhausted,
or how many bytes of xlog this slot has lost from restart_lsn.

There is a case where live = t and distance = 0: the slot
currently has all the necessary segments but will start to
lose them within at most two checkpoint passes.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 57eaa2b878d30bfcebb093cca0e772fe7a9bff0e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 28 Feb 2017 11:39:48 +0900
Subject: [PATCH 1/2] Add WAL relief vent for replication slots

Adds a capability to limit the number of segments kept by replication
slots by a GUC variable.
---
 src/backend/access/transam/xlog.c | 24 
 src/backend/utils/misc/guc.c  | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h |  1 +
 4 files changed, 37 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a1..f79cefb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -105,6 +105,7 @@ int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
+int			max_slot_wal_keep_size_mb = 0;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -9432,9 +9433,32 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 	if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
 	{
 		XLogSegNo	slotSegNo;
+		int			slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb);
 
 		XLByteToSeg(keep, slotSegNo, wal_segment_size);
 
+		/*
+		 * Ignore slots if too many WAL segments are kept.
+		 * max_slot_wal_keep_size is simply added on top of wal_keep_segments.
+		 */
+		if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno)
+		{
+			segno = segno - slotlimitsegs; /* must be positive */
+
+			/*
+			 * Warn only if the checkpoint flushes the required segment.
+			 * We assume here that *logSegNo is the calculated keep location.
+			 */
+			if (slotSegNo < *logSegNo)
+ereport(WARNING,
+	(errmsg ("restart LSN of replication slots is ignored by checkpoint"),
+	 errdetail("Some replication slots have lost required WAL segments to continue by up to %ld segments.",
+	   (segno < *logSegNo ? segno : *logSegNo) - slotSegNo)));
+
+			/* emergency vent */
+			slotSegNo = segno;
+		}
+
 		if (slotSegNo <= 0)
 			segno = 1;
 		else if (slotSegNo < segno)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65372d7..511023a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2368,6 +2368,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
+			gettext_noop("Sets the maximum size of extra WALs kept by replication slots."),
+		 NULL,
+		 GUC_UNIT_MB
+		},
+		&max_slot_wal_keep_size_mb,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the maximum time to wait for WAL replication."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 368b280..e76c73a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -234,6 +234,7 @@
 #max_wal_senders = 10		# max number of walsender processes
 # (change requires restart)
 #wal_keep_segments = 0		# in logfile segments; 0 disables
+#max_slot_wal_keep_size = 0	# measured in bytes; 0 disables
 #wal_sender_timeout = 60s	# in milliseconds; 0 disables
 
 #max_replication_slots = 10	# max number of replication slots
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0f2b8bd..f0c0255 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -98,6 +98,7 @@ extern int	wal_segment_size;
 extern int	min_wal_size_mb;
 extern int	max_wal_size_mb;
 extern int	wal_keep_segments;
+extern int	max_slot_wal_keep_size_mb;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
 extern int	wal_retrieve_retry_interval;
-- 
2.9.2

>From 37749d46ba97de38e4593f141bb8a82c67fc

Re: [HACKERS] More stats about skipped vacuums

2017-10-30 Thread Kyotaro HORIGUCHI
At Thu, 26 Oct 2017 15:06:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20171026.150630.115694437.horiguchi.kyot...@lab.ntt.co.jp>
> At Fri, 20 Oct 2017 19:15:16 +0900, Masahiko Sawada  
> wrote in 
> > >   n_mod_since_analyze  | 2
> > > + vacuum_required  | true
> > > + last_vacuum_oldest_xid   | 8023
> > > + last_vacuum_left_to_truncate | 5123
> > > + last_vacuum_truncated| 387
> > >   last_vacuum  | 2017-10-10 17:21:54.380805+09
> > >   last_autovacuum  | 2017-10-10 17:21:54.380805+09
> > > + last_autovacuum_status   | Killed by lock conflict
> > > ...
> > >   autovacuum_count | 128
> > > + incomplete_autovacuum_count  | 53
> > >
> > > # The last one might be needless..
> > 
> > I'm not sure that the above informations will help for users or DBA
> > but personally I sometimes want to have the number of index scans of
> > the last autovacuum in the pg_stat_user_tables view. That value
> > indicates how efficiently vacuums performed and would be a signal to
> > increase the setting of autovacuum_work_mem for user.
> 
> Btree and all existing index AMs (except brin) seem to visit all
> pages in every index scan, so it would be valuable. Instead, the
> number of index pages visited during a table scan might be
> usable. It is more relevant to performance than the number of
> scans; on the other hand, it is a bit difficult to get something
> meaningful from the number at a glance. I'll show the number of
> scans in the first cut.
> 
> > > Where the "Killed by lock conflict" is one of the followings.
> > >
> > >- Completed
> > >- Truncation skipped
> > >- Partially truncated
> > >- Skipped
> > >- Killed by lock conflict
> > >
> > > This seems enough to find the cause of a table bloat. The same
> > > discussion could be applied to analyze but it might be the
> > > another issue.
> > >
> > > There may be a better way to indicate the vacuum soundness. Any
> > > opinions and suggestions are welcome.
> > >
> > > I'm going to make a patch to do the 'formal' one for the time
> > > being.

Done with small modifications. In the attached patch
pg_stat_all_tables has the following new columns. Documentation
is not provided at this stage.

-
  n_mod_since_analyze | 0
+ vacuum_required | not required
  last_vacuum | 
  last_autovacuum | 2017-10-30 18:51:32.060551+09
  last_analyze| 
  last_autoanalyze| 2017-10-30 18:48:33.414711+09
  vacuum_count| 0
+ last_vacuum_truncated   | 0
+ last_vacuum_untruncated | 0
+ last_vacuum_index_scans | 0
+ last_vacuum_oldest_xmin | 2134
+ last_vacuum_status  | aggressive vacuum completed
+ autovacuum_fail_count   | 0
  autovacuum_count| 5
  analyze_count   | 0
  autoanalyze_count   | 1
-
Where each column shows the following information.

+ vacuum_required | not required

 VACUUM requirement status. Takes the following values.

  - partial
A partial (or normal) vacuum will be performed by the next
autovacuum. The word "partial" is taken from the comment for
vacuum_set_xid_limits.

  - aggressive
Aggressive scan will be performed by the next autovacuum.

  - required
Any type of autovacuum will be performed. The type of scan is
unknown because the view failed to take the required lock on
the table. (AutoVacuumrequirement())

  - not required
Next autovacuum won't perform scan on this relation.

  - not required (lock not acquired)

Autovacuum should be disabled, and the distance to the
freeze limit is not known because the required lock is not
available.

  - close to freeze-limit xid
Shown while autovacuum is disabled. The table is in the
manual vacuum window to avoid anti-wraparound autovacuum.

+ last_vacuum_truncated | 0

  The number of truncated pages in the last completed
  (auto)vacuum.

+ last_vacuum_untruncated | 0
  The number of pages the last completed (auto)vacuum tried to
  truncate but could not for some reason.

+ last_vacuum_index_scans | 0
  The number of index scans performed in the last completed
  (auto)vacuum.

+ last_vacuum_oldest_xmin | 2134
  The oldest xmin used in the last completed (auto)vacuum.

+ last_vacuum_status  | aggressive vacuum completed

  The finish status of the last vacuum. Takes the following
  values. (pg_stat_get_last_vacuum_status())

   - completed
 The last partial (auto)vacuum is completed.

   - vacuum full completed
 The last VACUUM FULL is completed.

   - aggressive vacuum completed

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-10-26 Thread Kyotaro HORIGUCHI
Hello. Thank you for looking at this.

At Mon, 16 Oct 2017 17:58:03 +0900, Michael Paquier  
wrote in 
> On Thu, Sep 7, 2017 at 12:33 PM, Kyotaro HORIGUCHI
>  wrote:
> > At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund  wrote 
> > in <20170906192353.ufp2dq7wm5fd6...@alap3.anarazel.de>
> >> I'm not following. All we need to use is the beginning of the relevant
> >> records, that's easy enough to keep track of. We don't need to read the
> >> WAL or anything.
> >
> > The beginning is already tracked and nothing more to do.
> 
> I have finally allocated time to look at your newly-proposed patch,
> sorry for the time it took. Patch 0002 forgot to include sys/stat.h to
> allow the debugging tool to compile :)
> 
> > The first *problem* was WaitForWALToBecomeAvailable requests the
> > beginning of a record, which is not on the page the function has
> > been told to fetch. Still tliRecPtr is required to determine the
> > TLI to request, it should request RecPtr to be streamed.
> 
> [...]
> 
> > The rest to do is let XLogPageRead retry other sources
> > immediately. To do this I made ValidXLogPageHeader@xlogreader.c
> > public (and renamed to XLogReaderValidatePageHeader).
> >
> > The patch attached fixes the problem and passes recovery
> > tests. However, the test for this problem is not added. It needs
> > to go to the last page in a segment then put a record continues
> > to the next segment, then kill the standby after receiving the
> > previous segment but before receiving the whole record.
> 
> +typedef struct XLogPageHeaderData *XLogPageHeader;
> [...]
> +/* Validate a page */
> +extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
> +   XLogRecPtr recptr, XLogPageHeader 
> hdr);
> Instead of exposing XLogPageHeaderData, I would recommend just using
> char* and remove this declaration. The comment on top of
> XLogReaderValidatePageHeader needs to make clear what caller should
> provide.

Seems reasonable. I added several lines to the comment for the
function.

> +if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> +  (XLogPageHeader) readBuf))
> +goto next_record_is_invalid;
> +
> [...]
> -ptr = tliRecPtr;
> +ptr = RecPtr;
>  tli = tliOfPointInHistory(tliRecPtr, 
> expectedTLEs);
> 
>  if (curFileTLI > 0 && tli < curFileTLI)
> The core of the patch is here (the patch has no comment so it is hard
> to understand what's the point of what is being done), and if I

Hmm, sorry. Added a brief comment there.
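
For reference, the point of the quoted hunk is simply this
(simplified sketch):

/* Stream from the page currently being read (RecPtr), while the
 * record's start (tliRecPtr) still determines which TLI to request. */
ptr = RecPtr;		/* was: tliRecPtr */
tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);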

> understand that correctly, you allow the receiver to fetch the
> portions of a record spawned across multiple segments from different
> sources, and I am not sure that we'd want to break that promise.

We allow consecutive records at a segment boundary to come from
different sources within the same series of xlog records. A
continuation record never spans two TLIs, but I might be
missing something here. (I found that an error message shows an
incorrect record pointer; the error message still seems
useful.)

> Shouldn't we instead have the receiver side track the beginning of the
> record and send that position for the physical slot's restart_lsn?

The largest obstacle to doing that is that walreceiver is not at
all concerned with record internals; in other words, it doesn't
know what a record is. Teaching it that would introduce much
complexity, and the complexity would slow down the walreceiver.

In addition, this "problem" also occurs on replication
without a slot. The latest patch helps that case as well.

> This way the receiver would retain WAL segments from the real
> beginning of a record. restart_lsn for replication slots is set when
> processing the standby message in ProcessStandbyReplyMessage() using
> now the flush LSN, so a more correct value should be provided using
> that. Andres, what's your take on the matter?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From b23c1d69ad86fcbb992cb21c604f587d82441001 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 7 Sep 2017 12:14:55 +0900
Subject: [PATCH 1/2] Allow switching WAL source in mid-record.

The current recovery machinery assumes that the whole of a record is
available from a single source. This prevents a standby from restarting
under certain conditions. This patch allows switching the source while
reading a series of continuation records.
---
 src/backend/access/transam/xlog.c   | 14 --
 src/backend/access/transam/xlogreader.c | 28 ++--
 src/

Re: [HACKERS] show "aggressive" or not in autovacuum logs

2017-10-26 Thread Kyotaro HORIGUCHI
Thank you for the comment.

(Thank you Sawada-san for reviewng, too.)

At Thu, 19 Oct 2017 13:03:38 +0200, Alvaro Herrera  
wrote in <20171019110338.awwzc3y674co7wof@alvherre.pgsql>
> Kyotaro HORIGUCHI wrote:
> 
> > How about the followings?
> > 
> > "automatic [agressive ]vacuum of table \"%s..."
> > "[aggressive ]vacuuming \"%s..."
> 
> That form of log message seems acceptable to me (first one is missing a 'g').
> 
> In any case, please do not construct the sentence with %s expanding the
> word, because that is going to cause a problem for translations.  Use an
> 'if' test with two full sentences instead.  Yes, as Robert said, it's
> going to add more strings, but that's okay.

Thank you. I forgot that point. Changed them so that the messages
are detected as msgids.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 7252bfc0fafcf9d4d38067913325cf82c88d1e1e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 28 Aug 2017 13:12:25 +0900
Subject: [PATCH] Show "aggressive" or not in vacuum messages

VACUUM VERBOSE or autovacuum emits a log message with "%u skipped
frozen", but we cannot tell whether the vacuum was a non-freezing
(non-aggressive) vacuum or a freezing (aggressive) vacuum that had no
tuples to freeze. This patch adds an indication of aggressive
(auto)vacuum to the log messages and the VACUUM VERBOSE message.
---
 src/backend/commands/vacuumlazy.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 30b1c08..1080fcf 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -355,6 +355,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	   params->log_min_duration))
 		{
 			StringInfoData buf;
+			char *msgfmt;
 
 			TimestampDifference(starttime, endtime, &secs, &usecs);
 
@@ -373,7 +374,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 			 * emitting individual parts of the message when not applicable.
 			 */
 			initStringInfo(&buf);
-			appendStringInfo(&buf, _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n"),
+			if (aggressive)
+msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
+			else
+msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
+			appendStringInfo(&buf, msgfmt,
 			 get_database_name(MyDatabaseId),
 			 get_namespace_name(RelationGetNamespace(onerel)),
 			 RelationGetRelationName(onerel),
@@ -486,10 +491,16 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	pg_rusage_init(&ru0);
 
 	relname = RelationGetRelationName(onerel);
-	ereport(elevel,
-			(errmsg("vacuuming \"%s.%s\"",
-	get_namespace_name(RelationGetNamespace(onerel)),
-	relname)));
+	if (aggressive)
+		ereport(elevel,
+(errmsg("aggressive vacuuming \"%s.%s\"",
+		get_namespace_name(RelationGetNamespace(onerel)),
+		relname)));
+	else
+		ereport(elevel,
+(errmsg("vacuuming \"%s.%s\"",
+		get_namespace_name(RelationGetNamespace(onerel)),
+		relname)));
 
 	empty_pages = vacuumed_pages = 0;
 	num_tuples = tups_vacuumed = nkeep = nunused = 0;
-- 
2.9.2




Re: [HACKERS] More stats about skipped vacuums

2017-10-25 Thread Kyotaro HORIGUCHI
Mmm. I failed to create a brand-new thread.

Thank you for the comment.

At Fri, 20 Oct 2017 19:15:16 +0900, Masahiko Sawada  
wrote in 
> On Tue, Oct 10, 2017 at 7:26 PM, Kyotaro HORIGUCHI
>  wrote:
> > Hello.
> > Once in a while I am asked about table bloat. In most cases the
> > cause is long-lasting transactions, and in some cases vacuum
> > cancellation. Whatever the case, users don't have enough clues
> > as to why they have bloated tables.
> >
> > At the top of the annoyances list for users would be that they
> > cannot know whether autovacuum decided that a table needs vacuum
> > or not. I suppose that it could be shown in pg_stat_*_tables.
> >
> >   n_mod_since_analyze | 2
> > + vacuum_required | true
> >   last_vacuum | 2017-10-10 17:21:54.380805+09
> >
> > If vacuum_required remains true for a certain time, it means that
> > vacuuming stopped halfway or someone is killing it repeatedly.
> > That status could be shown in the same view.
> 
> Because the table statistics are updated at the end of the vacuum, I
> think the autovacuum will process the table in the next cycle if it
> stopped halfway or was killed. So you mean that vacuum_required is for
> users who want to reclaim garbage without waiting for an autovacuum
> retry?

It could be used for that purpose, and also just for knowing that a
table has been left needing a vacuum for a long time, which would be
a trigger for users to take measures to deal with the situation.

> >   n_mod_since_analyze | 2
> > + vacuum_required | true
> >   last_vacuum | 2017-10-10 17:21:54.380805+09
> >   last_autovacuum | 2017-10-10 17:21:54.380805+09
> > + last_autovacuum_status  | Killed by lock conflict
> >
> > Where the "Killed by lock conflict" would be one of the followings.
> >
> >   - Completed (oldest xmin = 8023)
> >   - May not be fully truncated (yielded at 1324 of 6447 expected)
> >   - Truncation skipped
> >   - Skipped by lock failure
> >   - Killed by lock conflict
> >
> >
> > If we want more formal expression, we can show the values in the
> > following shape. And adding some more values could be useful.
> >
> >   n_mod_since_analyze  | 2
> > + vacuum_required  | true
> > + last_vacuum_oldest_xid   | 8023
> > + last_vacuum_left_to_truncate | 5123
> > + last_vacuum_truncated| 387
> >   last_vacuum  | 2017-10-10 17:21:54.380805+09
> >   last_autovacuum  | 2017-10-10 17:21:54.380805+09
> > + last_autovacuum_status   | Killed by lock conflict
> > ...
> >   autovacuum_count | 128
> > + incomplete_autovacuum_count  | 53
> >
> > # The last one might be needless..
> 
> I'm not sure that the above informations will help for users or DBA
> but personally I sometimes want to have the number of index scans of
> the last autovacuum in the pg_stat_user_tables view. That value
> indicates how efficiently vacuums performed and would be a signal to
> increase the setting of autovacuum_work_mem for user.

Btree and all existing index AMs (except brin) seem to visit all
pages in every index scan, so it would be valuable. Instead, the
number of index pages visited during a table scan might be
usable. It is more relevant to performance than the number of
scans; on the other hand, it is a bit difficult to get something
meaningful from the number at a glance. I'll show the number of
scans in the first cut.

> > Where the "Killed by lock conflict" is one of the followings.
> >
> >- Completed
> >- Truncation skipped
> >- Partially truncated
> >- Skipped
> >    - Killed by lock conflict
> >
> > This seems enough to find the cause of a table bloat. The same
> > discussion could be applied to analyze but it might be the
> > another issue.
> >
> > There may be a better way to indicate the vacuum soundness. Any
> > opinions and suggestions are welcome.
> >
> > I'm going to make a patch to do the 'formal' one for the time
> > being.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous execution

2017-10-20 Thread Kyotaro HORIGUCHI
Hello.

A fully-asynchronous executor requires every node to be stateful and
suspendable at the point where it requests the next tuples from the
nodes underneath. I tried a pure push-based executor but failed.

After the miserable patch upthread, I finally managed to make
executor nodes suspendable using computed jumps and got rid of the
executor's recursive calls. But it runs about 10x slower for the
simple SeqScan case (pgbench ran with 9% degradation). That doesn't
seem recoverable by handy improvements, so I gave it up.

Then I returned to single-level asynchrony, in other words, the
simple case with async-aware nodes just above async-capable nodes.
The motivation for the framework in the previous patch was that we
had degradation on the sync (or normal) code paths from polluting
ExecProcNode with async stuff; as Tom suggested, the
node->ExecProcNode trick can isolate the async code path.
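
For readers unfamiliar with that trick, here is a minimal sketch of
the idea (illustrative declarations only, not the actual executor
code): dispatching through a per-node function pointer lets nodes
that need async handling install a different entry point, while
plain nodes keep an unchanged fast path.

typedef struct TupleTableSlot TupleTableSlot;
typedef struct PlanState PlanState;

struct PlanState
{
	/* per-node entry point; set to a sync or an async variant at init */
	TupleTableSlot *(*ExecProcNode) (struct PlanState *pstate);
};

static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
	/* sync nodes pay only this indirection; no async checks here */
	return node->ExecProcNode(node);
}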

The attached PoC patch theoretically has no impact on the normal
code paths and just brings gains in async cases. (The additional
members in PlanState caused a degradation that seemingly comes from
alignment, though.)

But I haven't had sufficiently stable results from the performance
tests. Different builds from the same source code give apparently
different results...

Anyway, I'll show the best of several runs here.

                           original(ms)  patched(ms)  gain(%)
A: simple table scan     :     9714.70      9656.73      0.6
B: local partitioning    :     4119.44      4131.10     -0.3
C: single remote table   :     9484.86      9141.89      3.7
D: sharding (single con) :     7114.34      6751.21      5.1
E: sharding (multi con)  :     7166.56      1827.93     74.5

A and B are degradation checks, which are expected to show no
degradation. C is the gain just from postgres_fdw's command
pre-sending on a remote table. D is the gain from sharding over a
single connection. The number of partitions/shards is 4. E is the
gain using a dedicated connection per shard.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From fc424c16e124934581a184fcadaed1e05f7672c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resowner field to WaitEventSet and allows
the creator of a WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 ++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 68 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 754154b..d459f32 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4eb6e83..e6fc3dd 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. 
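
To illustrate the patched interface (a hedged sketch, not part of the
patch itself): a caller that wants the set torn down automatically on
error cleanup passes its resource owner, while callers such as
FeBeWaitSet above keep passing NULL to retain the old unmanaged
behavior.

	/* a WaitEventSet released automatically through the resource owner */
	WaitEventSet *wes = CreateWaitEventSet(CurrentMemoryContext,
										   CurrentResourceOwner, 1);

	AddWaitEventToSet(wes, WL_LATCH_SET, -1, MyLatch, NULL);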

[HACKERS] More stats about skipped vacuums

2017-10-10 Thread Kyotaro HORIGUCHI
Hello.
Once in a while I am asked about table bloat. In most cases the
cause is long-lasting transactions, and in some cases vacuum
cancellation. Whatever the case, users don't have enough clues as
to why they have bloated tables.

At the top of the annoyances list for users would be that they
cannot know whether autovacuum decided that a table needs vacuuming
or not. I suppose that it could be shown in pg_stat_*_tables.

  n_mod_since_analyze | 2
+ vacuum_required | true
  last_vacuum | 2017-10-10 17:21:54.380805+09

If vacuum_required remains true for a certain time, it means that
vacuuming stopped halfway or someone is killing it repeatedly.
That status could be shown in the same view.

  n_mod_since_analyze | 2
+ vacuum_required | true
  last_vacuum | 2017-10-10 17:21:54.380805+09
  last_autovacuum | 2017-10-10 17:21:54.380805+09
+ last_autovacuum_status  | Killed by lock conflict

Where the "Killed by lock conflict" would be one of the following:

  - Completed (oldest xmin = 8023)
  - May not be fully truncated (yielded at 1324 of 6447 expected)
  - Truncation skipped
  - Skipped by lock failure
  - Killed by lock conflict


If we want a more formal representation, we can show the values in
the following shape. Adding some more values could also be useful.

  n_mod_since_analyze  | 2
+ vacuum_required  | true
+ last_vacuum_oldest_xid   | 8023
+ last_vacuum_left_to_truncate | 5123
+ last_vacuum_truncated| 387
  last_vacuum  | 2017-10-10 17:21:54.380805+09
  last_autovacuum  | 2017-10-10 17:21:54.380805+09
+ last_autovacuum_status   | Killed by lock conflict
...
  autovacuum_count | 128
+ incomplete_autovacuum_count  | 53

# The last one might be needless..

Where the "Killed by lock conflict" is one of the following:

   - Completed
   - Truncation skipped
   - Partially truncated
   - Skipped
   - Killed by lock conflict
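
To pin the proposal down a little, the statuses could be carried as a
small enum on the backend side (the names below are hypothetical,
purely for illustration):

typedef enum LastVacuumStatus
{
	LVSTAT_COMPLETED,			/* vacuum ran to completion */
	LVSTAT_TRUNCATE_SKIPPED,	/* truncation phase was skipped */
	LVSTAT_PARTIALLY_TRUNCATED,	/* truncation yielded partway through */
	LVSTAT_SKIPPED,				/* whole vacuum skipped by lock failure */
	LVSTAT_KILLED_BY_LOCK		/* killed by lock conflict */
} LastVacuumStatus;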

This seems enough to find the cause of a table bloat. The same
discussion could be applied to analyze, but that might be a
separate issue.

There may be a better way to indicate vacuum soundness. Any
opinions and suggestions are welcome.

I'm going to make a patch to do the 'formal' one for the time
being.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logging idle checkpoints

2017-10-05 Thread Kyotaro HORIGUCHI
At Thu, 5 Oct 2017 13:41:42 +0200, Alvaro Herrera  
wrote in <20171005114142.dupjeqe2cnplhgkx@alvherre.pgsql>
> Kyotaro HORIGUCHI wrote:
> 
> > # This reminded me of a concern. I'd like to count vacuums that
> > # are required but skipped by lock failure, or killed by another
> > # backend.
> 
> We clearly need to improve the stats and logs related to vacuuming work
> executed, both by autovacuum and manually invoked.  One other item I
> have in my head is to report numbers related to the truncation phase of
> a vacuum run, since in some cases it causes horrible and hard to
> diagnose problems.  (Also, add a reloption to stop vacuum from doing
> the truncation phase at all -- for some usage patterns that is a serious
> problem.)
> 
> However, please do open a new thread about it.

Thanks! Will do after taking a bit of time to organize my thoughts.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Another oddity in handling of WCO constraints in postgres_fdw

2017-10-05 Thread Kyotaro HORIGUCHI
At Thu, 5 Oct 2017 18:08:50 +0900, Etsuro Fujita  
wrote in <60e94494-4e5d-afed-e482-b9ad1986b...@lab.ntt.co.jp>
> On 2017/10/04 21:28, Ashutosh Bapat wrote:
> > On Wed, Oct 4, 2017 at 5:32 PM, Robert Haas 
> > wrote:
> >> On Wed, Oct 4, 2017 at 6:40 AM, Ashutosh Bapat
> >>  wrote:
> >>> We can
> >>> check whether a row being sent from the local server to the foreign
> >>> server obeys WCO, but what foreign server does to that row is beyond
> >>> local server's scope.
> >>
> >> But I think right now we're not checking the row being sent from the
> >> local server, either.
> 
> We don't check the row *before* sending it to the remote server, but
> check the row returned by ExecForeignInsert/ExecForeignUpdate, which
> is allowed to have been changed by the remote server.  In
> postgres_fdw, we currently return the data actually inserted/updated
> if RETURNING or an AFTER TRIGGER is present, but not if only WCO is
> present.  So, for a postgres_fdw foreign table, WCO is enforced on
> the data that was actually inserted/updated if RETURNING/AFTER
> TRIGGER is present, and on the original data core supplied if only
> WCO is present, which is inconsistent behavior.
> 
> > Didn't 7086be6e3627c1ad797e32ebbdd232905b5f577f fix that?
> 
> No.  The commit addressed another issue.
> 
> >> The WCO that is being ignored isn't a
> >> constraint on the foreign table; it's a constraint on a view which
> >> happens to reference the foreign table.  It seems quite odd for the
> >> "assume constraints are valid" property of the foreign table to
> >> propagate back up into the view that references it.
> 
> Agreed.
> 
> > The view with WCO is local, but the modification which violates WCO
> > is being made on the remote server by a trigger on the remote table.
> > Trying to control that doesn't seem to be a good idea, just as we
> > can't control what rows get inserted on the foreign server when they
> > violate local constraints. I am using local constraints as a
> > precedent where we ignore what's happening on the remote side and
> > enforce whatever we can enforce locally. The local server should
> > make sure that any rows sent from the local server to the remote
> > server do not violate any local WCO.
> 
> Seems odd (and too restrictive) to me too.

Since WCO checks the finally inserted values, we can't avoid
actually requesting those values back. So just merging the WCO
columns into RETURNING in the deparsed query is OK. But can't we
concatenate returningList and withCheckOptionList at a higher level?
Specifically, just pass the calculated used_attr to
deparse(Insert|Update)Sql instead of passing returningList and
withCheckOptionList separately.  Deparsed queries forget the origin
of the requested columns anyway.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logging idle checkpoints

2017-10-04 Thread Kyotaro HORIGUCHI
At Tue, 3 Oct 2017 08:22:27 -0400, Stephen Frost  wrote in 
<2017100317.gj4...@tamriel.snowman.net>
> Greetings,
> 
> * Kyotaro HORIGUCHI (horiguchi.kyot...@lab.ntt.co.jp) wrote:
> > At Tue, 3 Oct 2017 10:23:08 +0900, Michael Paquier 
> >  wrote in 
> > 
> > > On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost  wrote:
> > > Since their introduction in
> > > 335feca4, m_timed_checkpoints and m_requested_checkpoints track the
> > > number of checkpoint requests, not whether a checkpoint has
> > > actually been executed, I am not sure that this should be changed after 10
> > > years. So, to put it in other words, wouldn't we want a way to track
> > > checkpoints that are *executed*, meaning that we could increment a
> > > counter after doing the skip checks in CreateRestartPoint() and
> > > CreateCheckPoint()?
> > 
> > This sounds reasonable to me.
> 
> I agree that tracking executed checkpoints is valuable, but, and perhaps
> I'm missing something, isn't that the same as tracking non-skipped
> checkpoints? I suppose we could have both, if we really feel the need,
> provided that doesn't result in more work or effort being done than
> simply keeping the count.  I'd hate to end up in a situation where we're
> writing things out unnecessarily just to keep track of checkpoints that
> were requested but ultimately skipped because there wasn't anything to
> do.

I'm fine with counting both executed and skipped. But perhaps the
time of the latest checkpoint fits the concern better, as with
vacuum. It is visible in the control file but not in system views.
If we count skipped checkpoints, I'd also like to see the time (or
LSN) of the last checkpoint in system views.

  checkpoints_timed   | bigint   |   |  | 
  checkpoints_req     | bigint   |   |  | 
+ checkpoints_skipped | bigint
+ last_checkpoint     | timestamp with time zone (or LSN?)


# This reminded me of a concern. I'd like to count vacuums that
# are required but skipped by lock failure, or killed by another
# backend.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Improve geometric types

2017-10-03 Thread Kyotaro HORIGUCHI
At Mon, 2 Oct 2017 08:23:49 -0400, Robert Haas  wrote in 

> On Mon, Oct 2, 2017 at 4:23 AM, Kyotaro HORIGUCHI
>  wrote:
> > For other potential reviewers:
> >
> > I found the origin of the function here.
> >
> > https://www.postgresql.org/message-id/4a90bd76.7070...@netspace.net.au
> > https://www.postgresql.org/message-id/AANLkTim4cHELcGPf5w7Zd43_dQi_2RJ_b5_F_idSSbZI%40mail.gmail.com
> >
> > And the reason for pg_hypot is seen here.
> >
> > https://www.postgresql.org/message-id/407d949e0908222139t35ad3ad2q3e6b15646a27d...@mail.gmail.com
> >
> > I think the replacement is now acceptable according to the discussion.
> > ==
> 
> I think if we're going to do this it should be separated out as its
> own patch.

+1

> Also, I think someone should explain what the reasoning
> behind the change is.  Do we, for example, foresee that the built-in
> code might be faster or less likely to overflow?  Because we're
> clearly taking a risk -- most trivially, that the BF will break, or
> more seriously, that some machines will have versions of this function
> that don't actually behave quite the same.
> 
> That brings up a related point.  How good is our test case coverage
> for hypot(), especially in strange corner cases, like this one
> mentioned in pg_hypot()'s comment:
> 
>  * This implementation conforms to IEEE Std 1003.1 and GLIBC, in that the
>  * case of hypot(inf,nan) results in INF, and not NAN.

I'm not sure how precisely we practically need them to be
identical.  FWIW, as a rough confirmation on my box, I compared
hypot and pg_hypot for the following arbitrarily chosen pairs of
parameters.


 {2.2e-308, 2.2e-308},
 {2.2e-308, 1.7e307},
 {1.7e307, 1.7e307},
 {1.7e308, 1.7e308},
 {2.2e-308, DBL_MAX},
 {1.7e308, DBL_MAX},
 {DBL_MIN, DBL_MAX},
 {DBL_MAX, DBL_MAX},
 {1.7e307, INFINITY},
 {2.2e-308, INFINITY},
 {0, INFINITY},
 {DBL_MIN, INFINITY},
 {INFINITY, INFINITY},
 {1, NAN},
 {INFINITY, NAN},
 {NAN, NAN},


Only the first pair gave slightly not-exactly-equal results, but it
seems to do no harm. hypot set the underflow flag.

 0: hypot=3.111269837220809e-308 (== 0.0 is 0, < DBL_MIN is 0)
   pg_hypot=3.11126983722081e-308 (== 0.0 is 0, < DBL_MIN is 0)
   equal=0,
   hypot(errno:0, inval:0, div0:0, of=0, uf=1),
   pg_hypot(errno:0, inval:0, div0:0, of=0, uf=0)

But I'm not sure whether both functions behave the same on other platforms.

> I'm potentially willing to commit a patch that just makes the
> pg_hypot() -> hypot() change and does nothing else, if there are not
> objections to that change, but I want to be sure that we'll know right
> away if that turns out to break.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#include <stdio.h>
#include <math.h>
#include <float.h>
#include <fenv.h>
#include <errno.h>

double
pg_hypot(double x, double y)
{
double  yx;

/* Handle INF and NaN properly */
if (isinf(x) || isinf(y))
return (double) INFINITY;

if (isnan(x) || isnan(y))
return (double) NAN;

/* Else, drop any minus signs */
x = fabs(x);
y = fabs(y);

/* Swap x and y if needed to make x the larger one */
if (x < y)
{
double  temp = x;

x = y;
y = temp;
}

/*
 * If y is zero, the hypotenuse is x.  This test saves a few cycles in
 * such cases, but more importantly it also protects against
 * divide-by-zero errors, since now x >= y.
 */
if (y == 0.0)
return x;

/* Determine the hypotenuse */
yx = y / x;
return x * sqrt(1.0 + (yx * yx));
}

void
setfeflags(int *invalid, int *divzero, int *overflow, int *underflow)
{
int err = fetestexcept(FE_INVALID | FE_DIVBYZERO |
   FE_OVERFLOW | FE_UNDERFLOW);
*invalid = ((err & FE_INVALID) != 0);
*divzero = ((err & FE_DIVBYZERO) != 0);
*overflow = ((err & FE_OVERFLOW) != 0);
*underflow = ((err & FE_UNDERFLOW) != 0);
}

int
main(void)
{
double x;
double y;
double p[][2] =
{
{2.2e-308, 2.2e-308},
{2.2e-308, 1.7e307},
{1.7e307, 1.7e307},
{1.7e308, 1.7e308},
{2.2e-308, DBL_MAX},
{1.7e308, DBL_MAX},
{DBL_MIN, DBL_MAX},
{DBL_MAX, DBL_MAX},
{1.7e307, INFINITY},
{2.2e-308, INFINITY},
{0, INFINITY},
{DBL_MIN, INFINITY},
{INFINITY, INFINITY},
{1, NAN},
{INFINITY, 

Re: [HACKERS] [PATCH] Improve geometric types

2017-10-03 Thread Kyotaro HORIGUCHI
Thanks.

At Mon, 2 Oct 2017 11:46:15 +0200, Emre Hasegeli  wrote in 

> > And this should be the last comment of mine on the patch set.
> 
> Thank you.  The new versions of the patches are attached addressing
> your comments.
> 
> > By the way, I found that MAXDOUBLEWIDTH has two inconsistent
> > definitions in formatting.c(500) and float.c(128). The definition
> > in the new float.h follows float.c; it seems better to leave it
> > untouched, and it would be another issue anyway.
> 
> The last version of the patch don't move these declarations to the header 
> file.

Yeah, it is not about the patch itself.

> > # The commit message of 0001 says "using C11 hypot()", but it
> > # has been there since C99. I suppose we should refer to C99 here,
> > # so I suggest rewording it in the next version if any.
> 
> Changed.
> 
> > close_pl got a bug fix not only refactoring. I think it is
> > recommended to separate bugs and improvements, but I'm fine with
> > the current patch.
> 
> I split the refactoring to the first patch.
> 
> > You added a sanity check "A==0 && B==0" (in Ax + By + C) in
> > line_recv. I'm not sure about the necessity of the check since it
> > has already been done in line_in, but anyway the comparisons seem
> > to be a typo (or thinko) for FPzero.
> 
> Tom Lane suggested [1] this one.  I now made it use FPzero().

I see. It's fine with me. I suppose that Tom didn't intend the
suggestion to be taken literally, so using FPzero() seems better
(at least for now).

> > dist_pl is changed to take the smaller distance of the two ends of
> > the segment. It seems to be absorbing error, so it might be better
> > to take the mean of the two distances. If you have a firm reason
> > for the change, it is better to write it down there; otherwise it
> > might be better left alone.
> 
> I don't really, so I left that part out.

Mmm, sorry. It's my mistake.

> [1] https://www.postgresql.org/message-id/11053.1466362319%40sss.pgh.pa.us

I'll look into the new version further.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2017-10-02 Thread Kyotaro HORIGUCHI
At Mon, 2 Oct 2017 12:43:19 +0200, Pavel Stehule  
wrote in 
> > Sorry, I forgot to take care of that. (And the definition of the
> > namespace array is of course fabricated by me.) I'd like to leave
> > this to committers.  Anyway, it is working, but the syntax (or
> > whether it is acceptable) is still arguable.
> >
> > SELECT xpath('/a/text()',
> >              '<a xmlns="http://example.com">test</a>',
> >              ARRAY[ARRAY['', 'http://example.com']]);
> > |  xpath
> > | -------
> > |  {test}
> > | (1 row)
> >
> >
> > The internal name is properly rejected, but the current internal
> > name (pgdefnamespace.pgsqlxml.internal) seems a bit too long. We
> > already reserve some short names and reject them as user-defined.
> > Wouldn't just 'pgsqlxml' work?
> >
> 
> LibXML2 trims names to 100 bytes, so pgdefnamespace.pgsqlxml.internal
> is safe from this perspective.
> 
> I would like to decrease the risk of a possible collision, so a longer
> string is better. Maybe "pgsqlxml.internal" is good enough - I have no
> strong opinion. But if this string is printed somewhere, then
> "pgdefnamespace.pgsqlxml.internal" has clear semantics, and that is the
> reason why I prefer it. PostgreSQL uses 63-byte names - and this string
> is fine there too.

Ok, I'm fine with that.

> > The default namespace is now correctly applied to bare attribute
> > names.
> >
> > > updated doc,
> > > fixed all variants of expected result test file
> >
> > Sorry for commenting one by one, but I found another misbehavior.
> >
> > create table t1 (id int, doc xml);
> > insert into t1
> >   values
> >   (5, '<rows xmlns="http://x.y"><row><a hoge="haha">50</a></row></rows>');
> > select x.* from t1, xmltable(XMLNAMESPACES('http://x.y' AS x),
> > '/x:rows/x:row' passing t1.doc columns data int PATH
> > 'child::x:a[1][attribute::hoge="haha"]') as x;
> > |  data
> > | --
> > |50
> >
> > but the following fails.
> >
> > select x.* from t1, xmltable(XMLNAMESPACES(DEFAULT 'http://x.y'),
> > '/rows/row' passing t1.doc columns data int PATH
> > 'child::a[1][attribute::hoge="haha"]') as x;
> > |  data
> > | --
> > |
> > | (1 row)
> >
> > Perhaps child::a is not prefixed by the transformation.
> >
> > XPath might be complex enough that it's worth switching to a
> > yacc/lex-based transformer that is formally verifiable and won't
> > need a bunch of cryptic tests that ultimately cannot prove
> > completeness. synchronous_standby_names is far simpler than XPath
> > but uses a yacc/lex parser.
> >
> 
> I don't think so (not yet) - it is a simple state machine now, and
> once the code is stable it will not be modified.

Hmm. OK, agreed. I didn't mean that the current shape ought to be
changed.

> Thank you for the comments, I'll look into it


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logging idle checkpoints

2017-10-02 Thread Kyotaro HORIGUCHI
At Tue, 3 Oct 2017 10:23:08 +0900, Michael Paquier  
wrote in 
> On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost  wrote:
> > I certainly don't care for the idea of adding log messages saying we
> > aren't doing anything just to match a count that's incorrectly claiming
> > that checkpoints are happening when they aren't.
> >
> > The down-thread suggestion of keeping track of skipped checkpoints might
> > be interesting, but I'm not entirely convinced it really is.  We have
> > time to debate that, of course, but I don't really see how that's
> > helpful.  At the moment, it seems like the suggestion to add that column
> > is based on the assumption that we're going to start logging skipped
> > checkpoints and having that column would allow us to match up the count
> > between the new column and the "skipped checkpoint" messages in the logs
> > and I can not help but feel that this is a ridiculous amount of effort
> > being put into the analysis of something that *didn't* happen.
> 
> Being able to look at how many checkpoints are skipped can be used as
> a tuning indicator of max_wal_size and checkpoint_timeout; in short,
> increase them if checkpoints remain idle.

We usually adjust the GUCs based on how often a checkpoint is
*executed* and how many of the executed checkpoints were triggered
by xlog progress (or ran at a shorter interval than the timeout).
That seems enough. Counting skipped checkpoints gives just a rough
estimate of how long the system received no substantial updates. I
doubt users get anything valuable from counting skipped
checkpoints.

> Since their introduction in
> 335feca4, m_timed_checkpoints and m_requested_checkpoints track the
> number of checkpoint requests, not whether a checkpoint has actually
> been executed, so I am not sure that this should be changed after 10
> years. So, to put it in other words, wouldn't we want a way to track
> checkpoints that are *executed*, meaning that we could increment a
> counter after doing the skip checks in CreateRestartPoint() and
> CreateCheckPoint()?

This sounds reasonable to me.

CreateRestartPoint() already returns ckpt_performed, which is used
to let the checkpointer retry in 15 seconds rather than wait for
the next checkpoint_timeout. Checkpoints might deserve the same
treatment on skipping.
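
For reference, the retry logic looks roughly like this (paraphrased
from checkpointer.c of that era; treat the details as approximate):

	if (ckpt_performed)
		last_checkpoint_time = now;
	else
	{
		/*
		 * The restartpoint could not be performed, most likely because
		 * no new safe checkpoint record has arrived yet.  Back-date the
		 * timestamp so that we retry in about 15 seconds instead of
		 * waiting for a full checkpoint_timeout.
		 */
		last_checkpoint_time = now - CheckPointTimeout + 15;
	}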

By the way, CreateRestartPoint emits DEBUG2 messages on skipping.
Although a restartpoint has different characteristics from a
checkpoint, if we change the message level for CreateCheckPoint
(currently DEBUG1), CreateRestartPoint should probably get the same
change.  (Otherwise, at least, they ought to share the same message
level?)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS][BUG] Cache invalidation for queries that contains const of temporary composite type

2017-10-02 Thread Kyotaro HORIGUCHI
Hi,

Sorry, I saw this once but somehow my attention drifted away along
the way.

At Tue, 3 Oct 2017 02:41:34 +0300, Alexander Korotkov 
 wrote in 

> On Tue, Oct 3, 2017 at 12:20 AM, Maksim Milyutin 
> wrote:
> 
> > I have tested the following case:
> >
> > create type pair as (x int, y int);
> > prepare test as select json_populate_record(null::pair, '{"x": 1, "y":
> > 2}'::json);
> > drop type pair cascade;
> >
> > execute test;
> >
> > -- The following output is obtained before patch
> > ERROR:  cache lookup failed for type 16419
> >
> > -- After applying patch
> > ERROR:  type "pair" does not exist
> >
> > But after recreating 'pair' type I'll get the following message:
> > ERROR:  cached plan must not change result type
> >
> > I don't know whether that's the right behavior. Anyhow, your point is
> > good motivation to experiment and investigate different scenarios of
> > working with cached plans that depend on a non-stable type. Thanks for
> > that.
> >
> 
> I think that ideally the cached plan should be automatically invalidated
> and the stored procedure should work without error.
> Not really sure if it's feasible...

Without the patch, dropping a table used in a prepared query
results in a similar error. So I suppose it's the desired
behavior in this case.

execute test;
| ERROR:  relation "t3" does not exist


The first thought the patch gave me is that the problem is not
limited to constants. Actually, the following sequence also
reproduces a similar failure even with this patch.

create table t2 (x int , y int);
create type pair as (x int, y int);
prepare test as select row(x, y)::pair from t2;
drop type pair;
execute test;
| ERROR:  cache lookup failed for type 16410

In this case the offending expression has the following form.

  TargetEntry (
expr = (
  RowExpr:
typeid = 16410,
row_format = COERCE_EXPLICIT_CAST,
args = List (Var(t2.x), Var(t2.y))
)
  )


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2017-10-02 Thread Kyotaro HORIGUCHI
Hi, thanks for the new patch.

# The patch is missing xpath_parser.h. The one from the first patch was usable.

At Thu, 28 Sep 2017 07:59:41 +0200, Pavel Stehule  
wrote in 
> Hi
> 
> now xpath and xpath_exists supports default namespace too

At Wed, 27 Sep 2017 22:41:52 +0200, Pavel Stehule  
wrote in 
> > 1. Uniformity among simliar features
> >
> >   As mentioned in the proposal, but it is lack of uniformity that
> >   the xpath transformer is applied only to xmltable and not for
> >   other xpath related functions.
> >
> 
> I have to fix the XPath function. The SQL/XML function Xmlexists doesn't
> support namespaces/

Sorry, I forgot to take care of that. (And the definition of the
namespace array is of course fabricated by me.) I'd like to leave
this to committers.  Anyway, it is working, but the syntax (or
whether it is acceptable) is still arguable.

SELECT xpath('/a/text()', '<a xmlns="http://example.com">test</a>',
             ARRAY[ARRAY['', 'http://example.com']]);
|  xpath  
| --------
|  {test}
| (1 row)


The internal name is properly rejected, but the current internal
name (pgdefnamespace.pgsqlxml.internal) seems a bit too long. We
already reserve some short names and reject them as user-defined.
Wouldn't just 'pgsqlxml' work?


The default namespace is now correctly applied to bare attribute
names.

> updated doc,
> fixed all variants of expected result test file

Sorry for commenting one by one, but I found another misbehavior.

create table t1 (id int, doc xml);
insert into t1 
   values
   (5, '<rows xmlns="http://x.y"><row><a hoge="haha">50</a></row></rows>');
select x.* from t1, xmltable(XMLNAMESPACES('http://x.y' AS x), '/x:rows/x:row' 
passing t1.doc columns data int PATH 'child::x:a[1][attribute::hoge="haha"]') 
as x;
|  data 
| --
|50

but the following fails.

select x.* from t1, xmltable(XMLNAMESPACES(DEFAULT 'http://x.y'), '/rows/row' 
passing t1.doc columns data int PATH 'child::a[1][attribute::hoge="haha"]') as 
x;
|  data 
| --
|  
| (1 row)

Perhaps child::a is not prefixed by the transformation.

XPath might be complex enough that it's worth switching to a
yacc/lex-based transformer that is formally verifiable and won't
need a bunch of cryptic tests that ultimately cannot prove
completeness. synchronous_standby_names is far simpler than XPath
but uses a yacc/lex parser.


Anyway, the following are nitpicks about the current xpath_parser.c.

- NODENAME_FIRSTCHAR allows '-' as the first char, but it is
  excluded from NameStartChar (https://www.w3.org/TR/REC-xml/#NT-NameStartChar).
  I think characters with the high bit set are okay.
  IS_NODENAME_CHAR should also be changed.

- NODENAME_FIRSTCHAR and IS_NODENAME_CHAR are in the same category
  but have different naming schemes. Could they be named the same way?

- The current transformer seems to use at most one token of stack
  depth. Maybe the stack is needless. (A pushed token is always
  popped right after.)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Improve geometric types

2017-10-02 Thread Kyotaro HORIGUCHI
Hello. Thank you for the new version.

0001: applies cleanly. regress passed.
  This mainly refactors geo_ops.c and replaces pg_hypot with hypot(3).
0002: applies cleanly. regress passed.
  This just replaces the float-ops macros with inline functions.
0003: applies cleanly. regress passed.
  Replaces double with float8 and bare arithmetic with defined functions.
0004: applies cleanly. regress passed.
  Fixes broken line-related functions.
  I have some comments on this (later).

At Wed, 27 Sep 2017 17:44:52 +0200, Emre Hasegeli  wrote in 

> The new versions of the patches are attached addressing your comments.
> 
> > C++ surely inlines plain static functions, but I'm not sure a C
> > compiler does that.
> 
> Thank you for your explanation.  I marked the mentioned functions "inline".

Thanks.

> > So we should be safe with a buffer of 26 bytes; 500 bytes is
> > apparently too large, and even 128 will be too loose in most
> > cases. So how about something like the following?
> >
> > #define MINDOUBLEWIDTH 32
> 
> I left this part out for now.  We can improve it separately.

Agreed. I found that psprintf does that with an initial length of
128.

By the way, I found that MAXDOUBLEWIDTH has two inconsistent
definitions, in formatting.c (500) and float.c (128). The definition
in the new float.h follows float.c; it seems better to leave it
untouched, and it would be another issue anyway.


At Wed, 27 Sep 2017 17:45:04 +0200, Emre Hasegeli  wrote in 

> > The patch replaces pg_hypot with hypot from libc. The man page
> > says the following.
...
> I included them on the latest version.

# The commit message of 0001 says "using C11 hypot()", but it
# has been there since C99. I suppose we should refer to C99 here,
# so I suggest rewording it in the next version if any.

It works better than expected.  I confirmed that
pg_hypot(DBL_MAX, DBL_MAX) returns a value equivalent to
HUGE_VAL * HUGE_VAL on glibc, but I'm not sure the behavior on
other platforms is the same.

==
For other potential reviewers:

I found the origin of the function here.

https://www.postgresql.org/message-id/4a90bd76.7070...@netspace.net.au
https://www.postgresql.org/message-id/AANLkTim4cHELcGPf5w7Zd43_dQi_2RJ_b5_F_idSSbZI%40mail.gmail.com

And the reason for pg_hypot is seen here.

https://www.postgresql.org/message-id/407d949e0908222139t35ad3ad2q3e6b15646a27d...@mail.gmail.com

I think the replacement is now acceptable according to the discussion.
==

And this should be the last comment of mine on the patch set.

In 0004, line_distance and line_interpt_internal seem correct. The
seemingly redundant horizontal/vertical checks are there because of
the unclearly defined EPSILON behavior. lseg_perp seems correct.

close_pl got a bug fix, not only refactoring. It is usually
recommended to separate bug fixes from improvements, but I'm fine
with the current patch.


You added a sanity check "A==0 && B==0" (in Ax + By + C) in
line_recv. I'm not sure about the necessity of the check since it
has already been done in line_in, but anyway the comparisons seem
to be a typo (or thinko) for FPzero.

dist_pl is changed to take the smaller distance of the two ends of
the segment. It seems to be absorbing error, so it might be better
to take the mean of the two distances. If you have a firm reason
for the change, it is better to write it down there; otherwise it
might be better left alone.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Walsender timeouts and large transactions

2017-10-01 Thread Kyotaro HORIGUCHI
Hello Sokolov.

At Fri, 29 Sep 2017 15:19:23 +0300, Sokolov Yura  
wrote in 
> I don't want to make the test last so long and generate so much data.
> That is why I used such small timeouts for the tests.

I understand your point, but still *I* think such a short timeout
is outside the design expectations. (Even though it can be set.)
Does anyone have opinions on this?

> The test fails if there is a "short quit" after
> `!pq_is_send_pending()`,
> so I doubt your patch will pass the test.

That is because I think the test "should" fail, since the timeout
is out of the expected range. I (and perhaps also Petr) think that
the problem is just that a large transaction causes a timeout even
with an ordinary timeout setting. My test case is based on that
assumption.

Your test is for a timeout during replication startup with an
extremely short timeout. This may be a different problem to
discuss, but perhaps it is better solved together.

I'd like to have opinions from others on this point.

> And you've changed the calculated sleep time, with sane waiting on all
> interesting events (using WaitLatchOrSocket), into a semi-busy loop.
> That could at least affect throughput.

Ugh! I misunderstood there. It waits for the socket to become
writable, so the sleep is wrong and WaitLatchOrSocket is right.

After all, I give +1 to Petr's latest patch. Sorry for my
carelessness.

> And why did you remove `SetLatch(MyLatch)` at the end of the function?
> Probably this change is correct, but it is not obvious.

The latch would be needless there if it waited for a fixed duration,
but since it waits for write events on the socket, the latch should
also be waited on.
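
Concretely, I think the wait in that path should look something like
the following (a sketch using the latch API of that era; the exact
flag set depends on the call site):

	/* wait for the client socket to become writable, a latch set,
	 * or the computed sleep time to elapse */
	(void) WaitLatchOrSocket(MyLatch,
							 WL_LATCH_SET | WL_SOCKET_WRITEABLE | WL_TIMEOUT,
							 MyProcPort->sock, sleeptime,
							 WAIT_EVENT_WAL_SENDER_WRITE_DATA);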


> > Any thoughts?
> 
> It certainly could be that my test and my patch are wrong. But my
> point is that the test should be written first, especially for such
> a difficult case. Without a test it is impossible to say whether our
> patches fix anything, whether a patch does something wrong, or
> whether a patch fixes this problem but introduces a new one.
> 
> Please write tests for your remarks. If you think my patch breaks
> something, write a test for the case my patch broke. If you think
> my test is wrong, write your own test that is more correct.
> 
> Without tests it will be just birds' hubbub.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug with pg_basebackup and 'shared' tablespace

2017-10-01 Thread Kyotaro HORIGUCHI
Thanks for the objection with clear reasoning.

For clarity, I first proposed to prohibit servers of different
versions from sharing the same tablespace directory.
https://www.postgresql.org/message-id/20170406.160844.120459562.horiguchi.kyot...@lab.ntt.co.jp

And I got a -1 saying that it is just a revert to the previous
behavior (which is exactly what the patch intended, though) and was
persuaded to take the approach in this thread, so here I am.

At Fri, 29 Sep 2017 13:43:22 -0400, Robert Haas  wrote 
in 
> On Fri, Sep 29, 2017 at 2:06 AM, Michael Paquier
>  wrote:
> > My tendency about this patch is still that it should be rejected. This
> > is presenting additional handling for no real gain.
> 
> I vehemently disagree.  If the server lets you create a tablespace,
> then everything that happens after that ought to work.
> 
> On another thread, there is the issue that if you create a tablespace
> inside $PGDATA, things break.  We should either unbreak those things

pg_basebackup copies the tablespace twice, some maintenance
commands give wrong results, or a careless cleanup script can blow
away part of the data.

> or not allow creating the tablespace in the first place.  On this
> thread, there is the issue that if you create two tablespaces for
> different PG versions in the same directory, things break.  We should

The server never accesses anything outside the /CATVER/ directory
within the tablespace directory, and servers with different versions
can share the tablespace directory (I think this is a bug).
pg_upgrade will complain if it finds the destination CATVER
directory already created, even though no data will be broken.

Just a clarification: without "fixing" the problem, users may get
punished by pg_basebackup later. If we "fix" it this way,
pg_basebackup will work in this case, but in turn pg_upgrade may
punish them later. May I assume that we agree on this point?

> either unbreak those things or not allow creating the tablespace in
> the first place.
> 
> It is completely awful behavior to let users do things and then punish
> them later for having done them.  Users are not obliged to read the
> minds of the developers and guess what things the developers consider
> "reasonable".  They should be able to count on the principle that if
> they do something that we consider wrong, they'll get an error when
> they try to do it -- not have it superficially appear to work and then
> explode later.
> 
> To put that another way, there should be ONE rule about what is or is
> not allowable in a particular situation, and all commands, utilities,
> etc. that deal with that situation should handle it in a uniform
> fashion.  Each .c file shouldn't get to make up its own notion of what
> is or is not supported.

Anyway, currently the server and pg_basebackup disagree on this
point. If the "just reverting" patch above is not rejected again,
I'll resume working on it. Or is there another way to go? This is
not an issue that ought to take a long time.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Walsender timeouts and large transactions

2017-09-29 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 27 Sep 2017 14:28:37 +0300, Sokolov Yura  
wrote in <90bb67da7131e6186b50897c4b0f0...@postgrespro.ru>
> On 2017-09-12 11:28, Kyotaro HORIGUCHI wrote:
> > Hello,
> > At Wed, 06 Sep 2017 13:46:16 +, Yura Sokolov
> >  wrote in
> > <20170906134616.18925.88390.p...@coridan.postgresql.org>
> > As a result, I think that the functions should be modified as
> > follows.
> > - Forcing the slow path if more than half of a ping period has
> >   elapsed is right. (GetCurrentTimestamp is required anyway.)
> > - The slow path should not sleep waiting on the latch. It should just
> >   pg_usleep() for maybe 1-10ms.
> > - We should go to the fast path just after a keepalive or response
> >   message has been sent. In other words, the "if (now <" block
> >   should be inside the "for (;;)" loop. This avoids needless runs
> >   on the slow path.
> > It could be refactored as follows.
> >   prepare for the send buffer;
> >   for (;;)
> >   {
> > now = GetCurrentTimeStamp();
> > if (now < )...
> > {
> >   fast-path
> > }
> > else
> > {
> >   slow-path
> > }
> > return if finished
> > sleep for 1ms?
> >   }
> > What do you think about this?
> > regards,
> > --
> > Kyotaro Horiguchi
> > NTT Open Source Software Center
> 
> Good day, Petr, Kyotaro
> 
> I've created a failing test for the issue (0001-Add-failing-test...).
> It tests insertion of 2 rows with a 10ms wal_sender_timeout
> (it fails in WalSndWriteData on master) and then deletion of
> those rows with a 1ms wal_sender_timeout (it fails in WalSndLoop).
> 
> Neither Petr's patch nor my simplified suggestion passed the
> test. I didn't check Kyotaro's suggestion, though, because I
> didn't understand it well.

Mmm. The test seems broken. wal_sender_timeout = 10ms with
wal_receiver_status_interval = 10s immediately causes a
timeout. Avoiding that timeout just breaks otherwise-sane code.

wal_sender_timeout = 3s and wal_receiver_status_interval = 1s
effectively reproduce the problem with about 100 rows of (int)
insertion over a UNIX-socket connection on my poor box.

The original complaint here came from the fact that WalSndWriteData
skips processing replies for a long time on a fast network.
Although Petr's patch fixed that problem, I pointed out that just
letting the function take the slow path leads to another problem:
waiting for new WAL records can result in an unwanted pause in the
slow path.

Combining the solutions to the two problems is my proposal cited
above. The wording there is in quite bad style, but the attached
file is the concrete patch for it.

Any thoughts?

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***
*** 1151,1156  static void
--- 1151,1158 
  WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  bool last_write)
  {
+ 	TimestampTz now = GetCurrentTimestamp();
+ 
  	/* output previously gathered data in a CopyData packet */
  	pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);
  
***
*** 1160,1235  WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  	 * several releases by streaming physical replication.
  	 */
  	resetStringInfo(&tmpbuf);
! 	pq_sendint64(&tmpbuf, GetCurrentTimestamp());
  	memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
  		   tmpbuf.data, sizeof(int64));
  
- 	/* fast path */
- 	/* Try to flush pending output to the client */
- 	if (pq_flush_if_writable() != 0)
- 		WalSndShutdown();
- 
- 	if (!pq_is_send_pending())
- 		return;
- 
  	for (;;)
  	{
  		int			wakeEvents;
  		long		sleeptime;
  		TimestampTz now;
  
  		/*
! 		 * Emergency bailout if postmaster has died.  This is to avoid the
! 		 * necessity for manual cleanup of all postmaster children.
  		 */
! 		if (!PostmasterIsAlive())
! 			exit(1);
! 
! 		/* Clear any already-pending wakeups */
! 		ResetLatch(MyLatch);
! 
! 		CHECK_FOR_INTERRUPTS();
! 
! 		/* Process any requests or signals received recently */
! 		if (ConfigReloadPending)
! 		{
! 			ConfigReloadPending = false;
! 			ProcessConfigFile(PGC_SIGHUP);
! 			SyncRepInitConfig();
! 		}
! 
! 		/* Check for input from the client */
! 		ProcessRepliesIfAny();
! 
! 		/* Try to flush pending output to the client */
! 		if (pq_flush_if_writable() != 0)
! 			WalSndShutdown();
! 
! 		/* If we finished clearing the buffered data, we're done here. */
! 		if (!pq_is_send_pending())
! 			break;
! 
  		now = GetCurrentTimestamp();
- 
- 		/* die if timeout was reached */
- 		WalSndCheckTimeOut(now);
- 
- 		/

Re: [HACKERS] Bug with pg_basebackup and 'shared' tablespace

2017-09-28 Thread Kyotaro HORIGUCHI
I think that we can take a more robust way to detect the CATVER
directory. Just checking whether it is a top-level directory will
work. That is, making the following change.

-   if (firstfile && !basetablespace)
+   /* copybuf must contain at least one '/' here */
+   if (!basetablespace && strchr(copybuf, '/')[1] == 0)

This condition hits exactly the CATVER directories, without being
disturbed by the file ordering of the tar stream.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-25 Thread Kyotaro HORIGUCHI
At Fri, 22 Sep 2017 17:15:08 +0300, Sokolov Yura  
wrote in 
> On 2017-09-22 16:22, Sokolov Yura wrote:
> > On 2017-09-22 11:21, Masahiko Sawada wrote:
> >> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI
> >>  wrote:
> >>> At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada
> >>>  wrote in
> >>> 
> >>>> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
> >>>>  wrote:
> >>>> > I was just looking the thread since it is found left alone for a
> >>>> > long time in the CF app.
> >>>> >
> >>>> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan  
> >>>> > wrote
> >>>> > in
> >>>> > 
> >>>> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund 
> >>>> >> wrote:
> >>>> >> > Hi,
> >>>> >> >
> >>>> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
> >>>> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas
> >>>> >> >>  wrote:
> >>>> >> >> [ lots of valuable discussion ]
> >>>> >> >
> >>>> >> > I think this patch clearly still is in the design stage, and has
> >>>> >> > received plenty feedback this CF.  I'll therefore move this to the
> >>>> >> > next
> >>>> >> > commitfest.
> >>>> >>
> >>>> >> Does anyone have ideas on a way forward here? I don't, but then I
> >>>> >> haven't thought about it in detail in several months.
> >>>> >
> >>>> > Is the additional storage in metapage to store the current status
> >>>> > of vaccum is still unacceptable even if it can avoid useless
> >>>> > full-page scan on indexes especially for stable tables?
> >>>> >
> >>>> > Or, how about additional 1 bit in pg_stat_*_index to indicate
> >>>> > that the index *don't* require vacuum cleanup stage. (default
> >>>> > value causes cleanup)
> >>>> You meant that "the next cycle" is the lazy_cleanup_index() function
> >>>> called by lazy_scan_heap()?
> >>> Both finally call btvacuumscan under a certain condition, but
> >>> what I meant by "the next cycle" is the lazy_cleanup_index call
> >>> in the next round of vacuum, since the abstract layer (index AM)
> >>> isn't conscious of the details of btree.
> >>> 
> >>>> > index_bulk_delete (or ambulkdelete) returns the flag in
> >>>> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in
> >>>> > stats and in the next cycle it is looked up to decide the
> >>>> > necessity of index cleanup.
> >>>> >
> >>>> Could you elaborate on this? For example, in a btree index, the
> >>>> index cleanup skips the scan if index_bulk_delete has been
> >>>> called during vacuuming, because stats != NULL. So I think we
> >>>> don't need such a flag.
> >>> The flag works so that two successive index full scans don't
> >>> happen in one vacuum round. If any rows were deleted, the
> >>> btvacuumcleanup that just follows does nothing.
> >>> I think what you wanted to solve here was the problem that
> >>> index_vacuum_cleanup runs a full scan even if it ends with no
> >>> actual work, for manual or anti-wraparound vacuums.  (I'm
> >>> getting a bit confused on this..) It is caused by using the
> >>> pointer "stats" as the flag instructing it to do that. If the
> >>> stats-as-a-flag worked as expected, the GUC doesn't seem to be
> >>> required.
> >> Hmm, my proposal is like this: if a table hasn't changed much since
> >> the previous vacuum, we skip cleaning up the index.
> >> If the table has at least one garbage tuple, we do the
> >> lazy_vacuum_index and then an IndexBulkDeleteResult is stored, which
> >> causes the btvacuumcleanup to be skipped. On the other hand, if the
> >> table doesn't have any garbage but some new tuples were inserted
> >> since the previous vacuum, we don't do the lazy_vacuum_index but do
> >> the lazy_cleanup_index. In this case, we always do the
> >> lazy_cleanup_index (i.e., we do the full scan) even if only one
> >> tuple was inserted. That's why I proposed a new GUC parameter which
> >> allows us to skip the lazy_cleanup_index in that case.

Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-25 Thread Kyotaro HORIGUCHI
At Mon, 25 Sep 2017 19:20:07 +0900, Masahiko Sawada  
wrote in 
> >> * we stash an XID when a btree page is deleted, which is used to
> >> determine when it's finally safe to recycle the page
> >
> > Is it a "problem" of this proposal?
> >
> 
> As Peter explained before[1], the problem is that there is an XID
> stored in dead btree pages that is used in the subsequent
> RecentGlobalXmin interlock that determines if recycling is safe.
> 
> [1] 
> https://www.postgresql.org/message-id/CAH2-Wz%3D1%3Dt5fcGGfarQGcAWBqaCh%2BdLMjpYCYHpEyzK8Qg6OrQ%40mail.gmail.com

Yeah, I know that, and I understand that it is the reason why it is
bad to just skip the cleanup based on the GUC regardless.

On the other hand, by looking at the recycle status, I think we
don't need the GUC. (I believe) the patch is *a PoC* in that
direction. (I'd like to let RecordPageWithFreeSpace return the
previous value to avoid a duplicate FSM search..)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2017-09-25 Thread Kyotaro HORIGUCHI
ion and such name is
 hardly used, a namespace prefix with the name should be
 rejected.


4. A mistake in the documentation?

The documentation describes the XMLNAMESPACES clause as follows.

https://www.postgresql.org/docs/10/static/functions-xml.html
> The optional XMLNAMESPACES clause is a comma-separated list of
> namespaces. It specifies the XML namespaces used in the document
> and their aliases.

As far as I can tell from XmlTableSetNamespace (and if I read the
documentation correctly), the defined namespaces are applied not to
the document but to the expressions. This patch is not to blame for
this, so it will be another patch, back-patchable to PG 10.

| The optional XMLNAMESPACES clause is a comma-separated list of
| namespaces. It specifies the XML namespaces used in the
| row_expression.



regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-22 Thread Kyotaro HORIGUCHI
At Fri, 22 Sep 2017 17:21:04 +0900, Masahiko Sawada  
wrote in 
> On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI
>  wrote:
> > At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada  
> > wrote in 
> > 
> >> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
> >>  wrote:
> >> Could you elaborate on this? For example, in a btree index, the
> >> index cleanup skips the scan if index_bulk_delete has been
> >> called during vacuuming, because stats != NULL. So I think we
> >> don't need such a flag.
> >
> > The flag works so that two successive index full scans don't
> > happen in one vacuum round. If any rows were deleted, the
> > btvacuumcleanup that just follows does nothing.
> >
> > I think what you wanted to solve here was the problem that
> > index_vacuum_cleanup runs a full scan even if it ends with no
> > actual work, for manual or anti-wraparound vacuums.  (I'm
> > getting a bit confused on this..) It is caused by using the
> > pointer "stats" as the flag instructing it to do that. If the
> > stats-as-a-flag worked as expected, the GUC doesn't seem to be
> > required.
> 
> Hmm, my proposal is like this: if a table hasn't changed much since
> the previous vacuum, we skip cleaning up the index.
> 
> If the table has at least one garbage tuple, we do the
> lazy_vacuum_index and then an IndexBulkDeleteResult is stored, which
> causes the btvacuumcleanup to be skipped. On the other hand, if the
> table doesn't have any garbage but some new tuples were inserted
> since the previous vacuum, we don't do the lazy_vacuum_index but do
> the lazy_cleanup_index. In this case, we always do the
> lazy_cleanup_index (i.e., we do the full scan) even if only one
> tuple was inserted. That's why I proposed a new GUC parameter.

I think the problem raised in this thread is that the last index
scan may leave dangling pages.

> > In addition to that, as Simon and Peter pointed out,
> > index_bulk_delete can leave not-fully-removed pages (so-called
> > half-dead pages, and pages that are recyclable but not registered
> > in the FSM, AFAICS) in some cases, mainly due to the
> > RecentGlobalXmin interlock. In that case, just inhibiting the
> > cleanup scan by a threshold lets such dangling pages persist in
> > the index. (I couldn't produce that many dangling pages, though..)
> >
> > The first patch in the mail (*1) does that. It seems to have some
> > bugs, though..
> >
> >
> > Since the dangling pages persist until autovacuum decides to scan
> > the owning table again, we should run a vacuum round (or
> > index_vacuum_cleanup itself) even with no dead rows if we want
> > to clean up such pages within a certain period. The second patch
> > does that.
> >
> 
> IIUC half-dead pages are not relevant to this proposal. The proposal
> has two problems:
> 
> * By skipping index cleanup we could leave recyclable pages that are
> not marked as recyclable.

Yes.

> * we stash an XID when a btree page is deleted, which is used to
> determine when it's finally safe to recycle the page

Is it a "problem" of this proposal?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-22 Thread Kyotaro HORIGUCHI
I apologize in advance for possible silliness.

At Thu, 21 Sep 2017 13:54:01 -0300, Claudio Freire  
wrote in 
> On Tue, Sep 19, 2017 at 8:55 PM, Peter Geoghegan  wrote:
> > On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire  
> > wrote:
> >> Maybe this is looking at the problem from the wrong direction.
> >>
> >> Why can't the page be added to the FSM immediately and the check be
> >> done at runtime when looking for a reusable page?
> >>
> >> Index FSMs currently store only 0 or 255, couldn't they store 128 for
> >> half-recyclable pages and make the caller re-check reusability before
> >> using it?
> >
> > No, because it's impossible for them to know whether or not the page
> > that their index scan just landed on recycled just a second ago, or
> > was like this since before their xact began/snapshot was acquired.
> >
> > For your reference, this RecentGlobalXmin interlock stuff is what
> > Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
> > Nodes". Seems pretty hard to do it any other way.
> 
> I don't see the difference between a vacuum run and distributed
> maintainance at _bt_getbuf time. In fact, the code seems to be in
> place already.

The pages that RecentGlobalXmin prohibits from being registered as
"free" cannot be grabbed by _bt_getbuf, since such a page is linked
from nowhere and the FSM doesn't offer it as "free".

> _bt_page_recyclable seems to prevent old transactions from treating
> those pages as recyclable already, and the description of the
> technique in 2.5 doesn't seem to preclude doing the drain while doing
> other operations. In fact, Lehman even considers the possibility of
> multiple concurrent garbage collectors.

_bt_page_recyclable prevents a vacuum scan from discarding pages
that might still be reachable from an active transaction, and the
"drain" itself is a technique to prevent freeing still-active
pages, so a scan using the "drain" technique can be executed
freely alongside other transactions. The paper might allow
concurrent GCs (or vacuums), but our nbtree says that no
concurrent vacuum is assumed. Er... here it is.

nbtpage.c:1589: _bt_unlink_halfdead_page
| * right.  This search could fail if either the sibling or the target page
| * was deleted by someone else meanwhile; if so, give up.  (Right now,
| * that should never happen, since page deletion is only done in VACUUM
| * and there shouldn't be multiple VACUUMs concurrently on the same
| * table.)
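
For reference, the recyclability test itself boils down to the
following (a paraphrased sketch of _bt_page_recyclable, not the
verbatim source):

static bool
page_recyclable(Page page)		/* sketch of _bt_page_recyclable() */
{
	BTPageOpaque opaque;

	/* A never-initialized page can always be reused. */
	if (PageIsNew(page))
		return true;

	/*
	 * Otherwise only a deleted page whose deletion XID precedes every
	 * running snapshot (RecentGlobalXmin) is safe to recycle.
	 */
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	return P_ISDELETED(opaque) &&
		TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin);
}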


> It's only a matter of making the page visible in the FSM in a way that
> can be efficiently skipped if we want to go directly to a page that
> actually CAN be recycled to avoid looping forever looking for a
> recyclable page in _bt_getbuf. In fact, that's pretty much Lehman's

Mmm. What _bt_getbuf does is recheck the page given by the FSM as a
"free page". If the FSM gives no more pages, it just tries to extend
the index relation. Or am I reading you wrongly?
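
To be specific, the path I mean is roughly this (a condensed sketch
of _bt_getbuf's new-page branch; error handling omitted):

	/* Try pages the FSM offers, rechecking each; else extend the file. */
	for (;;)
	{
		blkno = GetFreeIndexPage(rel);
		if (blkno == InvalidBlockNumber)
			break;				/* FSM has nothing more to offer */

		buf = ReadBuffer(rel, blkno);
		if (ConditionalLockBuffer(buf))
		{
			page = BufferGetPage(buf);
			if (_bt_page_recyclable(page))
				return buf;		/* reuse this page */
			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
		}
		ReleaseBuffer(buf);		/* not reusable; try the next one */
	}

	/* No usable free page found: extend the index relation. */
	buf = ReadBuffer(rel, P_NEW);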

> drain technique right there. FSM entries with 128 would be "the
> queue", and FSM entries with 255 would be "the freelist". _bt_getbuf
> can be the GC getting a buffer to try and recycle, give up after a few
> tries, and get an actual recycleable buffer instead (or extend the
> relationship). In essence, microvacuum.
> 
> Unless I'm missing something and RecentGlobalXmin somehow doesn't
> exclude all old transactions, I don't see a problem.
> 
> Lanin & Shasha use reference counting to do GC wait during the drain,
> and most of the ordering of operations needed there is because of
> that, but using the xmin seems to make all those considerations moot.
> An xact earlier than RecentGlobalXmin cannot have active transactions
> able to follow links to that page AFAIK.
> 
> TBH, I didn't read the whole papers, though I probably will.
> 
> But, in essence, what's the difference of vacuum doing
> 
> if (_bt_page_recyclable(page))
> {
> /* Okay to recycle this page */
> RecordFreeIndexPage(rel, blkno);
> vstate->totFreePages++;
> stats->pages_deleted++;
> }
> 
> VS doing it in _bt_getbuf?
> 
> What am I missing?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-22 Thread Kyotaro HORIGUCHI
At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada  
wrote in 
> On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
>  wrote:
> > I was just looking the thread since it is found left alone for a
> > long time in the CF app.
> >
> > At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan  wrote in 
> > 
> >> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund  wrote:
> >> > Hi,
> >> >
> >> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
> >> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas  
> >> >> wrote:
> >> >> [ lots of valuable discussion ]
> >> >
> >> > I think this patch clearly still is in the design stage, and has
> >> > received plenty feedback this CF.  I'll therefore move this to the next
> >> > commitfest.
> >>
> >> Does anyone have ideas on a way forward here? I don't, but then I
> >> haven't thought about it in detail in several months.
> >
> > Is the additional storage in metapage to store the current status
> > of vaccum is still unacceptable even if it can avoid useless
> > full-page scan on indexes especially for stable tables?
> >
> > Or, how about additional 1 bit in pg_stat_*_index to indicate
> > that the index *don't* require vacuum cleanup stage. (default
> > value causes cleanup)
> 
> You meant that "the next cycle" is the lazy_cleanup_index() function
> called by lazy_scan_heap()?

Both finally call btvacuumscan under a certain condition, but
what I meant by "the next cycle" is the lazy_cleanup_index call
in the next round of vacuum, since the abstraction layer (index
AM) isn't conscious of the details of btree.

> > index_bulk_delete (or ambulkdelete) returns the flag in
> > IndexBulkDeleteResult then lazy_scan_heap stores the flag in
> > stats and in the next cycle it is looked up to decide the
> > necessity of index cleanup.
> >
> 
> Could you elaborate about this? For example in btree index, the index
> cleanup skips to scan on the index scan if index_bulk_delete has been
> called during vacuuming because stats != NULL. So I think we don't
> need such a flag.

The flag works so that two successive index full scans don't
happen in a vacuum round. If any rows were deleted, the
immediately following btvacuumcleanup does nothing.

I think what you wanted to solve here was the problem that
index_vacuum_cleanup runs a full scan even if it ends with no
actual work, during manual or anti-wraparound vacuums.  (I'm
getting a bit confused on this..) It is caused by using the
pointer "stats" as the flag that instructs it to do that. If
stats-as-a-flag worked as expected, the GUC doesn't seem to be
required.
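
(By "stats-as-a-flag" I mean the pattern in btvacuumcleanup;
roughly the following, quoting from memory rather than verbatim:)

IndexBulkDeleteResult *
btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
	/*
	 * If btbulkdelete was already called in this vacuum round, it handed
	 * back a non-NULL stats and did the full scan there; skip it here.
	 */
	if (stats == NULL)
	{
		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
		btvacuumscan(info, stats, NULL, NULL, 0);	/* the full scan */
	}
	return stats;
}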

In addition to that, as Simon and Peter pointed out,
index_bulk_delete can leave not-fully-removed pages (so-called
half-dead pages and pages that are recyclable but not registered
in the FSM, AFAICS) in some cases, mainly because of the
RecentGlobalXmin interlock. In this case, just inhibiting the
cleanup scan by a threshold lets such dangling pages persist in
the index. (I couldn't make that many dangling pages, though..)

The first patch in the mail (*1) does that. It seems to have some
bugs, though..


Since the dangling pages persist until autovacuum decides to scan
the owning table again, we should run a vacuum round (or
index_vacuum_cleanup itself) even with no dead rows if we want
to clean up such pages within a certain period. The second patch
does that.
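
(Illustratively, the trigger side would be something of this
shape; index_cleanup_pending() is a made-up name for "a previous
round left unrecycled pages behind":)

	/* hypothetical addition to relation_needs_vacanalyze() in autovacuum.c */
	if (vactuples > vacthresh ||
		index_cleanup_pending(rel))		/* made-up helper, see above */
		*dovacuum = true;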


[*1] 
https://www.postgresql.org/message-id/20170921.174957.236914340.horiguchi.kyot...@lab.ntt.co.jp

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] analyzeCTE is too strict about typmods?

2017-09-21 Thread Kyotaro HORIGUCHI
At Thu, 03 Aug 2017 19:29:40 -0400, Tom Lane  wrote in 
<28993.1501802...@sss.pgh.pa.us>
> I wrote:
> > In short, therefore, it's looking to me like analyzeCTE() is wrong here.
> > It should allow the case where the recursive result has typmod -1 while
> > the non-recursive output column has some more-specific typmod, so long
> > as they match on type OID.  That would correspond to what we do in
> > regular non-recursive UNION situations.
> 
> Oh, scratch that.  I was thinking that we could simply relax the error
> check, but that really doesn't work at all.  The problem is that by now,
> we have probably already generated Vars referencing the outputs of the
> recursive CTE, and those will have the more-specific typmod, which is
> wrong in this scenario.  Usually that wouldn't matter too much, but
> there are cases where it would matter.
> 
> We could imagine dealing with this by re-parse-analyzing the recursive
> term using typmod -1 for the CTE output column, but I don't much want
> to go there.  It wouldn't be cheap, and I'm not quite sure it's guaranteed
> to converge anyway.

Agreed.

> What's seeming like an attractive compromise is to change the HINT
> to recommend manually coercing the recursive term, instead of the
> non-recursive one.  Adjusting the error cursor to point to that side
> might be a bit painful, but it's probably doable.
> 
> Thoughts?

I agree with the direction, but if I'm not missing anything,
transformSetOperationTree has enough information and we won't
get the expected pain there. (The movement itself might be the
pain, though..)

| ERROR:  recursive query "foo" column 1 has type numeric(7,3) in non-recursive 
term but type numeric overall
| LINE 4: select f1+1  from foo
|^
| HINT:  Cast the output of the recursive term to the type of the non-recursive 
term.

# The hint gets a bit longer..

By the way, a wrong collation still results in the previous hint,
but it won't be a problem.

| ERROR:  recursive query "foo" column 1 has collation "it_IT" in non-recursive 
term but collation "ja_JP" overall
| LINE 2: select a from bar
|^
| HINT:  Use the COLLATE clause to set the collation of the non-recursive term.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/parser/analyze.c
--- b/src/backend/parser/analyze.c
***
*** 42,47 
--- 42,49 
  #include "parser/parse_target.h"
  #include "parser/parsetree.h"
  #include "rewrite/rewriteManip.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
  #include "utils/rel.h"
  
  
***
*** 1796,1801  transformSetOperationTree(ParseState *pstate, SelectStmt *stmt,
--- 1798,1804 
  		  bool isTopLevel, List **targetlist)
  {
  	bool		isLeaf;
+ 	int			varattno;
  
  	Assert(stmt && IsA(stmt, SelectStmt));
  
***
*** 1980,1985  transformSetOperationTree(ParseState *pstate, SelectStmt *stmt,
--- 1983,1989 
  		op->colTypmods = NIL;
  		op->colCollations = NIL;
  		op->groupClauses = NIL;
+ 		varattno = 0;
  		forboth(ltl, ltargetlist, rtl, rtargetlist)
  		{
  			TargetEntry *ltle = (TargetEntry *) lfirst(ltl);
***
*** 2008,2013  transformSetOperationTree(ParseState *pstate, SelectStmt *stmt,
--- 2012,2019 
  			else
  rescoltypmod = -1;
  
+ 			varattno++;
+ 
  			/*
  			 * Verify the coercions are actually possible.  If not, we'd fail
  			 * later anyway, but we want to fail now while we have sufficient
***
*** 2110,2115  transformSetOperationTree(ParseState *pstate, SelectStmt *stmt,
--- 2116,2161 
  			}
  
  			/*
+ 			 * Verify that the previously determined output column types and
+ 			 * collations match what the query really produced.  We have to
+ 			 * check this because the recursive term could have overridden the
+ 			 * non-recursive term, and we don't have any easy way to fix that.
+ 			 */
+ 			if (isTopLevel &&
+ pstate->p_parent_cte &&
+ pstate->p_parent_cte->cterecursive)
+ 			{
+ Oid	lcolcoll = exprCollation((Node *)ltle->expr);
+ 
+ /*
+  * This might be somewhat confusing, but we suggest fixing the
+  * recursive term, since the non-recursive term may have the same
+  * type without a typmod.
+  */
+ if (rescoltype != lcoltype || rescoltypmod != lcoltypmod)
+ 	ereport(ERROR,
+ 			(errcode(ERRCODE_DATATYPE_MISMATCH),
+ 			 errmsg("recursive query \"%s\" column %d has type %s in non-recursive term but type %s overall",
+ 	pstate->p_parent_cte->ctename, varattno,
+ 	format_type_with_typemod(lcoltype,
+ 			 lcoltypmod),
+ 	format_type_with_typemod

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-09-21 Thread Kyotaro HORIGUCHI
At Thu, 21 Sep 2017 20:35:01 -0400, Robert Haas  wrote 
in 
> On Thu, Sep 21, 2017 at 8:16 PM, Kyotaro HORIGUCHI
>  wrote:
> > Though I don't see it's bug and agree that the message is not
> > proper, currently we can create hash indexes without no warning
> > on unlogged tables and it causes a problem with replication.
> 
> That's true, but I don't believe it's a sufficient reason to make a change.
> 
> Before 84aa8ba128a08e6fdebb2497c7a79ebf18093e12 (2014), we didn't
> issue a warning about hash indexes in any case whatsoever; we relied
> on people reading the documentation to find out about the limitations
> of hash indexes.  They can still do that in any cases that the warning
> doesn't adequately cover.  I really don't think it's worth kibitzing
> the cases where this message is emitted in released branches, or the
> text of the message, just as we didn't back-patch the message itself
> into older releases that are still supported.  We need a compelling
> reason to change things in stable branches, and the fact that a
> warning message added in 2014 doesn't cover every limitation of a
> pre-1996 hash index implementation is not an emergency.  Let's save
> back-patching for actual bugs, or we'll forever be fiddling with
> things in stable branches that would be better left alone.

Sorry for annoying you, and thank you. I agree with that, now
knowing that the reason is not precisely (3) (we already have a
WARNING for the problematic ops).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-09-21 Thread Kyotaro HORIGUCHI
At Thu, 21 Sep 2017 16:19:27 -0400, Robert Haas  wrote 
in <694cb417-ef2c-4760-863b-aec4530c2...@gmail.com>
> On Sep 21, 2017, at 8:59 AM, Amit Kapila  wrote:.
> > I think giving an error message like "hash indexes are not WAL-logged
> > and .." for unlogged tables doesn't seem like a good behavior.
> 
> +1. This seems like deliberate behavior, not a bug.

Though I don't see it as a bug and agree that the message is not
proper, currently we can create hash indexes without any warning
on unlogged tables, and that causes a problem with replication.

The point here is that if we leave the behavior (failure on the
standby) on the grounds that we see a warning at index creation,
a similar message ought to be emitted for unlogged tables.

Otherwise, our decision amounts to another option:

(4) Though we don't see a warning on hash index creation on
  unlogged tables, it has been that way all along and we don't
  mind.
 
regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-09-21 Thread Kyotaro HORIGUCHI
Hello,

Following a bit older thread.

At Tue, 18 Jul 2017 08:33:07 +0200, Michael Paquier  
wrote in 
> On Tue, Jul 18, 2017 at 4:18 AM, Amit Kapila  wrote:
> > Thanks.  Do you have any suggestion for back-branches?  As of now, it
> > fails badly with below kind of error:
> >
> > test=> SELECT * FROM t_u_hash;
> > ERROR:  could not open file "base/16384/16392": No such file or directory
> >
> > It is explained in another thread [3] where it has been found that the
> > reason for such an error is that hash indexes are not WAL logged prior
> > to 10.  Now, we can claim that we don't recommend hash indexes to be
> > used prior to 10 in production, so such an error is okay even if there
> > is no crash has happened in the system.
> 
> There are a couple of approaches:
> 1) Marking such indexes as invalid at recovery and log information
> about the switch done.
> 2) Error at creation of hash indexes on unlogged tables.
> 3) Leave it as-is, because there is already a WARNING at creation.
> I don't mind seeing 3) per the amount of work done lately to support
> WAL on hash indexes.

I overlooked that, but (3) is true only as long as the table is a
*logged* one.

postgres=# create table test (id int primary key, v text);
postgres=# create index on test using hash (id);
WARNING:  hash indexes are not WAL-logged and their use is discouraged

But not for unlogged tables.

postgres=# create unlogged table test (id int primary key, v text);
postgres=# create index on test using hash (id);
postgres=# (no warning)

And fails on promotion in the same way.

postgres=# select * from test;
ERROR:  could not open file "base/13324/16446": No such file or directory

indexcmds.c@965:503
>   if (strcmp(accessMethodName, "hash") == 0 &&
> RelationNeedsWAL(rel))
> ereport(WARNING,
> (errmsg("hash indexes are not WAL-logged and their use is 
> discouraged")));

Using !RelationUsesLocalBuffers instead fixes that, and the
attached patch does so for 9.6. I'm a bit unconfident about the
logical meaning of the macro here, but what it does fits.
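
For reference, the two macros differ exactly on unlogged relations
(paraphrasing utils/rel.h):

/*
 * Paraphrased from utils/rel.h: RelationNeedsWAL() is false for both
 * temp and unlogged relations, while RelationUsesLocalBuffers() is true
 * only for temp ones, so !RelationUsesLocalBuffers() covers permanent
 * *and* unlogged relations -- the set that should see the warning.
 */
#define RelationNeedsWAL(relation) \
	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)

#define RelationUsesLocalBuffers(relation) \
	((relation)->rd_rel->relpersistence == RELPERSISTENCE_TEMP)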

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
***
*** 501,507  DefineIndex(Oid relationId,
  	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
  
  	if (strcmp(accessMethodName, "hash") == 0 &&
! 		RelationNeedsWAL(rel))
  		ereport(WARNING,
  (errmsg("hash indexes are not WAL-logged and their use is discouraged")));
  
--- 501,507 
  	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
  
  	if (strcmp(accessMethodName, "hash") == 0 &&
! 		!RelationUsesLocalBuffers(rel))
  		ereport(WARNING,
  (errmsg("hash indexes are not WAL-logged and their use is discouraged")));
  



Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-21 Thread Kyotaro HORIGUCHI
Hi,

At Tue, 19 Sep 2017 16:55:38 -0700, Peter Geoghegan  wrote in 

> On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire  
> wrote:
> > Maybe this is looking at the problem from the wrong direction.
> >
> > Why can't the page be added to the FSM immediately and the check be
> > done at runtime when looking for a reusable page?
> >
> > Index FSMs currently store only 0 or 255, couldn't they store 128 for
> > half-recyclable pages and make the caller re-check reusability before
> > using it?
> 
> No, because it's impossible for them to know whether or not the page
> that their index scan just landed on recycled just a second ago, or
> was like this since before their xact began/snapshot was acquired.
> 
> For your reference, this RecentGlobalXmin interlock stuff is what
> Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
> Nodes". Seems pretty hard to do it any other way.

Anyway(:p) the attached first patch is a PoC for the
cleanup-state-in-stats method; it works only for btree. Some
LOG-level debugging messages are put in the patch to show how it
works.

The following steps make a not-recyclable page, but I'm not sure
they are general enough, and I couldn't generate half-dead pages.
The pg_sleep() calls in the following steps are inserted in order
to see the updated values in stats.


DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a int);
CREATE INDEX ON t1 (a);
INSERT INTO t1 (SELECT a FROM generate_series(0, 80) a);
DELETE FROM t1 WHERE a > 416700 AND a < 417250;
VACUUM t1;
DELETE FROM t1;
VACUUM t1;  -- 1 (or wait for autovacuum)
select pg_sleep(1);
VACUUM t1;  -- 2 (autovacuum doesn't work)
select pg_sleep(1);
VACUUM t1;  -- 3 (ditto)


The following logs are emited while the three VACUUMs are issued.

# VACUUM t1;  -- 1 (or wait for autovacuum)
 LOG:  btvacuumscan(t1_a_idx) result: deleted = 2185, notrecyclable = 1, 
hafldead = 0, no_cleanup_needed = false
 LOG:  Vacuum cleanup of index t1_a_idx is NOT skipped
 LOG:  btvacuumcleanup on index t1_a_idx is skipped since bulkdelete has run 
just before.
# VACUUM t1;  -- 2
 LOG:  Vacuum cleanup of index t1_a_idx is NOT skipped
 LOG:  btvacuumscan(t1_a_idx) result: deleted = 2192, notrecyclable = 0, 
hafldead = 0, no_cleanup_needed = true
# VACUUM t1;  -- 3
 LOG:  Vacuum cleanup of index t1_a_idx is skipped


VACUUM #1 leaves an unrecyclable page and requests the next cleanup.
VACUUM #2 leaves no unrecyclable page and inhibits the next cleanup.
VACUUM #3 (and ever after) executes no vacuum cleanup.

# I suppose it is a known issue that the cleanup cycles are not
# executed automatically unless new dead tuples are generated.

- Getting stats during initdb takes a very long time before
  failing. Since I couldn't find the right way to cope with this,
  I added a tentative function, pgstat_live(), which checks that
  the backend has a valid stats socket.

- The patch calls pg_stat_get_vac_cleanup_needed using
  DirectFunctionCall. It might be better wrapped.
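
  (As a sketch, the wrapped form would be something like this;
  pg_stat_get_vac_cleanup_needed is the tentative function from
  the PoC patch, so this is only an illustration:)

	/* thin wrapper hiding the DirectFunctionCall noise */
	static bool
	vac_cleanup_needed(Oid indexoid)
	{
		return DatumGetBool(
			DirectFunctionCall1(pg_stat_get_vac_cleanup_needed,
								ObjectIdGetDatum(indexoid)));
	}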


As a byproduct, this enables us to run extra autovacuum rounds
for index cleanup. With the second patch attached, autovacuum
works as follows.

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a int);
CREATE INDEX ON t1 (a);
INSERT INTO t1 (SELECT a FROM generate_series(0, 80) a);
DELETE FROM t1 WHERE a > 416700 AND a < 417250;
(autovacuum on t1 runs)
> LOG:  btvacuumscan(t1_a_idx) result: deleted = 0, notrecyclable = 0, hafldead 
> = 0, no_cleanup_needed = true
> LOG:  Vacuum cleanup of index t1_a_idx is skipped
> LOG:  automatic vacuum of table "postgres.public.t1": index scans: 1
DELETE FROM t1;
(autovacuum on t1 runs)
> LOG:  btvacuumscan(t1_a_idx) result: deleted = 2185, notrecyclable = 1, 
> hafldead = 0, no_cleanup_needed = false
> LOG:  Vacuum cleanup of index t1_a_idx is NOT skipped
> LOG:  btvacuumcleanup on index t1_a_idx is skipped since bulkdelete has run 
> just before.
> LOG:  automatic vacuum of table "postgres.public.t1": index scans: 1
(cleanup vacuum runs for t1 in the next autovac timing)
> LOG:  Vacuum cleanup of index t1_a_idx is NOT skipped
> LOG:  btvacuumscan(t1_a_idx) result: deleted = 2192, notrecyclable = 0, 
> hafldead = 0, no_cleanup_needed = true
> LOG:  automatic vacuum of table "postgres.public.t1": index scans: 0


Any suggestions are welcome.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***
*** 1110,1116  _bt_pagedel(Relation rel, Buffer buf)
  {
  	int			ndeleted = 0;
  	BlockNumber rightsib;
! 	bool		rightsib_empty;
  	Page		page;
  	BTPageOpaque opaque;
  
--- 1110,1116 
  {
  	int			ndeleted = 0;
  	BlockNumber rightsib;
! 	bool		rightsib_empty = false;
  	Page		page;
  	BTPageOpaque opaque;
  
*** a/src/backend/access/nbtree/nbtree.c
-

Re: [HACKERS] subscription worker signalling wal writer too much

2017-09-19 Thread Kyotaro HORIGUCHI
lWriterMain called XLogBackgroundFlush().
WB: # that XLogBackgroundFlush() skipped flushing in the WA case.


A: without this patch : 1745tps(wal_writer_flush_after = 1MB (default))

WB/WA =9/141100 : 0.9% of all XLogBackgroundFlush() did nothing.
(woke by SetLatch 141091 times)
TB/TA =27300/103000 : 26% of SetLatch is avoided. (12899 times called)


B: with this patch : 2041 tps (wal_writer_flush_after = 1MB (default))

WB/WA =46/ 1600 : 2.8% of all XLogBackgroundFlush() did nothing.
(woke by SetLatch 1553 times)
TB/TA =115822/118000: 97% of SetLatch is avoided. (4178 times called)


C: with this patch : 2033 tps (But wal_writer_flush_after = 0)

WB/WA =   117/1700  : 6.9% of all XLogBackgroundFlush() did nothing.
(woke by SetLatch 1582 times)
TB/TA =110615/118000: 93% of SetLatch is avoided. (7785 times called)


The condition for skipping SetLatch in the latest patch (v2)
looks reasonable.  And the effect is obvious, at least in this
test. The "needless" SetLatch calls are very effectively reduced,
actually improving performance. The increased number of bytes
flushed at once seems to be the main cause of the performance
improvement.

I'm a bit uneasy about having similar code on both sides
(XLogBackgroundFlush and XLogSetAsyncXactLSN), but +1 to this
from me.
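
(For context, the skip condition has this general shape, condensed
from the existing XLogSetAsyncXactLSN; the patch adjusts when the
early return fires, so take this as a sketch rather than the
patch's exact code:)

	if (!sleeping)
	{
		/* back off to the last completed page boundary */
		WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;

		/* if we have already flushed that far, no wakeup is needed */
		if (WriteRqstPtr <= LogwrtResult.Flush)
			return;
	}
	if (ProcGlobal->walwriterLatch)
		SetLatch(ProcGlobal->walwriterLatch);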

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-18 Thread Kyotaro HORIGUCHI
I was just looking the thread since it is found left alone for a
long time in the CF app.

At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan  wrote in 

> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund  wrote:
> > Hi,
> >
> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas  
> >> wrote:
> >> [ lots of valuable discussion ]
> >
> > I think this patch clearly still is in the design stage, and has
> > received plenty feedback this CF.  I'll therefore move this to the next
> > commitfest.
> 
> Does anyone have ideas on a way forward here? I don't, but then I
> haven't thought about it in detail in several months.

Is the additional storage in the metapage to store the current
status of vacuum still unacceptable, even if it can avoid useless
full-page scans on indexes, especially for stable tables?

Or, how about an additional bit in pg_stat_*_index to indicate
that the index *doesn't* require the vacuum cleanup stage?
(The default value causes cleanup.)

index_bulk_delete (or ambulkdelete) returns the flag in
IndexBulkDeleteResult, then lazy_scan_heap stores the flag in
stats, and in the next cycle it is looked up to decide the
necessity of index cleanup.
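
(Concretely, something like the following; the added field name is
just an example:)

/* genam.h, with a hypothetical field added for this proposal */
typedef struct IndexBulkDeleteResult
{
	BlockNumber num_pages;			/* pages remaining in index */
	BlockNumber pages_removed;		/* # removed during vacuum operation */
	bool		estimated_count;	/* num_index_tuples is an estimate */
	double		num_index_tuples;	/* tuples remaining */
	double		tuples_removed;		/* # removed during vacuum operation */
	BlockNumber pages_deleted;		/* # unused pages in index */
	BlockNumber pages_free;			/* # pages available for reuse */
	bool		no_cleanup_needed;	/* NEW: next cycle may skip cleanup */
} IndexBulkDeleteResult;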

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [PATCH] Improve geometric types

2017-09-18 Thread Kyotaro HORIGUCHI
At Fri, 15 Sep 2017 11:25:30 -0400, Robert Haas  wrote 
in 
> On Fri, Sep 15, 2017 at 4:23 AM, Kyotaro HORIGUCHI
>  wrote:
> > /* don't merge the following same functions with different types
> >into single macros so that double evaluation won't happen */
> >
> > Is it still too verbose?
> 
> Personally, I don't think such a comment has much value, but I am not
> going to spend a lot of time arguing about it.  It's really up to the
> eventual committer to decide, and that probably won't be me in this
> case.  My knowledge of the geometric types isn't great, and I am a tad
> busy breaking^Wimproving things I do understand.

OK, I agree with you.

# Though it is not an issue of geometric types :p

I'll withdraw the comment. Maybe someone will notice it when
reviewing such a patch.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT

2017-09-18 Thread Kyotaro HORIGUCHI
At Fri, 15 Sep 2017 15:36:26 +0900, Amit Langote 
 wrote in 

> Hi.
> 
> On 2017/08/28 18:28, Kyotaro HORIGUCHI wrote:
> > << the following is another topic >>
> > 
> >>>> BTW, in the partitioned table case, the parent is always locked first
> >>>> using an AccessExclusiveLock.  There are other considerations in that 
> >>>> case
> >>>> such as needing to recreate the partition descriptor upon termination of
> >>>> inheritance (both the DETACH PARTITION and also DROP TABLE child cases).
> >>>
> >>> Apart from the degree of concurrency, if we keep parent->children
> >>> order of locking, such recreation does not seem to be
> >>> needed. Maybe I'm missing something.
> >>
> >> Sorry to have introduced that topic in this thread, but I will try to
> >> explain anyway why things are the way they are currently:
> >>
> >> Once a table is no longer a partition of the parent (detached or dropped),
> >> we must make sure that the next commands in the transaction don't see it
> >> as one.  That information is currently present in the relcache
> >> (rd_partdesc), which is used by a few callers, most notably the
> >> tuple-routing code.  Next commands must recreate the entry so that the
> >> correct thing happens based on the updated information.  More precisely,
> >> we must invalidate the current entry.  RelationClearRelation() will either
> >> delete the entry or rebuild it.  If it's being referenced somewhere, it
> >> will be rebuilt.  The place holding the reference may also be looking at
> >> the content of rd_partdesc, which we don't require them to make a copy of,
> >> so we must preserve its content while other fields of RelationData are
> >> being read anew from the catalog.  We don't have to preserve it if there
> >> has been any change (partition added/dropped), but to make such a change
> >> one would need to take a strong enough lock on the relation (parent).  We
> >> assume here that anyone who wants to reference rd_partdesc takes at least
> >> AccessShareLock lock on the relation, and anyone who wants to change its
> >> content must take a lock that will conflict with it, so
> >> AccessExclusiveLock.  Note that in all of this, we are only talking about
> >> one relation, that is the parent, so parent -> child ordering of taking
> >> locks may be irrelevant.
> > 
> > I think I understand this, anyway DropInherit and DropPartition
> > is different-but-at-the-same-level operations so surely needs
> > amendment for drop/detach cases. Is there already a solution? Or
> > reproducing steps?
> 
> Sorry, I think I forgot to reply to this.  Since you seem to have chosen
> the other solution (checking that child is still a child), maybe this
> reply is a bit too late, but anyway.

I chose it at that time for the reason mentioned upthread, but
haven't decided which is better.

> DropInherit or NO INHERIT is seen primarily as changing a child table's
> (which is the target table of the command) property that it is no longer a
> child of the parent, so we lock the child table to block concurrent
> operations from considering it a child of parent anymore.  The fact that
> parent is locked after the child and with ShareUpdateExclusiveLock instead
> of AccessExclusiveLock, we observe this race condition when SELECTing from
> the parent.
> 
> DropPartition or DETACH PARTITION is seen primarily as changing the parent
> table's (which is the target table of the command) property that one of
> the partitions is removed, so we lock the parent.   Any concurrent
> operations that rely on the parent's relcache to get the partition list
> will wait for the session that is dropping the partition to finish, so
> that they get the fresh information from the relcache (or more importantly
> do not end up with information obtained from the relcache going invalid
> under them without notice).  Note that the lock on the partition/child is
> also present and it plays more or less the the same role as it does in the
> DropInherit case, but due to different order of locking, reported race
> condition does not occur between SELECT on partitioned table and
> DROP/DETACH PARTITION.

Thank you for the explanation. I understand that the difference
comes from which of the parent and children holds the information
about inheritance/partitioning. DROP child and ALTER child NO
INHERIT are (I think) the only two operations initiated from the
child side. The parent-locking patch results in different
solutions for s

Re: [HACKERS] [PATCH] Improve geometric types

2017-09-15 Thread Kyotaro HORIGUCHI
Hello, just one point on 0001.

The patch replaces pg_hypot with libc's hypot. The man page says
the following.

man 3 hypot
>   If the result overflows, a range error occurs, and the functions return
>   HUGE_VAL, HUGE_VALF, or HUGE_VALL, respectively.
..
>ERRORS
>   See math_error(7) for information on how to determine whether an  error
>   has occurred when calling these functions.
>
>   The following errors can occur:
>
>   Range error: result overflow
>  errno  is  set  to ERANGE.  An overflow floating-point exception
>  (FE_OVERFLOW) is raised.
>
>   Range error: result underflow
>  An underflow floating-point exception (FE_UNDERFLOW) is raised.
>
>  These functions do not set errno for this case.

So, the code seems to need some amendments to follow this
spec.
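
For instance, a caller that wants to detect the overflow case per
math_error(7) has to check both error channels, roughly:

#include <errno.h>
#include <fenv.h>
#include <math.h>
#include <stdio.h>

/* sketch: hypot() with the error checking math_error(7) describes */
static double
checked_hypot(double x, double y)
{
	double		result;

	errno = 0;
	feclearexcept(FE_ALL_EXCEPT);
	result = hypot(x, y);

	/*
	 * Overflow is reported via errno (ERANGE) and FE_OVERFLOW; underflow
	 * only via FE_UNDERFLOW, so checking errno alone is not enough.
	 */
	if (errno == ERANGE || fetestexcept(FE_OVERFLOW))
		fprintf(stderr, "hypot: result overflowed\n");
	return result;
}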

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [PATCH] Improve geometric types

2017-09-15 Thread Kyotaro HORIGUCHI
At Fri, 15 Sep 2017 17:23:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170915.172328.97446299.horiguchi.kyot...@lab.ntt.co.jp>
> At Thu, 14 Sep 2017 16:19:13 -0400, Robert Haas  wrote 
> in 
> > On Thu, Sep 14, 2017 at 3:33 AM, Kyotaro HORIGUCHI
> >  wrote:
> > > I recall a bit about the double-evaluation hazards. I think the
> > > functions needs a comment describing the reasons so that anyone
> > > kind won't try to merge them into a macro again.
> > 
> > I think we can count on PostgreSQL developers to understand the
> > advantages of an inline function over a macro.  Even if they don't,
> > the solution can't be to put a comment in every place where an inline
> > function is used explaining it.  That would be very repetitive.
> 
> Of course putting such a comment on all inline functions is
> silly. The point here is that many pairs of functions with
> exactly the same shape but handling different types are defined
> side by side. Such a situation tempts someone to merge them into
> single macros, as the previous author did there.
> 
> So a simple one like the following would be enough.
> 
> /* don't merge the following same functions with different types
>into single macros so that double evaluation won't happen */
> 
> Is it still too verbose?

That being said, I don't insist on it if Robert or others think
it needless.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [PATCH] Improve geometric types

2017-09-15 Thread Kyotaro HORIGUCHI
At Thu, 14 Sep 2017 16:19:13 -0400, Robert Haas  wrote 
in 
> On Thu, Sep 14, 2017 at 3:33 AM, Kyotaro HORIGUCHI
>  wrote:
> > I recall a bit about the double-evaluation hazards. I think the
> > functions needs a comment describing the reasons so that anyone
> > kind won't try to merge them into a macro again.
> 
> I think we can count on PostgreSQL developers to understand the
> advantages of an inline function over a macro.  Even if they don't,
> the solution can't be to put a comment in every place where an inline
> function is used explaining it.  That would be very repetitive.

Of course putting such a comment on all inline functions is
silly. The point here is that many pairs of functions with
exactly the same shape but handling different types are defined
side by side. Such a situation tempts someone to merge them into
single macros, as the previous author did there.
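
(The hazard is the classic one. With a geo_decls.h-style macro, an
argument with a side effect is evaluated twice; the inline-function
form evaluates it exactly once. next_coord() below is just a
made-up example of such an argument:)

#include <math.h>
#include <stdbool.h>

#define EPSILON 1.0E-06

/* macro version: FPeq(next_coord(), 0.0) calls next_coord() twice */
#define FPeq(A, B)	(fabs((A) - (B)) <= EPSILON)

/* inline-function version: each argument is evaluated exactly once */
static inline bool
float8_eq(double a, double b)
{
	return fabs(a - b) <= EPSILON;
}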

So a simple one like the following would be enough.

/* don't merge the following same functions with different types
   into single macros so that double evaluation won't happen */

Is it still too verbose?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT

2017-09-14 Thread Kyotaro HORIGUCHI
At Wed, 13 Sep 2017 20:20:48 -0400, Robert Haas  wrote 
in 
> On Mon, Jun 26, 2017 at 4:46 AM, Kyotaro HORIGUCHI
>  wrote:
> > The cause is that NO INHERIT doesn't take an exlusive lock on the
> > parent. This allows expand_inherited_rtentry to add the child
> > relation into appendrel after removal from the inheritance but
> > still exists.
> >
> > I see two ways to fix this.
> >
> > The first patch adds a recheck of inheritance relationship if the
> > corresponding attribute is missing in the child in
> > make_inh_translation_list(). The recheck is a bit complex but it
> > is not performed unless the sequence above is happen. It checks
> > duplication of relid (or cycles in inheritance) following
> > find_all_inheritors (but doing a bit different) but I'm not sure
> > it is really useful.
> >
> >
> > The second patch lets ALTER TABLE NO INHERIT to acquire locks on
> > the parent first.
> >
> > Since the latter has a larger impact on the current behavior and
> > we already treat "DROP TABLE child" case in the similar way, I
> > suppose that the first approach would be preferable.
> >
> >
> > Any comments or thoughts?
> 
> I agree that the second has less user impact, but I wonder if we can
> think that will really fix the bug completely, or more generally if it
> will fix all of the bugs that come from ALTER TABLE .. NO INHERIT not
> locking the parent.  I have a sneaking suspicion that may be wishful
> thinking.

Thanks for the comment.

The recheck patch prevents the planner from involving
just-detached children during inheritance expansion. Simultaneous
detaching of children doesn't affect the planning before that
point.

Once the planner (the SELECT side) gets a lock on the child, the
ALTER side cannot do anything until the SELECT finishes. If the
ALTER side wins, the SELECT side detects the detach immediately
after the lock is released and then excludes the children. No
problem will occur ever after. Even in the case where a child is
replaced with another table, it is no different from simple
detaching.

As a result, I think that the recheck patch covers every possible
problem caused by simultaneously detached children.

However, the parent-locking patch is far smaller and it doesn't
need such an explanation of its correctness. If another problem
occurs due to simultaneous detaching, it must be because the
necessary locks weren't taken in the right order.

The most significant reason for my decision on this patch was the
fact that the DROP child case has been resolved by rechecking
without parent locks; taking the parent lock is a kind of waste of
resources if the problem can be resolved without it. (And I saw
that the same solution is taken in at least several places.)

I don't think there's a noticeable difference in behavior
(excluding performance) for users between the two solutions.

Is there anyone who has an opinion on the point?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [PATCH] Improve geometric types

2017-09-14 Thread Kyotaro HORIGUCHI
 that
"%g" formatter requires up to 27 bytes (26 + terminating zero) so
it should use MINDOUBLEWIDTH regardless of the precision of the
value given to the formatter.

In addition to that, if we are deliberately using %g in float4out,
we can use float8out_internal instead of most of the stuff in
float4out.

| Datum
| float4out(PG_FUNCTION_ARGS)
| {
| float8 num = PG_GETARG_FLOAT4(0);
| 
| PG_RETURN_CSTRING(float8out_internal(num));
| }

> > ...
> > float4out@float.c:
> >>int  ndig = FLT_DIG + extra_float_digits;
> >>
> >>if (ndig < 1)
> >>   ndig = 1;
> >>
> >>len = snprintf(ascii, MINDOUBLEWIDTH + 1, "%+.*g", ndig, num);
> >> if (len > MINDOUBLEWIDTH + 1)
> >>{
> >>    ascii = (char *) repalloc(ascii, len);
> >>if (snprintf(ascii, len, "%+.*e", ndig, num) > len)
> >>   error(ERROR, "something wrong happens...");
> >> }
> >
> > I don't think the if part can be used so there would be no
> > performance degradation, I believe.
> 
> Wouldn't this change the output of the float datatypes?  That would be
> a backwards incompatible change.

It just reduces the unused part after the terminating \0, and we
won't get an incompatible result.

> > I'd like to pause here.
> 
> I will submit new versions after your are done with your review.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-13 Thread Kyotaro HORIGUCHI
At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170913.174239.25978735.horiguchi.kyot...@lab.ntt.co.jp>
> filterdiff seems to did something wrong..

# to did...

The patch was broken by filterdiff, so I'm sending a new patch
made directly with git format-patch. I confirmed that a build
completes with this applied.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 7086b5855080065f73de4d099cbaab09511f01fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem

---
 src/backend/access/heap/heapam.c| 113 +---
 src/backend/access/heap/pruneheap.c |   3 +-
 src/backend/access/heap/rewriteheap.c   |   4 +-
 src/backend/access/heap/visibilitymap.c |   3 +-
 src/backend/access/transam/xact.c   |   7 +
 src/backend/catalog/storage.c   | 318 +---
 src/backend/commands/copy.c |  13 +-
 src/backend/commands/createas.c |   9 +-
 src/backend/commands/matview.c  |   6 +-
 src/backend/commands/tablecmds.c|   8 +-
 src/backend/commands/vacuumlazy.c   |   6 +-
 src/backend/storage/buffer/bufmgr.c |  40 +++-
 src/backend/utils/cache/relcache.c  |  13 ++
 src/include/access/heapam.h |   8 +-
 src/include/catalog/storage.h   |   5 +-
 src/include/storage/bufmgr.h|   2 +
 src/include/utils/rel.h |   8 +
 17 files changed, 476 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d20f038..e40254d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *	  the POSTGRES heap access method used for all POSTGRES
  *	  relations.
  *
+ * WAL CONSIDERATIONS
+ *	  All heap operations are normally WAL-logged. but there are a few
+ *	  exceptions. Temporary and unlogged relations never need to be
+ *	  WAL-logged, but we can also skip WAL-logging for a table that was
+ *	  created in the same transaction, if we don't need WAL for PITR or
+ *	  WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *	  the file to disk at COMMIT instead.
+ *
+ *	  The same-relation optimization is not employed automatically on all
+ *	  updates to a table that was created in the same transaction, because
+ *	  for a small number of changes, it's cheaper to just create the WAL
+ *	  records than fsyncing() the whole relation at COMMIT. It is only
+ *	  worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *	  or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *	  operation; it will cause any subsequent updates to the table to skip
+ *	  WAL-logging, if possible, and cause the heap to be synced to disk at
+ *	  COMMIT.
+ *
+ *	  To make that work, all modifications to heap must use
+ *	  HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *	  for the given block.
+ *
  *-
  */
 #include "postgres.h"
@@ -56,6 +78,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -2373,12 +2396,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2409,6 +2426,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored.  But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+extern HTAB *pendingSyncs;
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			int options, BulkInsertState bistate)
@@ -2482,7 +2500,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+	if (BufferNeedsWAL(relation, buffer))
 	{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
@@ -2681,12 +2699,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	int			ndone;
 	char	   *scratch = NULL;
 	Page		page;
-	bool		needwal;
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relatio

Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-13 Thread Kyotaro HORIGUCHI
At Wed, 13 Sep 2017 15:05:31 +1200, Thomas Munro 
 wrote in 

> It doesn't compile for me, because your patch removed the definition
> of HEAP_INSERT_SKIP_WAL but hasn't removed that reference to it.
> 
> I'm not sure what happened.  Is it possible that your patch was not
> created by diffing against master?

It was created using filterdiff.

> git diff master --patience | grep options
...
> -   if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

but the line disappears from the output of the following command

> git diff master --patience | filterdiff --format=context | grep options

filterdiff seems to did something wrong..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands

2017-09-13 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 13 Sep 2017 17:28:20 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170913.172820.141647434.horiguchi.kyot...@lab.ntt.co.jp>
> The context exists there before the patch but anyway using the
> context as per-portal context that doesn't need freeing seems to
> result in memory leak.

It is released at the end of vacuum.
So it's no problem.
Sorry for the noise.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands

2017-09-13 Thread Kyotaro HORIGUCHI
At Wed, 13 Sep 2017 13:16:52 +0900, Michael Paquier  
wrote in 
> On Wed, Sep 13, 2017 at 1:13 PM, Kyotaro HORIGUCHI
>  wrote:
> > This patch creates a new memory context "Vacuum" under
> > PortalContext in vacuum.c, but AFAICS the current context there
> > is PortalHeapMemory, which has the same expected lifetime with
> > the new context (that is, a child of PotalContext and dropeed in
> > PortalDrop). On the other hand the PortalMemory's lifetime is not
> > PortalStart to PortaDrop but the backend lifetime (initialized in
> > InitPostgres).
> 
> Which patch are you looking to? This introduces no new memory context,
> be it in 0001 or 0002 in its last versions. I don't recall during the
> successive reviews seeing that pattern as well.

Sorry. The 0001 patch just uses the context. The context was
introduced by commit e415b469 and is used only for the buffer
access strategy object. I was confused by the following comment
in the patch.

| +  * Move our relation list to our special memory context so that we do
| +  * not lose it among our per-relation transactions.


The context existed there before the patch but, anyway, using the
context as a per-portal context that doesn't need freeing seems to
result in a memory leak.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Restricting maximum keep segments by repslots

2017-09-13 Thread Kyotaro HORIGUCHI
At Wed, 13 Sep 2017 11:43:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170913.114306.67844218.horiguchi.kyot...@lab.ntt.co.jp>
horiguchi.kyotaro> At Thu, 07 Sep 2017 21:59:56 +0900 (Tokyo Standard Time), 
Kyotaro HORIGUCHI  wrote in 
<20170907.215956.110216588.horiguchi.kyot...@lab.ntt.co.jp>
> > Hello,
> > 
> > At Thu, 07 Sep 2017 14:12:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> >  wrote in 
> > <20170907.141212.227032666.horiguchi.kyot...@lab.ntt.co.jp>
> > > > I would like a flag in pg_replication_slots, and possibly also a
> > > > numerical column that indicates how far away from the critical point
> > > > each slot is.  That would be great for a monitoring system.
> > > 
> > > Great! I'll do that right now.
> > 
> > Done.
> 
> The CF status of this patch turned into "Waiting on Author".
> This is because the second patch is posted separately from the
> first patch. I repost them together after rebasing to the current
> master.

Hmm. I was unconsciously careless about the regression tests
since the patch is in a tentative shape. This must pass the
regression tests..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 105,110  int			wal_level = WAL_LEVEL_MINIMAL;
--- 105,111 
  int			CommitDelay = 0;	/* precommit delay in microseconds */
  int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
  int			wal_retrieve_retry_interval = 5000;
+ int			max_slot_wal_keep_size_mb = 0;
  
  #ifdef WAL_DEBUG
  bool		XLOG_DEBUG = false;
***
*** 9365,9373  KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
--- 9366,9397 
  	if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
  	{
  		XLogSegNo	slotSegNo;
+ 		int			slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb);
  
  		XLByteToSeg(keep, slotSegNo);
  
+ 		/*
+ 		 * ignore slots if too many wal segments are kept.
+ 		 * max_slot_wal_keep_size is just accumulated on wal_keep_segments.
+ 		 */
+ 		if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno)
+ 		{
+ 			segno = segno - slotlimitsegs; /* must be positive */
+ 
+ 			/*
+ 			 * warn only if the checkpoint flushes the required segment.
+ 			 * we assume here that *logSegNo is calculated keep location.
+ 			 */
+ 			if (slotSegNo < *logSegNo)
+ ereport(WARNING,
+ 	(errmsg ("restart LSN of replication slots is ignored by checkpoint"),
+ 	 errdetail("Some replication slots have lost required WAL segments to continue by up to %ld segments.",
+ 	   (segno < *logSegNo ? segno : *logSegNo) - slotSegNo)));
+ 
+ 			/* emergency vent */
+ 			slotSegNo = segno;
+ 		}
+ 
  		if (slotSegNo <= 0)
  			segno = 1;
  		else if (slotSegNo < segno)
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***
*** 2371,2376  static struct config_int ConfigureNamesInt[] =
--- 2371,2387 
  	},
  
  	{
+ 		{"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
+ 			gettext_noop("Sets the maximum size of extra WALs kept by replication slots."),
+ 		 NULL,
+ 		 GUC_UNIT_MB
+ 		},
+ 		&max_slot_wal_keep_size_mb,
+ 		0, 0, INT_MAX,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING,
  			gettext_noop("Sets the maximum time to wait for WAL replication."),
  			NULL,
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***
*** 235,240 
--- 235,241 
  #max_wal_senders = 10		# max number of walsender processes
  # (change requires restart)
  #wal_keep_segments = 0		# in logfile segments, 16MB each; 0 disables
+ #max_slot_wal_keep_size = 0	# measured in bytes; 0 disables
  #wal_sender_timeout = 60s	# in milliseconds; 0 disables
  
  #max_replication_slots = 10	# max number of replication slots
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***
*** 97,102  extern bool reachedConsistency;
--- 97,103 
  extern int	min_wal_size_mb;
  extern int	max_wal_size_mb;
  extern int	wal_keep_segments;
+ extern int	max_slot_wal_keep_size_mb;
  extern int	XLOGbuffers;
  extern int	XLogArchiveTimeout;
  extern int	wal_retrieve_retry_interval;
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 9336,9341  CreateRestartPoint(int flags)
--- 9336,9420 
  }
  
  /*
+  * Check if the record on the given lsn will be preserved at the next
+  * checkpoint.
+  *
+  * Returns true if it will be preserved. If distance is given, the distance
+  * from origin to the beginning of the first segment kept at the next
+  * checkpoint. It means margin when this function returns true an

Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands

2017-09-12 Thread Kyotaro HORIGUCHI
Hello, I began to look at this. (But it seems almost ready for committer..)

At Wed, 13 Sep 2017 11:47:11 +0900, Michael Paquier  
wrote in 
> On Wed, Sep 13, 2017 at 12:31 AM, Bossart, Nathan  wrote:
> > Sorry for the spam.  I am re-sending these patches with modified names so 
> > that
> > the apply order is obvious to the new automated testing framework (and to
> > everybody else).
> 
> - * relid, if not InvalidOid, indicate the relation to process; otherwise,
> - * the RangeVar is used.  (The latter must always be passed, because it's
> - * used for error messages.)
> [...]
> +typedef struct VacuumRelation
> +{
> +   NodeTag  type;
> +   RangeVar*relation;  /* single table to process */
> +   List*va_cols;   /* list of column names, or NIL for all */
> +   Oid  oid;   /* corresponding OID (filled in by [auto]vacuum.c) */
> +} VacuumRelation;
> We lose a bit of information here. I think that it would be good to
> mention in the declaration of VacuumRelation that the RangeVar is used
> for error processing, and needs to be filled. I have complained about
> that upthread already, perhaps this has slipped away when rebasing.
> 
> +   int i = attnameAttNum(rel, col, false);
> +
> +   if (i != InvalidAttrNumber)
> +   continue;
> Nit: allocating "i" makes little sense here. You are not using it for
> any other checks.
> 
>  /*
> - * Build a list of Oids for each relation to be processed
> + * Determine the OID for each relation to be processed
>   *
>   * The list is built in vac_context so that it will survive across our
>   * per-relation transactions.
>   */
> -static List *
> -get_rel_oids(Oid relid, const RangeVar *vacrel)
> +static void
> +get_rel_oids(List **vacrels)
> Yeah, that's not completely correct either. This would be more like
> "Fill in the list of VacuumRelation entries with their corresponding
> OIDs, adding extra entries for partitioned tables".
> 
> Those are minor points. The patch seems to be in good shape, and
> passes all my tests, including some pgbench'ing to make sure that
> nothing goes weird. So I'll be happy to finally switch both patches to
> "ready for committer" once those minor points are addressed.

May I ask one question?

This patch creates a new memory context "Vacuum" under
PortalContext in vacuum.c, but AFAICS the current context there
is PortalHeapMemory, which has the same expected lifetime as the
new context (that is, a child of PortalContext and dropped in
PortalDrop). On the other hand, PortalMemory's lifetime is not
PortalStart to PortalDrop but the backend's lifetime (initialized
in InitPostgres).

>  /*
>   * Create special memory context for cross-transaction storage.
>   *
>   * Since it is a child of PortalContext, it will go away eventually even
>   * if we suffer an error; there's no need for special abort cleanup logic.
>   */
>  vac_context = AllocSetContextCreate(PortalContext,
>"Vacuum",
>ALLOCSET_DEFAULT_SIZES);

So this seems to work contrary to the expectation. Am I
missing something?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Restricting maximum keep segments by repslots

2017-09-12 Thread Kyotaro HORIGUCHI
At Thu, 07 Sep 2017 21:59:56 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170907.215956.110216588.horiguchi.kyot...@lab.ntt.co.jp>
> Hello,
> 
> At Thu, 07 Sep 2017 14:12:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20170907.141212.227032666.horiguchi.kyot...@lab.ntt.co.jp>
> > > I would like a flag in pg_replication_slots, and possibly also a
> > > numerical column that indicates how far away from the critical point
> > > each slot is.  That would be great for a monitoring system.
> > 
> > Great! I'll do that right now.
> 
> Done.

The CF status of this patch turned into "Waiting on Author".
This is because the second patch was posted separately from the
first patch. I am reposting them together after rebasing them onto
the current master.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 105,110  int			wal_level = WAL_LEVEL_MINIMAL;
--- 105,111 
  int			CommitDelay = 0;	/* precommit delay in microseconds */
  int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
  int			wal_retrieve_retry_interval = 5000;
+ int			max_slot_wal_keep_size_mb = 0;
  
  #ifdef WAL_DEBUG
  bool		XLOG_DEBUG = false;
***
*** 9365,9373  KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
--- 9366,9397 
  	if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
  	{
  		XLogSegNo	slotSegNo;
+ 		int			slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb);
  
  		XLByteToSeg(keep, slotSegNo);
  
+ 		/*
+ 		 * Ignore slots if too many WAL segments are kept.
+ 		 * max_slot_wal_keep_size is just added on top of wal_keep_segments.
+ 		 */
+ 		if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno)
+ 		{
+ 			segno = segno - slotlimitsegs; /* must be positive */
+ 
+ 			/*
+ 			 * Warn only if the checkpoint flushes the required segment.
+ 			 * We assume here that *logSegNo is the calculated keep location.
+ 			 */
+ 			if (slotSegNo < *logSegNo)
+ ereport(WARNING,
+ 	(errmsg ("restart LSN of replication slots is ignored by checkpoint"),
+ 	 errdetail("Some replication slots have lost required WAL segments to continue by up to %ld segments.",
+ 	   (segno < *logSegNo ? segno : *logSegNo) - slotSegNo)));
+ 
+ 			/* emergency vent */
+ 			slotSegNo = segno;
+ 		}
+ 
  		if (slotSegNo <= 0)
  			segno = 1;
  		else if (slotSegNo < segno)
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***
*** 2371,2376  static struct config_int ConfigureNamesInt[] =
--- 2371,2387 
  	},
  
  	{
+ 		{"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
+ 			gettext_noop("Sets the maximum size of extra WALs kept by replication slots."),
+ 		 NULL,
+ 		 GUC_UNIT_MB
+ 		},
+ 		&max_slot_wal_keep_size_mb,
+ 		0, 0, INT_MAX,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING,
  			gettext_noop("Sets the maximum time to wait for WAL replication."),
  			NULL,
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***
*** 235,240 
--- 235,241 
  #max_wal_senders = 10		# max number of walsender processes
  # (change requires restart)
  #wal_keep_segments = 0		# in logfile segments, 16MB each; 0 disables
+ #max_slot_wal_keep_size = 0	# in megabytes; 0 disables
  #wal_sender_timeout = 60s	# in milliseconds; 0 disables
  
  #max_replication_slots = 10	# max number of replication slots
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***
*** 97,102  extern bool reachedConsistency;
--- 97,103 
  extern int	min_wal_size_mb;
  extern int	max_wal_size_mb;
  extern int	wal_keep_segments;
+ extern int	max_slot_wal_keep_size_mb;
  extern int	XLOGbuffers;
  extern int	XLogArchiveTimeout;
  extern int	wal_retrieve_retry_interval;
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 9336,9341  CreateRestartPoint(int flags)
--- 9336,9420 
  }
  
  /*
+  * Check if the record on the given lsn will be preserved at the next
+  * checkpoint.
+  *
+  * Returns true if it will be preserved. If distance is given, the distance
+  * from origin to the beginning of the first segment kept at the next
+  * checkpoint. It means margin when this function returns true and gap of lost
+  * records when false.
+  *
+  * This function should return the consistent result with KeepLogSeg.
+  */
+ bool
+ GetMarginToSlotSegmentLimit(XLogRecPtr restartLSN, uint64 *distance)
+ {
+ 	XLogRecPtr currpos;
+ 	XLogRecPtr tailpos;
+ 	uint64 currSeg;
+ 	uint64 restByteInSeg;
+ 	uint64 restartSeg;
+ 	uint64 tailSeg;
+ 	uint64 keepSegs;
+ 
+ 	currpos = GetXLog

Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT

2017-09-12 Thread Kyotaro HORIGUCHI
At Mon, 28 Aug 2017 18:28:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170828.182807.98097766.horiguchi.kyot...@lab.ntt.co.jp>
> I'll add this to CF2017-09.

This patch took a deadly attack from commit 30833ba. I changed
the signature of expand_single_inheritance_child, in addition to
make_inh_translation_list, to notify the caller that the specified
child is no longer a child of the parent.

This passes the regular regression tests and fixes the problem.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/catalog/pg_inherits.c
--- b/src/backend/catalog/pg_inherits.c
***
*** 42,47  typedef struct SeenRelsEntry
--- 42,49 
  	ListCell   *numparents_cell;	/* corresponding list cell */
  } SeenRelsEntry;
  
+ static bool is_descendent_of_internal(Oid parentId, Oid childId,
+ 	  HTAB *seen_rels);
  /*
   * find_inheritance_children
   *
***
*** 400,402  typeInheritsFrom(Oid subclassTypeId, Oid superclassTypeId)
--- 402,472 
  
  	return result;
  }
+ 
+ /*
+  * Check if the child is really a descendent of the parent
+  */
+ bool
+ is_descendent_of(Oid parentId, Oid childId)
+ {
+ 	HTAB	   *seen_rels;
+ 	HASHCTL		ctl;
+ 	bool		ischild = false;
+ 
+ 	memset(&ctl, 0, sizeof(ctl));
+ 	ctl.keysize = sizeof(Oid);
+ 	ctl.entrysize = sizeof(Oid);
+ 	ctl.hcxt = CurrentMemoryContext;
+ 
+ 	seen_rels = hash_create("is_descendent_of temporary table",
+ 			32, /* start small and extend */
+ 			&ctl,
+ 			HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ 
+ 	ischild = is_descendent_of_internal(parentId, childId, seen_rels);
+ 
+ 	hash_destroy(seen_rels);
+ 
+ 	return ischild;
+ }
+ 
+ static bool
+ is_descendent_of_internal(Oid parentId, Oid childId, HTAB *seen_rels)
+ {
+ 	Relation	inhrel;
+ 	SysScanDesc scan;
+ 	ScanKeyData key[1];
+ 	bool		ischild = false;
+ 	HeapTuple	inheritsTuple;
+ 
+ 	inhrel = heap_open(InheritsRelationId, AccessShareLock);
+ 	ScanKeyInit(&key[0], Anum_pg_inherits_inhparent,
+ BTEqualStrategyNumber, F_OIDEQ,	ObjectIdGetDatum(parentId));
+ 	scan = systable_beginscan(inhrel, InheritsParentIndexId, true,
+ 			  NULL, 1, key);
+ 
+ 	while ((inheritsTuple = systable_getnext(scan)) != NULL)
+ 	{
+ 		bool found;
+ 		Oid inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
+ 
+ 		hash_search(seen_rels, &inhrelid, HASH_ENTER, &found);
+ 
+ 		/*
+ 		 * Recursively check into children. Although there can't theoretically
+ 		 * be any cycles in the inheritance graph, check for cycles anyway,
+ 		 * following find_all_inheritors.
+ 		 */
+ 		if (inhrelid == childId ||
+ 			(!found && is_descendent_of_internal(inhrelid, childId, seen_rels)))
+ 		{
+ 			ischild = true;
+ 			break;
+ 		}
+ 	}
+ 
+ 	systable_endscan(scan);
+ 	heap_close(inhrel, AccessShareLock);
+ 
+ 	return ischild;
+ }
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
***
*** 108,123  static void expand_partitioned_rtentry(PlannerInfo *root,
  		   LOCKMODE lockmode,
  		   bool *has_child, List **appinfos,
  		   List **partitioned_child_rels);
! static void expand_single_inheritance_child(PlannerInfo *root,
  RangeTblEntry *parentrte,
  Index parentRTindex, Relation parentrel,
  PlanRowMark *parentrc, Relation childrel,
  bool *has_child, List **appinfos,
  List **partitioned_child_rels);
! static void make_inh_translation_list(Relation oldrelation,
  		  Relation newrelation,
! 		  Index newvarno,
! 		  List **translated_vars);
  static Bitmapset *translate_col_privs(const Bitmapset *parent_privs,
  	List *translated_vars);
  static Node *adjust_appendrel_attrs_mutator(Node *node,
--- 108,122 
  		   LOCKMODE lockmode,
  		   bool *has_child, List **appinfos,
  		   List **partitioned_child_rels);
! static bool expand_single_inheritance_child(PlannerInfo *root,
  RangeTblEntry *parentrte,
  Index parentRTindex, Relation parentrel,
  PlanRowMark *parentrc, Relation childrel,
  bool *has_child, List **appinfos,
  List **partitioned_child_rels);
! static List *make_inh_translation_list(Relation oldrelation,
  		  Relation newrelation,
! 		  Index newvarno);
  static Bitmapset *translate_col_privs(const Bitmapset *parent_privs,
  	List *translated_vars);
  static Node *adjust_appendrel_attrs_mutator(Node *node,
***
*** 1476,1481  expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
--- 1475,1482 
  		 * in which they appear in the PartitionDesc.  But first, expand the
  		 * parent itself.
  		 */
+ 
+ 		/* ignore the return value since this doesn't exclude the parent */
  		expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
  		oldrelation,
  		&has_child, &appinfos,
***
*** 1497,1502  

Re: [HACKERS] Patches that don't apply or don't compile: 2017-09-12

2017-09-12 Thread Kyotaro HORIGUCHI
Hello. Aside from the discussion on the policy for using the
automated CI, it seems to have trouble applying patches.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/27450
>1363  heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in 
>this function)
>1364  if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

These lines show that the patch was applied only halfway.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-12 Thread Kyotaro HORIGUCHI
Hello. (Does this look like a top post?)

The CF status of this patch was turned into "Waiting on Author" by
the automated CI checking. However, I still don't get any error
even on the current master (69835bc) after make distclean. Also, my
working branch has no difference from the "problematic" patch other
than line shifts from patching. (So I haven't posted a new one.)

I looked at heapam.c:2502, the location the CI complains about, in
my working branch, and found different code there than what the
complaint shows.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/27450

1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in 
this function)
1364   if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

heapam.c:2502@work branch
2502:   /* XLOG stuff */
2503:   if (BufferNeedsWAL(relation, buffer))

So I conclude that the CI machinery failed to apply the patch
correctly.


At Thu, 13 Apr 2017 15:29:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170413.152935.100104316.horiguchi.kyot...@lab.ntt.co.jp>
> > > > I'll post new patch in this way soon.
> > > 
> > > Here it is.
> > 
> > It contained tariling space and missing test script.  This is the
> > correct patch.
> > 
> > > - Relation has new members no_pending_sync and pending_sync that
> > >   works as instant cache of an entry in pendingSync hash.
> > > 
> > > - Commit-time synchronizing is restored as Michael's patch.
> > > 
> > > - If relfilenode is replaced, pending_sync for the old node is
> > >   removed. Anyway this is ignored on abort and meaningless on
> > >   commit.
> > > 
> > > - TAP test is renamed to 012 since some new files have been added.
> > > 
> > > Accessing pending sync hash occured on every calling of
> > > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> > > accessing relations has pending sync.  Almost of them are
> > > eliminated as the result.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patches that don't apply or don't compile: 2017-09-12

2017-09-12 Thread Kyotaro HORIGUCHI
At Wed, 13 Sep 2017 08:13:08 +0900, Michael Paquier  
wrote in 
> On Wed, Sep 13, 2017 at 7:39 AM, Daniel Gustafsson  wrote:
> >> On 12 Sep 2017, at 23:54, Tomas Vondra  
> >> wrote:
> >> With all due respect, it's hard not to see this as a disruption of the
> >> current CF. I agree automating the patch processing is a worthwhile
> >> goal, but we're not there yet and it seems somewhat premature.
> >>
> >> Let me explain why I think so:
> >>
> >> (1) You just changed the status of 10-15% open patches. I'd expect
> >> things like this to be consulted with the CF manager, yet I don't see
> >> any comments from Daniel. Considering he's been at the Oslo PUG meetup
> >> today, I doubt he was watching hackers very closely.
> >
> > Correct, I’ve been travelling and running a meetup today so had missed this 
> > on
> > -hackers.
> 
> FWIW, I tend to think that the status of a patch ought to be changed
> by either a direct lookup at the patch itself or the author depending
> on how the discussion goes on, not an automatic processing. Or at
> least have more delay to allow people to object as some patches can be
> applied, but do not apply automatically because of naming issues.
> There are as well people sending test patches to allow Postgres to
> fail on purpose, for example see the replication slot issue not able
> to retain a past segment because the beginning of a record was not
> tracked correctly on the receiver-side. This can make the recovery
> tests fail, but we want them to fail to reproduce easily the wanted
> failure.

Agreed. I'd like to have a means to exclude some patches from
the CI check. Another issue I can think of is that we sometimes
pin, for a while, the commit on which a patch (set) applies,
especially for big patches, which would otherwise continuously
get broken by new commits.

> >> (2) You gave everyone about 4 hours to object, ending 3PM UTC, which
> >> excludes about one whole hemisphere where it's either too early or too
> >> late for people to respond. I'd say waiting for >24 hours would be more
> >> appropriate.
> >
> > Agreed.
> 
> Definitely. Any batch updates have to involve the CFM authorization at
> least, in this case Daniel.

+9(JST)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Improve geometric types

2017-09-12 Thread Kyotaro HORIGUCHI
Hello, sorry to be late to the party, but may I comment on this?

At Tue, 05 Sep 2017 13:18:12 +, Aleksander Alekseev 
 wrote in 
<20170905131812.18925.13551.p...@coridan.postgresql.org>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
> 
> LGTM.
> 
> The new status of this patch is: Ready for Committer

The first patch reconstructs the operators in layers. These
functions are called very frequently when used. Some functions are
already inlined in float.h, but some other static functions in
float.h can be, and would better be, inlined as well. *_internal,
point_construct, line_calculate_point and so on are among the
candidates.

You removed some DirectFunctionCalls to functions within the
same file, but other functions, e.g. poly_center and on_sl,
remain in that style. The function called from the former seems
large enough, but the latter calls such a small function that it
could be inlined. Would you like to make some additional
functions use plain C calls (instead of DirectFunctionCall) and
inline them?
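
To illustrate what I mean (a sketch only; the helper name and its
body are mine, not from the patch):

/* before: calling an in-file neighbor through fmgr */
dist = DatumGetFloat8(DirectFunctionCall2(point_distance,
										  PointPGetDatum(p1),
										  PointPGetDatum(p2)));

/* after: a static inline helper, callable directly */
static inline float8
point_distance_internal(Point *pt1, Point *pt2)
{
	return HYPOT(pt1->x - pt2->x, pt1->y - pt2->y);
}
...
dist = point_distance_internal(p1, p2);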

This is not a fault of this patch, but some functions like on_pb
seem to be missing a comment describing what they do. Would you
like to add some?


In the second patch, the additional inclusion of fmgrprotos.h in
btree_gin.c seems needless. Some float[48] features were macros
so that they could share the same expressions between float4 and
float8. They still seem to share exactly the same expressions in
float.h. Is there any reason for converting them into typed
inline functions?

In float.h, MAXDOUBLEWIDTH is reduced from 500 to 128, but the
exponent of a double goes up to 308, so that doesn't seem
sufficient. On the other hand, we won't use non-scientific
notation for extremely large numbers, and that requires (perhaps)
up to 26 bytes in such cases. In the source code, most call sites
use "%e" and one of them uses "%g". %e always takes the format of
"-1.(17 digits)e+308", so it would be less than 26
characters.

=# set extra_float_digits to 3;
=# select -1.221423424320453e308::float8;
 ?column?  
---
 -1.22142342432045302e+308

man printf: (linux)
> Style e is used if the exponent from its conversion is less than
> -4 or greater than or equal to the precision.

So we should be safe to have a buffer with 26 byte length and 500
bytes will apparently too large and even 128 will be too loose in
most cases. So how about something like the following?

#define MINDOUBLEWIDTH 32
...
float4out@float.c:
> int		ndig = FLT_DIG + extra_float_digits;
> int		len;
>
> if (ndig < 1)
> 	ndig = 1;
>
> /* snprintf reports the length the output would have needed */
> len = snprintf(ascii, MINDOUBLEWIDTH + 1, "%+.*g", ndig, num);
> if (len >= MINDOUBLEWIDTH + 1)
> {
> 	/* truncated; reallocate with room for len characters plus the NUL */
> 	ascii = (char *) repalloc(ascii, len + 1);
> 	if (snprintf(ascii, len + 1, "%+.*e", ndig, num) > len)
> 		elog(ERROR, "unexpected snprintf return value");
> }

I don't think the if branch will be taken in practice, so there
would be no performance degradation, I believe.



I'd like to pause here.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Walsender timeouts and large transactions

2017-09-12 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 06 Sep 2017 13:46:16 +, Yura Sokolov  
wrote in <20170906134616.18925.88390.p...@coridan.postgresql.org>
> I've changed to "need review" to gain more attention from other.

I understand that the problem here is that a too-fast network
prevents the walsender from sending replies.

In physical replication, WAL records are sent as soon as they are
written, and the timeout is handled in the topmost loop in
WalSndLoop. In logical replication, data is sent all at once at
commit time in most cases, so it can take a long time in
ReorderBufferCommit without returning to WalSndLoop (or even to
XLogSendLogical).

One problem here is that WalSndWriteData waits for the arrival of
the next *WAL record* in the slow path, because it is called at
the cue of the ReorderBuffer* functions (mainly *Commit),
independently of WAL insertion. This is, I think, the root cause
of this problem.

On the other hand, it ought to sleep when the network is stalled,
in other words, when data to send remains after a flush. We don't
have a means to be signalled when the socket queue gets room for
more bytes. However, I suppose that such a slow network allows us
to sleep for several, or several tens of, milliseconds. Or, if we
could know how many bytes pq_flush_if_writable() pushed, it would
be enough to wait only when the function returns having pushed
nothing.

As a result, I think that the functions should be modified as
follows.

- Forcing the slow path if more than half of a ping period has
  elapsed is right. (GetCurrentTimestamp is required anyway.)

- The slow path should not sleep waiting on the latch. It should
  just pg_usleep() for maybe 1-10 ms.

- We should go to the fast path just after a keepalive or response
  message has been sent. In other words, the "if (now <" block
  should be inside the "for (;;)" loop. This avoids needless runs
  of the slow path.


It could be refactored like the following.

  prepare the send buffer;

  for (;;)
  {
      now = GetCurrentTimestamp();
      if (now < ...)
      {
          fast-path
      }
      else
      {
          slow-path
      }
      return if finished;
      sleep for 1ms?
  }
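
A concrete shape of that loop might be the following (just my
sketch, not tested; the path bodies are elided):

	for (;;)
	{
		TimestampTz now = GetCurrentTimestamp();

		if (!TimestampDifferenceExceeds(last_reply_timestamp, now,
										wal_sender_timeout / 2))
		{
			/* fast path: just try to push out the pending data */
		}
		else
		{
			/* slow path: send keepalive if needed, check for timeout */
		}

		if (!pq_is_send_pending())
			break;				/* everything flushed, we are done */

		pg_usleep(1000L);		/* 1 ms */
	}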


What do you think about this?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-11 Thread Kyotaro HORIGUCHI
Hello,

At Fri, 08 Sep 2017 16:30:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170908.163001.53230385.horiguchi.kyot...@lab.ntt.co.jp>
> > >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> > >> STATEMENT:  ANALYZE;
> > >> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> > >> = 0x0, no_pending_sync = 0
> > >> 
> > >> -   lsn = XLogInsert(RM_SMGR_ID,
> > >> -XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> > >> +   rel->no_pending_sync= false;
> > >> +   rel->pending_sync = pending;
> > >> +   }
> > >> 
> > >> It seems to me that those flags and the pending_sync data should be
> > >> kept in the context of backend process and not be part of the Relation
> > >> data...
> > > 
> > > I understand that the context of "backend process" means
> > > storage.c local. I don't mind the context on which the data is,
> > > but I found only there that can get rid of frequent hash
> > > searching. For pending deletions, just appending to a list is
> > > enough and costs almost nothing, on the other hand pendig syncs
> > > are required to be referenced, sometimes very frequently.
> > > 
> > >> +void
> > >> +RecordPendingSync(Relation rel)
> > >> I don't think that I agree that this should be part of relcache.c. The
> > >> syncs are tracked should be tracked out of the relation context.
> > > 
> > > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > > duplicate name). I think it is a kind of bond between smgr and
> > > relation.
> > > 
> > >> Seeing how invasive this change is, I would also advocate for this
> > >> patch as only being a HEAD-only change, not many people are
> > >> complaining about this optimization of TRUNCATE missing when wal_level
> > >> = minimal, and this needs a very careful review.
> > > 
> > > Agreed.
> > > 
> > >> Should I code something? Or Horiguchi-san, would you take care of it?
> > >> The previous crash I saw has been taken care of, but it's been really
> > >> some time since I looked at this patch...
> > > 
> > > My point is hash-search on every tuple insertion should be evaded
> > > even if it happens rearely. Once it was a bit apart from your
> > > original patch, but in the latest patch the significant part
> > > (pending-sync hash) is revived from the original one.
> > 
> > This patch has followed along since CF 2016-03, do we think we can reach a
> > conclusion in this CF?  It was marked as "Waiting on Author”, based on
> > developments since in this thread, I’ve changed it back to “Needs Review”
> > again.
> 
> I managed to reload its context into my head. It doesn't apply to
> the current master and needs some amendment. I'm going to work on
> this.

Rebased and slightly modified.

Michael's latest patch, on which this patch is piggybacking, seems
to work perfectly. The motive of my addition is to avoid the
frequent hash access (specifically, I think, per tuple
modification) that occurs while pending syncs exist. The hash
contains at least six entries.

The attached patch emits extra log messages, to be removed in the
final shape, showing how much the addition reduces the hash
access. As a basis for determining the worthiness of the
additional mechanism, I'll show an example set of queries below.

In the log messages, "r" is the relation OID, "b" is the buffer
number, and "hash" is the pointer to the backend-global hash table
for pending syncs. "ent" is the entry in the hash that belongs to
the relation, and "neg" is a flag indicating that the existing
pending sync hash doesn't have an entry for the relation.

=# set log_min_messages to debug2;
=# begin;
=# create table test1(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = (nil), ent=(nil), neg = 0
# relid=2608 buf=55, hash has not been created

=# insert into test1 values ('inserted row');
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = (nil), ent=(nil), neg = 0
# relid=24807, first buffer, hash has not been created

=# copy test1 from '//copy_data.txt';
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = 0x171de00, ent=0x171f390, neg = 0
# hash created, pending sync entry linked, no longer needs hash access
# (repeats for the number of buffers)
COPY 200

=# create table test3(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = 0x171de0

Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-08 Thread Kyotaro HORIGUCHI
Thank you for your notification.

At Tue, 5 Sep 2017 12:05:01 +0200, Daniel Gustafsson  wrote in 

> > On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI 
> >  wrote:
> > 
> > At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier 
> >  wrote in 
> > 
> >> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
> >>  wrote:
> >>> Sorry, what I have just sent was broken.
> >> 
> >> You can use PROVE_TESTS when running make check to select a subset of
> >> tests you want to run. I use that all the time when working on patches
> >> dedicated to certain code paths.
> > 
> > Thank you for the information. Removing unwanted test scripts
> > from t/ directories was annoyance. This makes me happy.
> > 
> >>>> - Relation has new members no_pending_sync and pending_sync that
> >>>>  works as instant cache of an entry in pendingSync hash.
> >>>> - Commit-time synchronizing is restored as Michael's patch.
> >>>> - If relfilenode is replaced, pending_sync for the old node is
> >>>>  removed. Anyway this is ignored on abort and meaningless on
> >>>>  commit.
> >>>> - TAP test is renamed to 012 since some new files have been added.
> >>>> 
> >>>> Accessing pending sync hash occurred on every calling of
> >>>> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> >>>> accessing relations has pending sync.  Almost of them are
> >>>> eliminated as the result.
> >> 
> >> Did you actually test this patch? One of the logs added makes the
> >> tests a long time to run:
> > 
> > Maybe this patch requires make clean since it extends the
> > structure RelationData. (Perhaps I saw the same trouble.)
> > 
> >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> >> STATEMENT:  ANALYZE;
> >> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> >> = 0x0, no_pending_sync = 0
> >> 
> >> -   lsn = XLogInsert(RM_SMGR_ID,
> >> -XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> >> +   rel->no_pending_sync= false;
> >> +   rel->pending_sync = pending;
> >> +   }
> >> 
> >> It seems to me that those flags and the pending_sync data should be
> >> kept in the context of backend process and not be part of the Relation
> >> data...
> > 
> > I understand that the context of "backend process" means
> > storage.c local. I don't mind the context on which the data is,
> > but I found only there that can get rid of frequent hash
> > searching. For pending deletions, just appending to a list is
> > enough and costs almost nothing, on the other hand pendig syncs
> > are required to be referenced, sometimes very frequently.
> > 
> >> +void
> >> +RecordPendingSync(Relation rel)
> >> I don't think that I agree that this should be part of relcache.c. The
> >> syncs are tracked should be tracked out of the relation context.
> > 
> > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > duplicate name). I think it is a kind of bond between smgr and
> > relation.
> > 
> >> Seeing how invasive this change is, I would also advocate for this
> >> patch as only being a HEAD-only change, not many people are
> >> complaining about this optimization of TRUNCATE missing when wal_level
> >> = minimal, and this needs a very careful review.
> > 
> > Agreed.
> > 
> >> Should I code something? Or Horiguchi-san, would you take care of it?
> >> The previous crash I saw has been taken care of, but it's been really
> >> some time since I looked at this patch...
> > 
> > My point is hash-search on every tuple insertion should be evaded
> > even if it happens rearely. Once it was a bit apart from your
> > original patch, but in the latest patch the significant part
> > (pending-sync hash) is revived from the original one.
> 
> This patch has followed along since CF 2016-03, do we think we can reach a
> conclusion in this CF?  It was marked as "Waiting on Author”, based on
> developments since in this thread, I’ve changed it back to “Needs Review”
> again.

I managed to reload its context into my head. It doesn't apply to
the current master and needs some amendment. I'm going to work on
this.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Restricting maximum keep segments by repslots

2017-09-07 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 07 Sep 2017 14:12:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170907.141212.227032666.horiguchi.kyot...@lab.ntt.co.jp>
> > I would like a flag in pg_replication_slots, and possibly also a
> > numerical column that indicates how far away from the critical point
> > each slot is.  That would be great for a monitoring system.
> 
> Great! I'll do that right now.

Done.

In the attached patch, on top of the previous patch, I added two
columns to pg_replication_slots, "live" and "distance". The first
indicates whether the slot will stay "live" after the next
checkpoint. The second shows how many bytes the checkpoint LSN can
advance before the slot "dies", or how many bytes the slot has
lost after "death".


Setting wal_keep_segments = 1 and max_slot_wal_keep_size = 16MB.

=# select slot_name, restart_lsn, pg_current_wal_lsn(), live, distance from 
pg_replication_slots;

slot_name | restart_lsn | pg_current_wal_lsn | live | distance  
---+-++--+---
 s1| 0/162D388   | 0/162D3C0  | t| 0/29D2CE8

This shows that the checkpoint can advance 0x29d2ce8 bytes before
the slot dies, even if the connection stalls.

 s1| 0/4001180   | 0/6FFF2B8  | t| 0/DB8

Just before the slot loses sync.

 s1| 0/4001180   | 0/70008A8  | f| 0/FFEE80

The checkpoint after this removes some required segments.

2017-09-07 19:04:07.677 JST [13720] WARNING:  restart LSN of replication slots 
is ignored by checkpoint
2017-09-07 19:04:07.677 JST [13720] DETAIL:  Some replication slots have lost 
required WAL segments to continue by up to 1 segments.

If max_slot_wal_keep_size is not set (0), live is always true and
distance is NULL.

slot_name | restart_lsn | pg_current_wal_lsn | live | distance  
---+-++--+---
 s1| 0/4001180   | 0/73117A8  | t| 



- The names (and the contents) of the new columns are arguable.

- The pg_replication_slots view takes the LWLock on ControlFile
  and the spinlock on XLogCtl for every slot, but it seems
  difficult to reduce that.

- distance seems to mistakenly become 0/0 under certain
  conditions.

- The result seems almost right, but a more precise check is
  needed. (Anyway, it cannot be perfectly exact.)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From ac47e250cf88b1279556e27c33e9f29806fdc04d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 7 Sep 2017 19:13:22 +0900
Subject: [PATCH 2/2] Add monitoring aid for max_replication_slots.

Adds two columns "live" and "distance" to pg_replication_slots.
With max_slot_wal_keep_size set, long-disconnected slots may lose sync.
The two columns show how long a slot can live on, or how many bytes a
slot has lost, when max_slot_wal_keep_size is set.
---
 src/backend/access/transam/xlog.c| 80 
 src/backend/catalog/system_views.sql |  4 +-
 src/backend/replication/slotfuncs.c  | 16 +++-
 src/include/access/xlog.h|  1 +
 src/include/catalog/pg_proc.h|  2 +-
 5 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ae70d7d..c4c8307 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9324,6 +9324,86 @@ CreateRestartPoint(int flags)
 }
 
 /*
+ * Check if the record on the given lsn will be preserved at the next
+ * checkpoint.
+ *
+ * Returns true if it will be preserved. If distance is given, the distance
+ * from origin to the beginning of the first segment kept at the next
+ * checkpoint. It means margin when this function returns true and gap of lost
+ * records when false.
+ *
+ * This function should return the consistent result with KeepLogSeg.
+ */
+bool
+GetMarginToSlotSegmentLimit(XLogRecPtr restartLSN, uint64 *distance)
+{
+	XLogRecPtr currpos;
+	XLogRecPtr tailpos;
+	uint64 currSeg;
+	uint64 restByteInSeg;
+	uint64 restartSeg;
+	uint64 tailSeg;
+	uint64 keepSegs;
+
+	currpos = GetXLogWriteRecPtr();
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	tailpos = ControlFile->checkPointCopy.redo;
+	LWLockRelease(ControlFileLock);
+
+	/* Move the pointer to the beginning of the segment */
+	XLByteToSeg(currpos, currSeg);
+	XLByteToSeg(restartLSN, restartSeg);
+	XLByteToSeg(tailpos, tailSeg);
+	restByteInSeg = 0;
+
+	Assert(wal_keep_segments >= 0);
+	Assert(max_slot_wal_keep_size_mb >= 0);
+
+	/*
+	 * WAL are removed by the unit of segment.
+	 */
+	keepSegs = wal_keep_segments + ConvertToXSegs(max_slot_wal_keep_size_mb);
+
+	/*
+	 * If the latest checkpoint's redo point is older than the current head
+	 * minus keep segments, the next checkpoint keeps the redo point's
+	 * segment. Elsewise use current head minus number of segments to keep.
+	 */
+	if (cur

Re: [HACKERS] Restricting maximum keep segments by repslots

2017-09-06 Thread Kyotaro HORIGUCHI
Hello,

At Fri, 1 Sep 2017 23:49:21 -0400, Peter Eisentraut 
 wrote in 
<751e09c4-93e0-de57-edd2-e64c4950f...@2ndquadrant.com>
> I'm still concerned about how the critical situation is handled.  Your
> patch just prints a warning to the log and then goes on -- doing what?
> 
> The warning rolls off the log, and then you have no idea what happened,
> or how to recover.

The victims should be complaining in their log files, but, yes, I
must admit that it extremely resembles /dev/null. And the
catastrophe comes suddenly.

> I would like a flag in pg_replication_slots, and possibly also a
> numerical column that indicates how far away from the critical point
> each slot is.  That would be great for a monitoring system.

Great! I'll do that right now.

> -- 
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 

Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-09-06 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund  wrote in 
<20170906192353.ufp2dq7wm5fd6...@alap3.anarazel.de>
> On 2017-09-06 17:36:02 +0900, Kyotaro HORIGUCHI wrote:
> > The problem is that the current ReadRecord needs the first one of
> > a series of continuation records from the same source with the
> > other part, the master in the case.
> 
> What's the problem with that?  We can easily keep track of the beginning
> of a record, and only confirm the address before that.

After a failure while reading a record locally, ReadRecord tries
streaming to read from the beginning of the record, which is not
on the master, then retries locally, and so on. This loops
forever.

> > A (or the) solution closed in the standby side is allowing to
> > read a seris of continuation records from muliple sources.
> 
> I'm not following. All we need to use is the beginning of the relevant
> records, that's easy enough to keep track of. We don't need to read the
> WAL or anything.

The beginning is already tracked, and there is nothing more to do.

I reconsidered that approach and found that it doesn't need such
destructive refactoring.

The first *problem* was that WaitForWALToBecomeAvailable requests
the beginning of a record, which is not on the page the function
has been told to fetch. While tliRecPtr is still required to
determine the TLI to request, the function should request RecPtr
to be streamed.

The rest is to let XLogPageRead retry other sources immediately.
To do this I made ValidXLogPageHeader in xlogreader.c public (and
renamed it to XLogReaderValidatePageHeader).

The attached patch fixes the problem and passes the recovery
tests. However, a test for this problem is not added. It would
need to go to the last page in a segment, put a record that
continues into the next segment, and then kill the standby after
it has received the previous segment but before it has received
the whole record.
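
As for emitting such a record on purpose, one convenient way might
be a single oversized logical message (an untested sketch, assuming
16MB segments; any record larger than a segment should do):

=# select pg_logical_emit_message(false, 'filler',
       repeat('x', 20 * 1024 * 1024));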

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 8932a390bd3d1acfe5722bc62f42fc7e381ca617 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 7 Sep 2017 12:14:55 +0900
Subject: [PATCH 1/2] Allow switch WAL source midst of record.

The current recovery machinery assumes that the whole of a record is
available from a single source. This prevents a standby from restarting
under a certain condition. This patch allows switching the source while
reading a series of continuation records.
---
 src/backend/access/transam/xlog.c   |  7 ++-
 src/backend/access/transam/xlogreader.c | 12 +---
 src/include/access/xlogreader.h |  5 +
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index df4843f..eef3a97 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11566,6 +11566,11 @@ retry:
 	Assert(reqLen <= readLen);
 
 	*readTLI = curFileTLI;
+
+	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
+	  (XLogPageHeader) readBuf))
+		goto next_record_is_invalid;
+
 	return readLen;
 
 next_record_is_invalid:
@@ -11700,7 +11705,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		}
 		else
 		{
-			ptr = tliRecPtr;
+			ptr = RecPtr;
 			tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
 
 			if (curFileTLI > 0 && tli < curFileTLI)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 0781a7b..aa05e3f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -27,8 +27,6 @@
 
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 
-static bool ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
-	XLogPageHeader hdr);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -545,7 +543,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		hdr = (XLogPageHeader) state->readBuf;
 
-		if (!ValidXLogPageHeader(state, targetSegmentPtr, hdr))
+		if (!XLogReaderValidatePageHeader(state, targetSegmentPtr, hdr))
 			goto err;
 	}
 
@@ -582,7 +580,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	/*
 	 * Now that we know we have the full header, validate it.
 	 */
-	if (!ValidXLogPageHeader(state, pageptr, hdr))
+	if (!XLogReaderValidatePageHeader(state, pageptr, hdr))
 		goto err;
 
 	/* update read state information */
@@ -709,9 +707,9 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 /*
  * Validate a page header
  */
-static bool
-ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
-	XLogPageHeader hdr)
+bool
+XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recp

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-09-06 Thread Kyotaro HORIGUCHI
Hi,

At Mon, 4 Sep 2017 17:17:19 +0900, Michael Paquier  
wrote in 
> On Mon, Sep 4, 2017 at 4:04 PM, Andres Freund  wrote:
> > I've not read through the thread, but this seems like the wrong approach
> > to me. The receiving side should use a correct value, instead of putting
> > this complexity on the sender's side.
> 
> Yes I agree with that. The current patch gives me a bad feeling to be
> honest with the way it does things..

The problem is that the current ReadRecord needs the first one of
a series of continuation records to come from the same source as
the other parts, the master in this case.

A (or the) solution confined to the standby side is to allow
reading a series of continuation records from multiple sources:
in this case, the first part from the standby's pg_wal and the
second part from the master via streaming replication. ReadRecord
needed refactoring, (it seems to me) breaking the concept of the
XLogReader plug-in system, to accomplish this behavior.

If that is preferable to you, I'll retry that. Hints at other
solutions are also welcome.

Are there any suggestions?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protect syscache from bloating with negative cache entries

2017-09-05 Thread Kyotaro HORIGUCHI
Thank you for the comment.

At Mon, 28 Aug 2017 21:31:58 -0400, Robert Haas  wrote 
in 
> On Mon, Aug 28, 2017 at 5:24 AM, Kyotaro HORIGUCHI
>  wrote:
> > This patch have had interferences from several commits after the
> > last submission. I amended this patch to follow them (up to
> > f97c55c), removed an unnecessary branch and edited some comments.
> 
> I think the core problem for this patch is that there's no consensus
> on what approach to take.  Until that somehow gets sorted out, I think
> this isn't going to make any progress.  Unfortunately, I don't have a
> clear idea what sort of solution everybody could tolerate.
> 
> I still think that some kind of slow-expire behavior -- like a clock
> hand that hits each backend every 10 minutes and expires entries not
> used since the last hit -- is actually pretty sensible.  It ensures
> that idle or long-running backends don't accumulate infinite bloat
> while still allowing the cache to grow large enough for good
> performance when all entries are being regularly used.  But Tom
> doesn't like it.  Other approaches were also discussed; none of them
> seem like an obvious slam-dunk.

I suppose that it would slow intermittent lookups of non-existent
objects. I have tried a slightly different thing: removing entries
by 'age', preserving a specified number (or a ratio to live
entries) of younger negative entries. The problem with that
approach was that I didn't find how to determine the number of
entries to preserve, and I didn't want to offer additional knobs
for it. Finally I proposed the patch upthread, since it doesn't
need any assumption about usage.

Though I can make another patch that does the same thing based on
LRU, the same how-many-to-preserve problem has to be resolved in
order to avoid slowing the intermittent lookups.

> Turning to the patch itself, I don't know how we decide whether the
> patch is worth it.  Scanning the whole (potentially large) cache to
> remove negative entries has a cost, mostly in CPU cycles; keeping
> those negative entries around for a long time also has a cost, mostly
> in memory.  I don't know how to decide whether these patches will help
> more people than it hurts, or the other way around -- and it's not
> clear that anyone else has a good idea about that either.

Scanning a hash on invalidation of several catalogs (hopefully
only slightly) slows a certain percentage of invalidations on
maybe most workloads. Holding no-longer-looked-up entries surely
kills a backend under certain workloads sooner or later. This
doesn't save the pg_proc cases, but saves the pg_statistic and
pg_class cases. I'm not sure what other catalogs can bloat.

I could reduce the complexity of this. The inval mechanism conveys
only a hash value, so this scans the whole of a cache for the
target OIDs (with possible spurious targets). This could be
resolved by letting the inval mechanism convey an OID (but this
may need additional members in an inval entry).
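
Concretely, something like the following is what I have in mind (a
sketch only; the new member is mine, and enlarging the inval
message has its own cost):

typedef struct
{
	int8		id;			/* cache ID --- must be first */
	Oid			dbId;		/* database ID, or 0 if a shared relation */
	uint32		hashValue;	/* hash value of key for this catcache */
	Oid			keyOid;		/* NEW: key OID, to avoid full-cache scans */
} SharedInvalCatcacheMsg;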

Still, the full scan performed in CleanupCatCacheNegEntries
doesn't seem easily avoidable. Separating the hash by key OID, or
providing a special dlist that points to tuples in buckets, would
introduce more complexity.


> Typos: funciton, paritial.

Thanks. ispell told me of additional typos: corresnpond, belive
and undistinguisable.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protect syscache from bloating with negative cache entries

2017-09-05 Thread Kyotaro HORIGUCHI
Thank you for reviewing this.

At Sat, 2 Sep 2017 12:12:47 +1200, Thomas Munro  
wrote in 
> On Mon, Aug 28, 2017 at 9:24 PM, Kyotaro HORIGUCHI
>  wrote:
> > This patch have had interferences from several commits after the
> > last submission. I amended this patch to follow them (up to
> > f97c55c), removed an unnecessary branch and edited some comments.
> 
> Hi Kyotaro-san,
> 
> This applies but several regression tests fail for me.  Here is a
> sample backtrace:

Sorry for the silly mistake. STATEXTNAMENSP and STATRELATTINH were
missing additional elements in their definitions. Somehow I had
removed them.

The attached patch doesn't crash in the regression tests, and
fixes some typos pointed out by Robert and some I found myself.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 215530f562eaa514437e811686f6e3e419fdd950 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 28 Aug 2017 11:36:21 +0900
Subject: [PATCH 1/2] Cleanup negative cache of pg_statistic when dropping a
 relation.

Accessing columns that don't have statistics leaves negative entries
in the catcache for pg_statistic, but there's no chance to remove
them. In particular, repeatedly creating and then dropping temporary
tables bloats the catcache so much that memory pressure becomes
significant. This patch removes negative entries in STATRELATTINH,
ATTNAME and ATTNUM when the corresponding relation is dropped.
---
 src/backend/utils/cache/catcache.c |  57 ++-
 src/backend/utils/cache/syscache.c | 302 +++--
 src/include/utils/catcache.h   |   3 +
 src/include/utils/syscache.h   |   2 +
 4 files changed, 281 insertions(+), 83 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index e092801..73e7a44 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -304,10 +304,11 @@ CatCachePrintStats(int code, Datum arg)
 
 		if (cache->cc_ntup == 0 && cache->cc_searches == 0)
 			continue;			/* don't print unused caches */
-		elog(DEBUG2, "catcache %s/%u: %d tup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
+		elog(DEBUG2, "catcache %s/%u: %d tup, %d negtup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
 			 cache->cc_relname,
 			 cache->cc_indexoid,
 			 cache->cc_ntup,
+			 cache->cc_nnegtup,
 			 cache->cc_searches,
 			 cache->cc_hits,
 			 cache->cc_neg_hits,
@@ -374,6 +375,10 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 	/* free associated tuple data */
 	if (ct->tuple.t_data != NULL)
 		pfree(ct->tuple.t_data);
+
+	if (ct->negative)
+		--cache->cc_nnegtup;
+
 	pfree(ct);
 
 	--cache->cc_ntup;
@@ -572,6 +577,49 @@ ResetCatalogCache(CatCache *cache)
 }
 
 /*
+ *		CleanupCatCacheNegEntries
+ *
+ *	Remove negative cache tuples matching a partial key.
+ *
+ */
+void
+CleanupCatCacheNegEntries(CatCache *cache, ScanKeyData *skey)
+{
+	int i;
+
+	/* If this cache has no negative entries, nothing to do */
+	if (cache->cc_nnegtup == 0)
+		return;
+
+	/* searching with a partial key means scanning the whole cache */
+	for (i = 0; i < cache->cc_nbuckets; i++)
+	{
+		dlist_head *bucket = &cache->cc_bucket[i];
+		dlist_mutable_iter iter;
+
+		dlist_foreach_modify(iter, bucket)
+		{
+			CatCTup*ct = dlist_container(CatCTup, cache_elem, iter.cur);
+			bool		res;
+
+			if (!ct->negative)
+continue;
+
+			HeapKeyTest(&ct->tuple, cache->cc_tupdesc, 1, skey, res);
+			if (!res)
+continue;
+
+			/*
+			 * the negative cache entries can no longer be referenced, so we
+			 * can remove them unconditionally
+			 */
+			CatCacheRemoveCTup(cache, ct);
+		}
+	}
+}
+
+
+/*
  *		ResetCatalogCaches
  *
  * Reset all caches when a shared cache inval event forces it
@@ -718,6 +766,7 @@ InitCatCache(int id,
 	cp->cc_relisshared = false; /* temporary */
 	cp->cc_tupdesc = (TupleDesc) NULL;
 	cp->cc_ntup = 0;
+	cp->cc_nnegtup = 0;
 	cp->cc_nbuckets = nbuckets;
 	cp->cc_nkeys = nkeys;
 	for (i = 0; i < nkeys; ++i)
@@ -1215,8 +1264,8 @@ SearchCatCache(CatCache *cache,
 
 		CACHE4_elog(DEBUG2, "SearchCatCache(%s): Contains %d/%d tuples",
 	cache->cc_relname, cache->cc_ntup, CacheHdr->ch_ntup);
-		CACHE3_elog(DEBUG2, "SearchCatCache(%s): put neg entry in bucket %d",
-	cache->cc_relname, hashIndex);
+		CACHE4_elog(DEBUG2, "SearchCatCache(%s): put neg entry in bucket %d, total %d",
+	cache->cc_relname, hashIndex, cache->cc_nnegtup);
 
 		/*
 		 * We are not returning the negative entry to the caller, so leave its
@@ -1667,6 +1716,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp,
 
 	cache->cc_ntup++;
 	CacheHdr->ch_ntup++;
+	if (negative)
+		cache->cc_nnegtup++;
 
 	/*
 	 * If the hash table has become too 

Re: [HACKERS] multiple target of VACUUM command

2017-09-05 Thread Kyotaro HORIGUCHI
Ouch!

At Thu, 31 Aug 2017 23:09:20 +0900, Michael Paquier  
wrote in 
> On Thu, Aug 31, 2017 at 9:53 PM, Kyotaro HORIGUCHI
>  wrote:
> > I sometimes feel annoyed when trying to VACUUM multiple specific
> > tables.
> >
> > postgres=# vacuum a, b;
> > ERROR:  syntax error at or near ","
> > LINE 1: vacuum a, b;
> >
> > This patch just allows multiple targets for VACUUM command.
> 
> There is a patch for the same feature by Nathan Bossart which is being
> discussed already in this commit fest:
> https://www.postgresql.org/message-id/e061a8e3-5e3d-494d-94f0-e8a9b312b...@amazon.com

Sorry for the duplication.

> It had already a couple of rounds of reviews, and is getting close to
> something that could be committed. There is still a pending bug
> related to the use of RangeVar though with autovacuum.
> 
> Your approach is missing a couple of points. For example when
> specifying multiple targets, we have decided to check for an ERROR at
> the beginning of VACUUM, but we are issuing a WARNING if it goes
> missing in the middle of processing a list, so your set of patches
> would provide a frustrating experience. We have also discussed about
> reshaping a bit the API of vacuum(), so I would recommend looking at
> what has been already proposed if you are interested.

Thank you! I'll do that. I've marked this as "Rejected".

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] show "aggressive" or not in autovacuum logs

2017-09-05 Thread Kyotaro HORIGUCHI
Hello,

At Mon, 28 Aug 2017 20:07:32 -0700, "David G. Johnston" 
 wrote in 

> On Mon, Aug 28, 2017 at 2:26 AM, Kyotaro HORIGUCHI <
> horiguchi.kyot...@lab.ntt.co.jp> wrote:
> 
> > https://www.postgresql.org/docs/devel/static/runtime-config-client.html
> >
> > > VACUUM performs an aggressive scan
> >
> 
> Maybe this should gets its own thread/patch but I'll tack this on here
> since it all seems related.
> 
> That paragraph you linked has a couple of typos:
> 
> "Although users can set this value anywhere from zero to two billions,
> VACUUM" ...
> 
> should be 'two billion' (i.e., drop the "s")

Sure. (and I would make the same mistake..)

> "...so that a periodical manual VACUUM has..."
> 
> 'periodic' - though the description in the linked 24.1.5 is somewhat
> clearer (and longer) - the gap exists for the benefit of routine vacuum
> invocations to detect the need for an aggressive vacuum as part of a normal
> operating cycle rather than the last routine vacuum being non-aggressive
> and shortly thereafter an auto-vacuum anti-wraparound run is performed.
> 
> Current:
> 
> VACUUM will silently limit the effective value to 95% of
> autovacuum_freeze_max_age, so that a periodical manual VACUUM has a chance
> to run before an anti-wraparound autovacuum is launched for the table.
> 
> My interpretation:
> 
> VACUUM will silently limit the effective value to 95% of
> autovacuum_freeze_max_age so that a normal scan has a window within which
> to detect the need to convert itself to an aggressive scan and preempt the
> need for an untimely autovacuum initiated anti-wraparound scan.

I haven't found the phrase "normal scan" in the documentation.
Instead, it seems to me that it should be "normal VACUUMs" as
seen in 24.1.5, which means the VACUUM command without the FREEZE
option. ("normal index scan" is another thing.)

So, more verbosely, the interpretation would be the following:

| VACUUM will silently limit the effective value to 95% of
| autovacuum_freeze_max_age so that a periodic (manual) VACUUM
| without FREEZE option has a window within which to detect the
| need to convert itself to an aggressive VACUUM and preempt the
| need for an untimely autovacuum initiated anti-wraparound scan.

> As noted in the 24.1.5 that "normal scan" can be time scheduled or an
> update driven auto-vacuum one.  It could be manual (though not really
> periodic) if one is monitoring the aging and is notified to run on manually
> when the age falls within the gap.
> 
> "Aggressive" sounds right throughout all of this.
> 
> Up-thread:
> INFO:  vacuuming "public.it" in aggressive mode
> 
> Maybe:
> INFO:  aggressively vacuuming "public.it"
> INFO:  vacuuming "public.it"

Hmm. I believed that 'vacuuming' here is a noun and that the
phrase seen in the patch I just submitted is 'aggressive
vacuuming'. But of course it's quite possible that I'm wrong.

> Likewise:
> LOG:  automatic vacuum of table "postgres.public.pgbench_branches": mode:
> aggressive, index scans: 0
> 
> could be:
> LOG:  automatic aggressive vacuum of table
> "postgres.public.pgbench_branches", index scans: 0

Yeah, this is just the same with mine.

> Having read the docs and come to understand what "aggressive" means two
> wordings work for me (i.e., leaving non-aggressive unadorned).



> David J.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] show "aggressive" or not in autovacuum logs

2017-09-04 Thread Kyotaro HORIGUCHI
Thank you for the opinions.

At Tue, 29 Aug 2017 15:00:57 +0900, Masahiko Sawada  
wrote in 
> On Tue, Aug 29, 2017 at 10:16 AM, Robert Haas  wrote:
> > On Mon, Aug 28, 2017 at 5:26 AM, Kyotaro HORIGUCHI
> >  wrote:
> >> Currently the message shows the '%d skipped-frozen' message but
> >> it is insufficient to verify the true effect. This is a patch to
> >> show mode as 'aggressive' or 'normal' in the closing message of
> >> vacuum. %d frozen-skipped when 'aggressive mode' shows the true
> >> effect of ALL_FROZEN.
> >>
> >> I will add this patch to CF2017-09.
> >
> > I would be a bit inclined to somehow show aggressive if it's
> > aggressive and not insert anything at all otherwise.  That'd probably
> > require two separate translatable strings in each case, but maybe
> > that's OK.
> >
> > What do other people think?
> 
> FWIW I prefer the Robert's idea; not insert anything if normal vacuum.

Though that form was intended to simplify parsing of the message,
keeping the previous format when nothing special happened is
convincing.

How about the following?

"automatic [agressive ]vacuum of table \"%s..."
"[aggressive ]vacuuming \"%s..."

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 8ced1199d69bd1f59bff9224cbdd62a7159d85ce Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 28 Aug 2017 13:12:25 +0900
Subject: [PATCH] Show "aggressive" or not in vacuum messages

VACUUM VERBOSE or autovacuum emits a log message with "%u skipped
frozen", but we cannot tell whether the vacuum was a non-freezing (not
aggressive) vacuum or a freezing (aggressive) vacuum that had no tuples
to freeze. This patch adds an indication of aggressive (auto)vacuum to
the log messages and the VACUUM VERBOSE message.
---
 src/backend/commands/vacuumlazy.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 45b1859..472a43b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -373,7 +373,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 			 * emitting individual parts of the message when not applicable.
 			 */
 			initStringInfo(&buf);
-			appendStringInfo(&buf, _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n"),
+			appendStringInfo(&buf, _("automatic %svacuum of table \"%s.%s.%s\": index scans: %d\n"),
+			 aggressive ? "aggressive " : "",
 			 get_database_name(MyDatabaseId),
 			 get_namespace_name(RelationGetNamespace(onerel)),
 			 RelationGetRelationName(onerel),
@@ -487,7 +488,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 
 	relname = RelationGetRelationName(onerel);
 	ereport(elevel,
-			(errmsg("vacuuming \"%s.%s\"",
+			(errmsg("%svacuuming \"%s.%s\"",
+	aggressive ? "aggressive " : "",
 	get_namespace_name(RelationGetNamespace(onerel)),
 	relname)));
 
-- 
2.9.2


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-09-03 Thread Kyotaro HORIGUCHI
Hello,

Thank you for reviewing this.

At Mon, 28 Aug 2017 20:14:54 +0900, Michael Paquier  
wrote in 
> On Mon, Aug 28, 2017 at 8:02 PM, Kyotaro HORIGUCHI
>  wrote:
> > The first patch (0001-) fixes this problem, preventing the
> > problematic state of WAL segments by retarding restart LSN of a
> > physical replication slot in a certain condition.
> 
> FWIW, I have this patch marked on my list of things to look at, so you
> can count me as a reviewer. There are also some approaches that I
> would like to test because I rely on replication slots for some
> infrastructure. Still...

This test patch modifies the code for convenience. The window for
this bug to occur runs from receiving the first record of a segment
until, in most cases, receiving the second record, or sometimes until
several records later. Would intentionally emitting a record spanning
two or more segments work?


> +if (oldFlushPtr != InvalidXLogRecPtr &&
> +(restartLSN == InvalidXLogRecPtr ?
> + oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
> + restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ))
> I find such code patterns not readable.

Yeah, I agree. I rewrote that part, and the result in the attached
patch is far cleaner than that blob.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c15473998467a17ac6070ed68405a14ca0f98ae6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 1 Feb 2017 16:07:22 +0900
Subject: [PATCH 1/2] Retard restart LSN of a slot when a segment starts with a
 contrecord.

A physical-replication standby can stop just at the boundary of WAL
segments. The restart_lsn of a slot on the master can be assumed to be
at the same location. In that case the last segment on the master will
be removed after some checkpoints. If the last record of the last
replicated segment continues into the next segment, the continuation
record exists only on the master. The standby then cannot start,
because the split record is not available from a single source.

This patch detains restart_lsn in the last segment when the first page
of the next segment is a continuation record.
---
 src/backend/replication/walsender.c | 123 ++--
 1 file changed, 116 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index db346e6..30e222c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -217,6 +217,13 @@ static struct
 	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
 }			LagTracker;
 
+/*
+ * This variable corresponds to restart_lsn in pg_replication_slots for a
+ * physical slot. This has a valid value only when it differs from the current
+ * flush pointer.
+ */
+static XLogRecPtr	   restartLSN = InvalidXLogRecPtr;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -251,7 +258,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool notfoundok);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -546,6 +553,9 @@ StartReplication(StartReplicationCmd *cmd)
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 (errmsg("cannot use a logical replication slot for physical replication";
+
+		/* Restore restartLSN from replication slot */
+		restartLSN = MyReplicationSlot->data.restart_lsn;
 	}
 
 	/*
@@ -561,6 +571,10 @@ StartReplication(StartReplicationCmd *cmd)
 	else
 		FlushPtr = GetFlushRecPtr();
 
+	/* Set InvalidXLogRecPtr if catching up */
+	if (restartLSN == FlushPtr)
+		restartLSN = InvalidXLogRecPtr;
+
 	if (cmd->timeline != 0)
 	{
 		XLogRecPtr	switchpoint;
@@ -770,7 +784,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
 	return count;
 }
@@ -1738,7 +1752,7 @@ static void
 ProcessStandbyReplyMessage(void)
 {
 	XLogRecPtr	writePtr,
-flushPtr,
+flushPtr, oldFlushPtr,
 applyPtr;
 	bool		replyRequested;
 	TimeOffset	writeLag,
@@ -1798,6 +1812,7 @@ ProcessStandbyReplyMessage(void)
 		WalSnd	   *walsnd = MyWalSnd;
 
 		SpinLockAcquire(&walsnd->mutex);
+		oldFlushPtr = walsnd->flush;
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
@@ -1821,7 +1836,93 @@ ProcessStandbyReplyMessage(void)
 		if (SlotIsLogical(MyReplication

Re: [HACKERS] asynchronous execution

2017-09-03 Thread Kyotaro HORIGUCHI
At Thu, 31 Aug 2017 21:52:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170831.215236.135328985.horiguchi.kyot...@lab.ntt.co.jp>
> At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20170803.093057.261590619.horiguchi.kyot...@lab.ntt.co.jp>
> > > Unfortunately, that's probably another gigantic patch (that
> > > should probably be written by Andres).
> > 
> > Yeah, but building an async executor on the current style of executor
> > seems futile work, and sitting idle until the patch comes is also a
> > waste of time. So I'm planning to include the following stuff in the
> > next PoC patch, even though I'm not sure it can land on the coming
> > Andres' patch.
> > 
> > - Tuple passing outside the call stack. (I remember it was somewhere
> >   earlier in the thread, but I couldn't find it.)
> > 
> >   This should be included in Andres' patch.
> > 
> > - Give the executor the ability to run from data-source (or driver)
> >   nodes up to the root.
> > 
> >   I'm not sure this is included, but I suppose he is aiming at this
> >   kind of thing.
> > 
> > - Rebuild asynchronous execution on the upside-down executor.
> 
> Anyway, I modified ExecProcNode into push-up form and it *seems*
> to work to some extent. But triggers and cursors are almost broken
> and several other regression tests fail. Some nodes such as
> windowagg are terribly difficult to change to this push-up form
> (using a state machine). And of course it is terribly inefficient.
> 
> I'm afraid that all of this will turn out to be in vain. But anyway,
> and FWIW, I'll post the work here after some cleansing work on
> it.

So, here it is. Maybe this is really a bad way to go. Worst of
all, it is terribly hard to maintain, because the behavior
of the state machine constructed in this patch is hardly
predictable and thus easily broken. During the 'cleansing work' I hit
many crashes and infinite loops, and they were a bit hard to
diagnose. This will soon be broken by subsequent commits.

Anyway, and again FWIW, here it is. I'll leave this alone for a while
(at least for the period of this CF) and reconsider async execution in
different forms.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


poc_pushexecutor_20170904_4faa1dc.tar.bz2
Description: Binary data



[HACKERS] multiple target of VACUUM command

2017-08-31 Thread Kyotaro HORIGUCHI
Hello,

I sometimes feel annoyed when trying to VACUUM multiple specific
tables.

postgres=# vacuum a, b;
ERROR:  syntax error at or near ","
LINE 1: vacuum a, b;

This patch just allows multiple targets for the VACUUM command.
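With the patch applied, a session like the following should work (a
sketch; the column-list form is hypothetical and assumes the per-table
lists carried by VacRelCols):

    VACUUM a, b;
    VACUUM ANALYZE a (x), b (y);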

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 481690f0d84a21db755a986a7f785e8bbbe0769e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 28 Jul 2017 13:30:59 +0900
Subject: [PATCH 1/2] Make VacuumStmt capable of having multiple table parameters

---
 src/backend/commands/vacuum.c  | 18 +++---
 src/backend/nodes/copyfuncs.c  | 13 +
 src/backend/nodes/equalfuncs.c | 11 +++
 src/backend/parser/gram.y  | 29 -
 src/include/nodes/nodes.h  |  1 +
 src/include/nodes/parsenodes.h | 10 --
 6 files changed, 64 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa1812..d6cd352 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -85,12 +85,12 @@ void
 ExecVacuum(VacuumStmt *vacstmt, bool isTopLevel)
 {
 	VacuumParams params;
+	ListCell*lc;
 
 	/* sanity checks on options */
 	Assert(vacstmt->options & (VACOPT_VACUUM | VACOPT_ANALYZE));
 	Assert((vacstmt->options & VACOPT_VACUUM) ||
 		   !(vacstmt->options & (VACOPT_FULL | VACOPT_FREEZE)));
-	Assert((vacstmt->options & VACOPT_ANALYZE) || vacstmt->va_cols == NIL);
 	Assert(!(vacstmt->options & VACOPT_SKIPTOAST));
 
 	/*
@@ -119,8 +119,20 @@ ExecVacuum(VacuumStmt *vacstmt, bool isTopLevel)
 	params.log_min_duration = -1;
 
 	/* Now go through the common routine */
-	vacuum(vacstmt->options, vacstmt->relation, InvalidOid, &params,
-		   vacstmt->va_cols, NULL, isTopLevel);
+	if (list_length(vacstmt->relcols) == 0)
+		vacuum(vacstmt->options, NULL, InvalidOid, &params,
+			   NIL, NULL, isTopLevel);
+	else
+	{
+		foreach (lc, vacstmt->relcols)
+		{
+			VacRelCols *relcol = (VacRelCols *) lfirst(lc);
+			Assert((vacstmt->options & VACOPT_ANALYZE) ||
+   relcol->va_cols == NIL);
+			vacuum(vacstmt->options, relcol->relation, InvalidOid, &params,
+   relcol->va_cols, NULL, isTopLevel);
+		}
+	}
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7204169..761f758 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3764,6 +3764,16 @@ _copyVacuumStmt(const VacuumStmt *from)
 	VacuumStmt *newnode = makeNode(VacuumStmt);
 
 	COPY_SCALAR_FIELD(options);
+	COPY_NODE_FIELD(relcols);
+
+	return newnode;
+}
+
+static VacRelCols *
+_copyVacRelCols(const VacRelCols *from)
+{
+	VacRelCols *newnode = makeNode(VacRelCols);
+
 	COPY_NODE_FIELD(relation);
 	COPY_NODE_FIELD(va_cols);
 
@@ -5527,6 +5537,9 @@ copyObjectImpl(const void *from)
 		case T_PartitionCmd:
 			retval = _copyPartitionCmd(from);
 			break;
+		case T_VacRelCols:
+			retval = _copyVacRelCols(from);
+			break;
 
 			/*
 			 * MISCELLANEOUS NODES
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8d92c03..aea7168 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1663,6 +1663,14 @@ static bool
 _equalVacuumStmt(const VacuumStmt *a, const VacuumStmt *b)
 {
 	COMPARE_SCALAR_FIELD(options);
+	COMPARE_NODE_FIELD(relcols);
+
+	return true;
+}
+
+static bool
+_equalVacRelCols(const VacRelCols *a, const VacRelCols *b)
+{
 	COMPARE_NODE_FIELD(relation);
 	COMPARE_NODE_FIELD(va_cols);
 
@@ -3675,6 +3683,9 @@ equal(const void *a, const void *b)
 		case T_PartitionCmd:
 			retval = _equalPartitionCmd(a, b);
 			break;
+		case T_VacRelCols:
+			retval = _equalVacRelCols(a, b);
+			break;
 
 		default:
 			elog(ERROR, "unrecognized node type: %d",
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7d0de99..844c691 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10128,13 +10128,13 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
 		n->options |= VACOPT_FREEZE;
 	if ($4)
 		n->options |= VACOPT_VERBOSE;
-	n->relation = NULL;
-	n->va_cols = NIL;
+	n->relcols = NIL;
 	$$ = (Node *)n;
 }
 			| VACUUM opt_full opt_freeze opt_verbose qualified_name
 {
 	VacuumStmt *n = makeNode(VacuumStmt);
+	VacRelCols *relcol = makeNode(VacRelCols);
 	n->options = VACOPT_VACUUM;
 	if ($2)
 		n->options |= VACOPT_FULL;
@@ -10142,8 +10142,9 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
 		n->options |= VACOPT_FREEZE;
 	if ($4)
 		n->options |= VACOPT_VERBOSE;
-	n->relation = $5;
-	n->va_cols = NIL;
+	relcol->relation = $5;
+	relcol->va_cols = NIL;
+	n->relcols = list_make1(relcol);
 	$$ = (Node *)n;
 }
 			| VACUUM opt_full opt_freeze opt_verbose AnalyzeStmt
@@ -10162,18 +10163,19 @@ VacuumStmt: VACUUM opt_full opt_fr

Re: [HACKERS] asynchronous execution

2017-08-31 Thread Kyotaro HORIGUCHI
At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170803.093057.261590619.horiguchi.kyot...@lab.ntt.co.jp>
> > Unfortunately, that's probably another gigantic patch (that
> > should probably be written by Andres).
> 
> Yeah, but building an async executor on the current style of executor
> seems futile work, and sitting idle until the patch comes is also a
> waste of time. So I'm planning to include the following stuff in the
> next PoC patch, even though I'm not sure it can land on the coming
> Andres' patch.
> 
> - Tuple passing outside the call stack. (I remember it was somewhere
>   earlier in the thread, but I couldn't find it.)
> 
>   This should be included in Andres' patch.
> 
> - Give the executor the ability to run from data-source (or driver)
>   nodes up to the root.
> 
>   I'm not sure this is included, but I suppose he is aiming at this
>   kind of thing.
> 
> - Rebuild asynchronous execution on the upside-down executor.

Anyway, I modified ExecProcNode into push-up form and it *seems*
to work to some extent. But triggers and cursors are almost broken
and several other regression tests fail. Some nodes such as
windowagg are terribly difficult to change to this push-up form
(using a state machine). And of course it is terribly inefficient.

I'm afraid that all of this will turn out to be in vain. But anyway,
and FWIW, I'll post the work here after some cleansing work on
it.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-08-28 Thread Kyotaro HORIGUCHI
Hello,

This problem still occurs on the master.
I rebased the patch onto the current master.

At Mon, 3 Apr 2017 08:38:47 +0900, Michael Paquier  
wrote in 
> On Mon, Apr 3, 2017 at 7:19 AM, Venkata B Nagothi  wrote:
> > As we are already past the commitfest, I am not sure, what should i change
> > the patch status to ?
> 
> The commit fest finishes on the 7th of April. Even with the deadline
> passed, there is nothing preventing to work on bug fixes. So this item
> ought to be moved to the next CF with the same category.

The steps to reproduce the problem follow.

- Apply the second patch (0002-) attached and recompile. It
  effectively reproduces the problematic state of the database.

- M(aster): initdb the master with wal_keep_segments = 0
  (default), log_min_messages = debug2
- M: Create a physical repslot.
- S(tandby): Set up a standby database.
- S: Edit recovery.conf to use the replication slot above, then
  start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
  > LOG:   STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

  > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL 
segment 0001002B has already been removed


The first patch (0001-) fixes this problem, preventing the
problematic state of WAL segments by retarding the restart LSN of a
physical replication slot under a certain condition.
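(While the standby is stopped, the segment the slot still pins can be
inspected like this; a hypothetical session, using pg_walfile_name on
v10 master or pg_xlogfile_name on earlier branches:)

    SELECT slot_name, restart_lsn, pg_walfile_name(restart_lsn)
    FROM pg_replication_slots;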

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 3813599b74299f1da8d0567ed90542c5f35ed48b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 1 Feb 2017 16:07:22 +0900
Subject: [PATCH 1/2] Retard restart LSN of a slot when a segment starts with a
 contrecord.

A physical-replication standby can stop just at the boundary of WAL
segments. The restart_lsn of a slot on the master can be assumed to be
at the same location. In that case the last segment on the master will
be removed after some checkpoints. If the last record of the last
replicated segment continues into the next segment, the continuation
record exists only on the master. The standby then cannot start,
because the split record is not available from a single source.

This patch detains restart_lsn in the last segment when the first page
of the next segment is a continuation record.
---
 src/backend/replication/walsender.c | 105 +---
 1 file changed, 98 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 03e1cf4..30c80af 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -217,6 +217,13 @@ static struct
 	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
 }			LagTracker;
 
+/*
+ * This variable corresponds to restart_lsn in pg_replication_slots for a
+ * physical slot. This has a valid value only when it differs from the current
+ * flush pointer.
+ */
+static XLogRecPtr	   restartLSN = InvalidXLogRecPtr;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -251,7 +258,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool notfoundok);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -546,6 +553,9 @@ StartReplication(StartReplicationCmd *cmd)
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 (errmsg("cannot use a logical replication slot for physical replication";
+
+		/* Restore restartLSN from replication slot */
+		restartLSN = MyReplicationSlot->data.restart_lsn;
 	}
 
 	/*
@@ -561,6 +571,10 @@ StartReplication(StartReplicationCmd *cmd)
 	else
 		FlushPtr = GetFlushRecPtr();
 
+	/* Set InvalidXLogRecPtr if catching up */
+	if (restartLSN == FlushPtr)
+		restartLSN = InvalidXLogRecPtr;
+
 	if (cmd->timeline != 0)
 	{
 		XLogRecPtr	switchpoint;
@@ -770,7 +784,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
 	return count;
 }
@@ -1738,7 +1752,7 @@ static void
 ProcessStandbyReplyMessage(void)
 {
 	XLogRecPtr	writePtr,
-flushPtr,
+flushPtr, oldFlushPtr,
 applyPtr;
 	bool		replyRequested;
 	TimeOffset	writeLag,
@@ -1798,6 +1812,7 @@ ProcessStandbyReplyMessage(void)
 		WalSnd	   *walsnd = MyWalSnd;
 
 		SpinLockAcquire(&walsnd->mutex);
+		oldFlushPtr

Re: [HACKERS] Restricting maximum keep segments by repslots

2017-08-28 Thread Kyotaro HORIGUCHI
Hello,

I'll add this to CF2017-09.

At Mon, 06 Mar 2017 18:20:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170306.182006.172683338.horiguchi.kyot...@lab.ntt.co.jp>
> Thank you for the comment.
> 
> At Fri, 3 Mar 2017 14:47:20 -0500, Peter Eisentraut 
>  wrote in 
> 
> > On 3/1/17 19:54, Kyotaro HORIGUCHI wrote:
> > >> Please measure it in size, not in number of segments.
> > > It was difficult to dicide which is reaaonable but I named it
> > > after wal_keep_segments because it has the similar effect.
> > > 
> > > In bytes(or LSN)
> > >  max_wal_size
> > >  min_wal_size
> > >  wal_write_flush_after
> > > 
> > > In segments
> > >  wal_keep_segments
> > 
> > We have been moving away from measuring in segments.  For example,
> > checkpoint_segments was replaced by max_wal_size.
> > 
> > Also, with the proposed patch that allows changing the segment size more
> > easily, this will become more important.  (I wonder if that will require
> > wal_keep_segments to change somehow.)
> 
> Agreed. It is 'max_slot_wal_keep_size' in the new version.
> 
> wal_keep_segments might should be removed someday.

- Following min/max_wal_size, the variable was renamed to
  "max_slot_wal_keep_size_mb" and is used as ConvertToXSegs(x).

- Stopped warning when a checkpoint doesn't flush segments required
  by slots, even if max_slot_wal_keep_size has taken effect.

- Avoided a subtraction that could go negative. (See also the example
  below.)
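(With the patch applied, the setting and the slots' state could be
observed roughly like this; a hypothetical session:)

    SHOW max_slot_wal_keep_size;
    SELECT slot_name, restart_lsn FROM pg_replication_slots;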

regards,

*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 105,110  int			wal_level = WAL_LEVEL_MINIMAL;
--- 105,111 
  int			CommitDelay = 0;	/* precommit delay in microseconds */
  int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
  int			wal_retrieve_retry_interval = 5000;
+ int			max_slot_wal_keep_size_mb = 0;
  
  #ifdef WAL_DEBUG
  bool		XLOG_DEBUG = false;
***
*** 9353,9361  KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
--- 9354,9385 
  	if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
  	{
  		XLogSegNo	slotSegNo;
+ 		int			slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb);
  
  		XLByteToSeg(keep, slotSegNo);
  
+ 		/*
+ 		 * ignore slots if too many wal segments are kept.
+ 		 * max_slot_wal_keep_size is just accumulated on wal_keep_segments.
+ 		 */
+ 		if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno)
+ 		{
+ 			segno = segno - slotlimitsegs; /* must be positive */
+ 
+ 			/*
+ 			 * warn only if the checkpoint flushes the required segment.
+ 			 * we assume here that *logSegNo is calculated keep location.
+ 			 */
+ 			if (slotSegNo < *logSegNo)
+ ereport(WARNING,
+ 	(errmsg ("restart LSN of replication slots is ignored by checkpoint"),
+ 	 errdetail("Some replication slots have lost required WAL segments to continue by up to %ld segments.",
+ 	   (segno < *logSegNo ? segno : *logSegNo) - slotSegNo)));
+ 
+ 			/* emergency vent */
+ 			slotSegNo = segno;
+ 		}
+ 
  		if (slotSegNo <= 0)
  			segno = 1;
  		else if (slotSegNo < segno)
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***
*** 2366,2371  static struct config_int ConfigureNamesInt[] =
--- 2366,2382 
  	},
  
  	{
+ 		{"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
+ 			gettext_noop("Sets the maximum size of extra WALs kept by replication slots."),
+ 		 NULL,
+ 		 GUC_UNIT_MB
+ 		},
+ 		&max_slot_wal_keep_size_mb,
+ 		0, 0, INT_MAX,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING,
  			gettext_noop("Sets the maximum time to wait for WAL replication."),
  			NULL,
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***
*** 235,240 
--- 235,241 
  #max_wal_senders = 10		# max number of walsender processes
  # (change requires restart)
  #wal_keep_segments = 0		# in logfile segments, 16MB each; 0 disables
+ #max_slot_wal_keep_size = 0	# in megabytes; 0 disables
  #wal_sender_timeout = 60s	# in milliseconds; 0 disables
  
  #max_replication_slots = 10	# max number of replication slots
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***
*** 97,102  extern bool reachedConsistency;
--- 97,103 
  extern int	min_wal_size_mb;
  extern int	max_wal_size_mb;
  extern int	wal_keep_segments;
+ extern int	max_slot_wal_keep_size_mb;
  extern int	XLOGbuffers;
  extern int	XLogArchiveTimeout;
  extern int	wal_retrieve_retry_interval;



Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT

2017-08-28 Thread Kyotaro HORIGUCHI
Hello,

I'll add this to CF2017-09.

At Tue, 27 Jun 2017 16:27:18 +0900, Amit Langote 
 wrote in 
<75fe42df-b1d8-89ff-596d-d9da0749e...@lab.ntt.co.jp>
> On 2017/06/26 18:44, Kyotaro HORIGUCHI wrote:
> > At Mon, 26 Jun 2017 18:16:42 +0900, Amit Langote wrote:
> >>
> >> I recall I had proposed a fix for the same thing some time ago [1].
> > 
> > Wow. About 1.5 years ago and stuck? I have a concrete case where
> > this causes harm, so I'd like to fix it this time. How can we move on?
> 
> Agreed that this should be fixed.
> 
> Your proposed approach #1 to recheck the inheritance after obtaining the
> lock on the child table seems to be a good way forward.
> 
> Approach #2 of reordering locking is a simpler solution, but does not
> completely prevent the problem, because DROP TABLE child can still cause
> it to occur, as you mentioned.
> 
> >>> The cause is that NO INHERIT doesn't take an exlusive lock on the
> >>> parent. This allows expand_inherited_rtentry to add the child
> >>> relation into appendrel after removal from the inheritance but
> >>> still exists.
> >>
> >> Right.
> >>
> >>> I see two ways to fix this.
> >>>
> >>> The first patch adds a recheck of inheritance relationship if the
> >>> corresponding attribute is missing in the child in
> >>> make_inh_translation_list(). The recheck is a bit complex but it
> >>> is not performed unless the sequence above is happen. It checks
> >>> duplication of relid (or cycles in inheritance) following
> >>> find_all_inheritors (but doing a bit different) but I'm not sure
> >>> it is really useful.
> >>
> >> I had proposed adding a syscache on pg_inherits(inhrelid, inhparent), but
> >> I guess your hash table based solution will do the job as far as
> >> performance of this check is concerned, although I haven't checked the
> >> code closely.
> > 
> > The hash table is not crucial in the patch. Substantially it just
> > recurs down looking up pg_inherits for the child. I considered
> > the new index but abandoned because I thought that such case
> > won't occur so frequently.
> 
> Agreed.  BTW, the hash table in patch #1 does not seem to be really
> helpful.  In the while loop in is_descendant_of_internal(), does
> hash_search() ever return found = true?  AFAICS, it does not.
> 
> >>> The second patch lets ALTER TABLE NO INHERIT to acquire locks on
> >>> the parent first.
> >>>
> >>> Since the latter has a larger impact on the current behavior and
> >>> we already treat "DROP TABLE child" case in the similar way, I
> >>> suppose that the first approach would be preferable.
> >>
> >> That makes sense.

So, I attached only the first patch, rebased onto the current
master (it had actually failed to apply there), with a typo in a
comment fixed.

This still runs a closed-loop (cycle) check using a temporary hash
table, but that seems a bit paranoid. (It is the same check as what
find_all_inheritors does.)
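(For reference, the race can be sketched as below; two sessions, with
purely illustrative names and timing -- the real window lies inside a
single planning run:)

    -- S1:
    SELECT * FROM p;           -- expansion picks up the child table c1
    -- S2, meanwhile; NO INHERIT takes no lock on the parent:
    ALTER TABLE c1 NO INHERIT p;
    ALTER TABLE c1 DROP COLUMN b;
    -- S1, still planning against the stale inheritance list, may then
    -- fail in make_inh_translation_list().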


<< the following is another topic >>

> >> BTW, in the partitioned table case, the parent is always locked first
> >> using an AccessExclusiveLock.  There are other considerations in that case
> >> such as needing to recreate the partition descriptor upon termination of
> >> inheritance (both the DETACH PARTITION and also DROP TABLE child cases).
> > 
> > Apart from the degree of concurrency, if we keep parent->children
> > order of locking, such recreation does not seem to be
> > needed. Maybe I'm missing something.
> 
> Sorry to have introduced that topic in this thread, but I will try to
> explain anyway why things are the way they are currently:
> 
> Once a table is no longer a partition of the parent (detached or dropped),
> we must make sure that the next commands in the transaction don't see it
> as one.  That information is currently present in the relcache
> (rd_partdesc), which is used by a few callers, most notably the
> tuple-routing code.  Next commands must recreate the entry so that the
> correct thing happens based on the updated information.  More precisely,
> we must invalidate the current entry.  RelationClearRelation() will either
> delete the entry or rebuild it.  If it's being referenced somewhere, it
> will be rebuilt.  The place holding the reference may also be looking at
> the content of rd_partdesc, which we don't require them to make a copy of,
> so we must preserve its content while other fields of RelationData are
> being read anew f

Re: [HACKERS] show "aggressive" or not in autovacuum logs

2017-08-28 Thread Kyotaro HORIGUCHI
Hello,

Currently the closing message shows '%d skipped-frozen', but that
is insufficient to verify the true effect. This is a patch to
show the mode as 'aggressive' or 'normal' in the closing message of
vacuum. '%d skipped-frozen' shows the true effect of ALL_FROZEN
only in aggressive mode.

I will add this patch to CF2017-09.

At Tue, 4 Apr 2017 20:29:38 +0900, Masahiko Sawada  
wrote in 
> On Tue, Apr 4, 2017 at 10:09 AM, Kyotaro HORIGUCHI
>  wrote:
> > | =# vacuum freeze verbose it;
> > | INFO:  vacuuming "public.it" in aggressive mode
> > | INFO:  "it": found 0 removable, 0 nonremovable row versions in 0 out of 0 
> > pages
> > ...
> > | Skipped 0 pages due to buffer pins, 0 frozen pages.
> >
> > I still feel a bit uneasy about the word "aggressive" here.
> 
> I think we can use the word "aggressive" here since we already use the
> word "aggressive vacuum" in docs[1], but it might be easily
> misunderstood.
> 
> [1] https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
> 
> >Is it better to be "freezing" or something?
> 
> Another idea could be something like "prevent wraparound". The
> autovacuum process doing an aggressive vacuum appears in pg_stat_activity
> with the words " (to prevent wraparound)". This wording might be more
> user friendly IMO.

Hmm. This wording appears in several forms.

https://www.postgresql.org/docs/devel/static/sql-vacuum.html

> aggressive “freezing” of tuples. ... Aggressive freezing

https://www.postgresql.org/docs/devel/static/routine-vacuuming.html

> VACUUM will perform an aggressive vacuum,
> an anti-wraparound autovacuum

https://www.postgresql.org/docs/devel/static/runtime-config-client.html

> VACUUM performs an aggressive scan

ps title

> (to prevent wraparound)

The nearest common wording seems to be just "aggressive" (vacuum),
so I left it as is in the attached patch.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 009507d5ddb33229e4c866fef206962de39317cc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 28 Aug 2017 13:12:25 +0900
Subject: [PATCH] Show "aggressive" or not in vacuum messages

VACUUM VERBOSE and autovacuum emit a log message with "n skipped-frozen",
but we cannot tell whether the vacuum was a non-aggressive
(non-freezing) vacuum or an aggressive (freezing) vacuum that had no
tuples to freeze. This patch adds an indication of "aggressive" or
"normal" to autovacuum logs and the VACUUM VERBOSE message.
---
 src/backend/commands/vacuumlazy.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 45b1859..71ecd50 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -373,10 +373,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 			 * emitting individual parts of the message when not applicable.
 			 */
 			initStringInfo(&buf);
-			appendStringInfo(&buf, _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n"),
+			appendStringInfo(&buf, _("automatic vacuum of table \"%s.%s.%s\": mode: %s, index scans: %d\n"),
 			 get_database_name(MyDatabaseId),
 			 get_namespace_name(RelationGetNamespace(onerel)),
 			 RelationGetRelationName(onerel),
+			 aggressive ? "aggressive" : "normal",
 			 vacrelstats->num_index_scans);
 			appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
 			 vacrelstats->pages_removed,
@@ -487,9 +488,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 
 	relname = RelationGetRelationName(onerel);
 	ereport(elevel,
-			(errmsg("vacuuming \"%s.%s\"",
+			(errmsg("vacuuming \"%s.%s\" in %s mode",
 	get_namespace_name(RelationGetNamespace(onerel)),
-	relname)));
+	relname, aggressive ? "aggressive" : "normal")));
 
 	empty_pages = vacuumed_pages = 0;
 	num_tuples = tups_vacuumed = nkeep = nunused = 0;
-- 
2.9.2




Re: [HACKERS] Protect syscache from bloating with negative cache entries

2017-08-28 Thread Kyotaro HORIGUCHI
Thank you for your attention.

At Mon, 14 Aug 2017 17:33:48 -0400, Peter Eisentraut 
 wrote in 
<09fa011f-4536-b05d-0625-11f3625d8...@2ndquadrant.com>
> On 1/24/17 02:58, Kyotaro HORIGUCHI wrote:
> >> BTW, if you set a slightly larger
> >> context size on the patch you might be able to avoid rebases; right
> >> now the patch doesn't include enough context to uniquely identify the
> >> chunks against cacheinfo[].
> > git format-patch -U5 fuses all hunks on cacheinfo[] together. I'm
> > not sure that such a hunk can avoid rebases. Is this what you
> > suggested? -U4 added an identifiable forward context line for
> > some elements so the attached patch is made with four context
> > lines.
> 
> This patch needs another rebase for the upcoming commit fest.

This patch has had interference from several commits since the
last submission. I amended the patch to follow them (up to
f97c55c), removed an unnecessary branch, and edited some comments.
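(To recap the scenario the patch targets, a hypothetical
reproduction -- the pg_statistic probes happen during planning:)

    CREATE TEMP TABLE t (a int);
    SELECT * FROM t WHERE a = 1;  -- probes pg_statistic; the misses are
                                  -- cached as negative entries
    DROP TABLE t;                 -- without the patch, those entries linger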

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 4887f7d178b41a1a4729931a12bd396b9a8e8ee0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 28 Aug 2017 11:36:21 +0900
Subject: [PATCH 1/2] Cleanup negative cache of pg_statistic when dropping a
 relation.

Accessing columns that don't have statistics leaves negative entries
in the catcache for pg_statistic, but there's no chance to remove
them. In particular, repeatedly creating and then dropping temporary
tables bloats the catcache so much that memory pressure becomes
significant. This patch removes negative entries in STATRELATTINH,
ATTNAME and ATTNUM when the corresponding relation is dropped.
---
 src/backend/utils/cache/catcache.c |  57 ++-
 src/backend/utils/cache/syscache.c | 299 +++--
 src/include/utils/catcache.h   |   3 +
 src/include/utils/syscache.h   |   2 +
 4 files changed, 279 insertions(+), 82 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index e092801..e50c997 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -304,10 +304,11 @@ CatCachePrintStats(int code, Datum arg)
 
 		if (cache->cc_ntup == 0 && cache->cc_searches == 0)
 			continue;			/* don't print unused caches */
-		elog(DEBUG2, "catcache %s/%u: %d tup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
+		elog(DEBUG2, "catcache %s/%u: %d tup, %d negtup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
 			 cache->cc_relname,
 			 cache->cc_indexoid,
 			 cache->cc_ntup,
+			 cache->cc_nnegtup,
 			 cache->cc_searches,
 			 cache->cc_hits,
 			 cache->cc_neg_hits,
@@ -374,6 +375,10 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 	/* free associated tuple data */
 	if (ct->tuple.t_data != NULL)
 		pfree(ct->tuple.t_data);
+
+	if (ct->negative)
+		--cache->cc_nnegtup;
+
 	pfree(ct);
 
 	--cache->cc_ntup;
@@ -572,6 +577,49 @@ ResetCatalogCache(CatCache *cache)
 }
 
 /*
+ *		CleanupCatCacheNegEntries
+ *
+ *	Remove negative cache tuples matching a partial key.
+ *
+ */
+void
+CleanupCatCacheNegEntries(CatCache *cache, ScanKeyData *skey)
+{
+	int i;
+
+	/* If this cache has no negative entries, nothing to do */
+	if (cache->cc_nnegtup == 0)
+		return;
+
+	/* searching with a partial key means scanning the whole cache */
+	for (i = 0; i < cache->cc_nbuckets; i++)
+	{
+		dlist_head *bucket = &cache->cc_bucket[i];
+		dlist_mutable_iter iter;
+
+		dlist_foreach_modify(iter, bucket)
+		{
+			CatCTup*ct = dlist_container(CatCTup, cache_elem, iter.cur);
+			bool		res;
+
+			if (!ct->negative)
+continue;
+
+			HeapKeyTest(&ct->tuple, cache->cc_tupdesc, 1, skey, res);
+			if (!res)
+continue;
+
+			/*
+			 * negative cache entries can no longer be referenced, so we
+			 * can remove them unconditionally
+			 */
+			CatCacheRemoveCTup(cache, ct);
+		}
+	}
+}
+
+
+/*
  *		ResetCatalogCaches
  *
  * Reset all caches when a shared cache inval event forces it
@@ -718,6 +766,7 @@ InitCatCache(int id,
 	cp->cc_relisshared = false; /* temporary */
 	cp->cc_tupdesc = (TupleDesc) NULL;
 	cp->cc_ntup = 0;
+	cp->cc_nnegtup = 0;
 	cp->cc_nbuckets = nbuckets;
 	cp->cc_nkeys = nkeys;
 	for (i = 0; i < nkeys; ++i)
@@ -1215,8 +1264,8 @@ SearchCatCache(CatCache *cache,
 
 		CACHE4_elog(DEBUG2, "SearchCatCache(%s): Contains %d/%d tuples",
 	cache->cc_relname, cache->cc_ntup, CacheHdr->ch_ntup);
-		CACHE3_elog(DEBUG2, "SearchCatCache(%s): put neg entry in bucket %d",
-	cache->cc_relname, hashIndex);
+		CACHE4_elog(DEBUG2, "SearchCatCache(%s): put neg entry in bucket %d, total %d",
+	cache->cc_relname, hashIndex, cache->cc_nnegtup);
 
 		/*
 		 * We are not retur

Re: [HACKERS] asynchronous execution

2017-08-02 Thread Kyotaro HORIGUCHI
Thank you for the comment.

At Tue, 1 Aug 2017 16:27:41 -0400, Robert Haas  wrote in 

> On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
>  wrote:
> > Another is getting rid of recursive call to run an execution
> > tree.
> 
> That happens to be exactly what Andres did for expression evaluation
> in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
> generalizing that to include the plan tree as well as expression trees
> is likely to be the long-term way forward here.

I read it in the source tree. The patch converts the expression
tree into an intermediate form and then runs it on a
custom-made interpreter. Guessing from the phrase "upside down"
from Andres, the whole thing will become source-driven.
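(The style can be illustrated by a tiny self-contained dispatch-loop
interpreter; all names below are invented for illustration and are not
PostgreSQL's:)

    #include <stdio.h>

    typedef enum { OP_CONST, OP_ADD, OP_DONE } OpCode;
    typedef struct { OpCode op; int arg; } Step;

    /* Walk a flat array of steps instead of recursing over a tree. */
    static int
    run(const Step *steps)
    {
        int acc = 0;
        for (int i = 0;; i++)
        {
            switch (steps[i].op)
            {
                case OP_CONST: acc = steps[i].arg; break;  /* load constant */
                case OP_ADD:   acc += steps[i].arg; break; /* accumulate */
                case OP_DONE:  return acc;                 /* end of program */
            }
        }
    }

    int
    main(void)
    {
        /* the flattened equivalent of evaluating 40 + 2 */
        Step prog[] = {{OP_CONST, 40}, {OP_ADD, 2}, {OP_DONE, 0}};
        printf("%d\n", run(prog));  /* prints 42 */
        return 0;
    }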

> Unfortunately, that's probably another gigantic patch (that
> should probably be written by Andres).

Yeah, but building an async executor on the current style of executor
seems futile work, and sitting idle until the patch comes is also a
waste of time. So I'm planning to include the following stuff in the
next PoC patch, even though I'm not sure it can land on the coming
Andres' patch.

- Tuple passing outside the call stack. (I remember it was somewhere
  earlier in the thread, but I couldn't find it.)

  This should be included in Andres' patch.

- Give the executor the ability to run from data-source (or driver)
  nodes up to the root.

  I'm not sure this is included, but I suppose he is aiming at this
  kind of thing.

- Rebuild asynchronous execution on the upside-down executor.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-07-31 Thread Kyotaro HORIGUCHI
At Fri, 28 Jul 2017 17:31:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170728.173105.238045591.horiguchi.kyot...@lab.ntt.co.jp>
> Thank you for the comment.
> 
> At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas  wrote 
> in 
> > regression all the same.  Every type of intermediate node will have to
> > have a code path where it uses ExecAsyncRequest() /
> > ExecAyncHogeResponse() rather than ExecProcNode() to get tuples, and
> 
> I understand what Robert is concerned about, and I share the same
> opinion. It needs a further, different framework.
> 
> At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas  wrote 
> in 
> > On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane  wrote:
> > > The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> > > no cost when $extra_stuff is not needed, because you simply don't insert
> > > the wrapper function when it's not needed.  I'm not sure that it will
...
> > Yeah, I don't quite see how that would apply in this case -- what we
> > need here is not as simple as just conditionally injecting an extra
> > bit.
> 
> Thank you for the pointer, Tom. The subject (segfault in HEAD...)
> hadn't made me think that this kind of discussion was being held
> there. Anyway, it seems very close to asynchronous execution, so
> I'll catch up on it, considering how I can tie it in with this.

I now understand the executor change that has just been made on
master based on the thread pointed to. It seems to have the
capability to let an exec node switch to async-aware with no extra
cost on non-async processing. So it would be doable to (just)
*shrink* the current framework by detaching the async-aware side
of the API. But to get the most out of asynchrony, multiple
async-capable nodes distributed under async-unaware nodes must
run simultaneously.

There seem to be two ways to achieve this.

One is propagating a required-async-nodes bitmap up to the topmost
node and waiting for all the required nodes to become ready. In
the long run this requires all nodes to be async-aware, and that
apparently would have a bad effect on the performance of async-unaware
nodes containing async-capable nodes.

Another is getting rid of the recursive calls used to run an execution
tree. It is perhaps the same as what was mentioned as "data-centric
processing" in previous threads [1], [2], but I'd like to pay
attention to the aspect of "enabling execution to resume from an
arbitrary leaf node".  So I'm considering realizing it still in a
one-tuple-at-a-time manner instead of collecting all tuples of a
leaf node first, even though I'm not sure it is doable.


[1] 
https://www.postgresql.org/message-id/bf2827dcce55594c8d7a8f7ffd3ab77159a9b...@szxeml521-mbs.china.huawei.com
[2] 
https://www.postgresql.org/message-id/20160629183254.frcm3dgg54ud5...@alap3.anarazel.de

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-07-28 Thread Kyotaro HORIGUCHI
Thank you for the comment.

At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas  wrote 
in 
> But if we do, then I fear we'll just be reintroducing the same
> performance regression that we introduced by switching to this
> framework from the previous one - or maybe a different one, but a
> regression all the same.  Every type of intermediate node will have to
> have a code path where it uses ExecAsyncRequest() /
> ExecAyncHogeResponse() rather than ExecProcNode() to get tuples, and

I understand what Robert is concerned about, and I think I share the
same opinion. It needs a further, different framework.

At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas  wrote 
in 
> On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane  wrote:
> > I have not been paying any attention to this thread whatsoever,
> > but I wonder if you can address your problem by building on top of
> > the ExecProcNode replacement that Andres is working on,
> > https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudi...@alap3.anarazel.de
> >
> > The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> > no cost when $extra_stuff is not needed, because you simply don't insert
> > the wrapper function when it's not needed.  I'm not sure that it will
> > scale well to several different kinds of insertions though, for instance
> > if you wanted both instrumentation and async support on the same node.
> > But maybe those two couldn't be arms-length from each other anyway,
> > in which case it might be fine as-is.
> 
> Yeah, I don't quite see how that would apply in this case -- what we
> need here is not as simple as just conditionally injecting an extra
> bit.

Thank you for the pointer, Tom. The subject (segfault in HEAD...)
hadn't made me think that this kind of discussion was being held
there. Anyway, it seems very close to asynchronous execution, so
I'll catch up on it, considering how I can tie it in with this.

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [PATCH] Make ExplainBeginGroup()/ExplainEndGroup() public.

2017-07-25 Thread Kyotaro HORIGUCHI
At Fri, 21 Jul 2017 10:09:25 -0400, Hadi Moshayedi  wrote 
in 
> Hello,
> 
> The attached patch moves declarations of
> ExplainBeginGroup()/ExplainEndGroup() from explain.c to explain.h.
> 
> This can be useful for extensions that need explain groups in their
> custom-scan explain output.
> 
> For example, Citus uses groups in its custom explain outputs [1]. But it
> achieves it by having a copy of
> ExplainBeginGroup()/ExplainEndGroup() in its source code, which is not the
> best way.
> 
> Please review.
> 
> [1]
> https://github.com/citusdata/citus/blob/master/src/backend/distributed/planner/multi_explain.c

The patch is straightforward and looks fine to me.

I think these functions ought to be public, like the many
ExplainProperty*() functions. On the other hand, this patch could cause
symbol conflicts with some external modules, but I think such breakage
doesn't matter much.
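(A custom scan provider could then write something like the following
in its ExplainCustomScan callback; the group name and property here are
invented for illustration:)

    #include "postgres.h"
    #include "commands/explain.h"
    #include "nodes/execnodes.h"

    static void
    my_explain_custom_scan(CustomScanState *node, List *ancestors,
                           ExplainState *es)
    {
        /* wrap extension-specific details in their own explain group */
        ExplainBeginGroup("My Details", "My Details", true, es);
        ExplainPropertyText("Strategy", "hypothetical", es);
        ExplainEndGroup("My Details", "My Details", true, es);
    }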

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-07-25 Thread Kyotaro HORIGUCHI
Hello,

8bf58c0d9bd33686 badly conflicts with this patch, so I rebased
it and added a patch to refactor the function that Antonin
pointed out. That would be merged into the 0002 patch.

At Tue, 18 Jul 2017 16:24:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170718.162452.221576658.horiguchi.kyot...@lab.ntt.co.jp>
> I'll put an upper limit to the number of waiters processed at
> once. Then add a comment like that.
> 
> > Actually the reason I thought of simplification was that I noticed a small
> > inefficiency in the way you do the compaction. In particular, I think
> > it's not always necessary to swap the tail and head entries. Would
> > something like this make sense?
> 
> I'm not sure, but I suppose it is rare for many of the leading
> elements in the array to be not-COMPLETE. In most cases the
> first element gets a response first.
...
> Yeah, but maybe the "head" is still confusing even if reversed,
> because it is still not the head of anything.  It might be less
> confusing rewritten in a more verbose-but-straightforward way.
> 
> 
> |  int npending = 0;
> | 
> |  /* Skip over not-completed items at the beginning */
> |  while (npending < estate->es_num_pending_async &&
> |         estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
> |    npending++;
> | 
> |  /* Scan over the rest for not-completed items */
> |  for (i = npending + 1 ; i < estate->es_num_pending_async; ++i)
> |  {
> |    PendingAsyncRequest *tmp;
> |    PendingAsyncRequest *curr = estate->es_pending_async[i];
> |
> |    if (curr->state == ASYNCREQ_COMPLETE)
> |      continue;
> |
> |    /* Swap the not-completed item into the tail of the first chunk */
> |    tmp = estate->es_pending_async[npending];
> |    estate->es_pending_async[npending] = curr;
> |    estate->es_pending_async[i] = tmp;
> |    ++npending;
> |  }

The last patch does something like this (with the apparent bugs fixed).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 41ad9a7518c066da619363e6cdf8574fa00ee1e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner

In certain cases a WaitEventSet needs to be released via a resource
owner. This change adds a resource owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 ++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 68 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 4452ea4..ed71e7c 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..9543397 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int neven

Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-25 Thread Kyotaro HORIGUCHI
At Mon, 24 Jul 2017 10:23:07 +0530, Ashutosh Bapat 
 wrote in 

> On Fri, Jul 21, 2017 at 10:39 PM, Tom Lane  wrote:
> > Ashutosh Bapat  writes:
> >> On Fri, Jul 21, 2017 at 10:55 AM, Kyotaro HORIGUCHI
> >>  wrote:
> >>> The attached patch differs only in this point.
> >
> >> +1. The patch looks good to me.
> >
> > Pushed with a couple additional changes: we'd all missed that the header
> > comment for GetConnection was obsoleted by this change, and the arguments
> > for GetSysCacheHashValue really need to be coerced to Datum.  (I think
> > OID to Datum is the same as what the compiler would do anyway, but best
> > not to assume that.)
> 
> Thanks and sorry for not noticing the prologue.

Ditto.

> >
> > Back-patching was more exciting than I could wish.  It seems that
> > before 9.6, we did not have struct UserMapping storing the OID of the
> > pg_user_mapping row it had been made from.  I changed GetConnection to
> > re-look-up that row and get the OID.  But that's ugly, and there's a
> > race condition: if user mappings are being added or deleted meanwhile,
> > we might locate a per-user mapping when we're really using a PUBLIC
> > mapping or vice versa, causing the ConnCacheEntry to be labeled with
> > the wrong hash value so that it might not get invalidated properly later.
> > Still, it's significantly better than it was, and that corner case seems
> > unlikely to get hit in practice --- for one thing, you'd have to then
> > revert the mapping addition/deletion before the ConnCacheEntry would be
> > found and used again.  I don't want to take the risk of modifying struct
> > UserMapping in stable branches, so it's hard to see a way to make that
> > completely bulletproof before 9.6.
> 
> +1.

Agreed.

> -- 
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Mishandling of WCO constraints in direct foreign table modification

2017-07-21 Thread Kyotaro HORIGUCHI
At Fri, 21 Jul 2017 12:00:03 +0900, Etsuro Fujita  
wrote in <15aa9936-9bd8-c9e3-7ca1-394861073...@lab.ntt.co.jp>
> On 2017/07/21 3:24, Robert Haas wrote:
> > I think that's reasonable.  This should be committed and back-patched
> > to 9.6, right?
> 
> Yeah, because direct modify was introduced in 9.6.
> 
> Attached is the second version which updated docs in postgres-fdw.sgml
> as well.

!no local joins for the query, no row-level local BEFORE or
!AFTER triggers on the target table, and no
!CHECK OPTION constraints from parent views.
!In UPDATE,

This might be a silly question, but is CHECK OPTION a "constraint"?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-21 Thread Kyotaro HORIGUCHI
At Fri, 21 Jul 2017 11:39:38 +0530, Neha Sharma  
wrote in 
> Here is the back trace from the core dump attached.
> 
> (gdb) bt
> #0  0x7f4a71424495 in raise () from /lib64/libc.so.6
> #1  0x7f4a71425c75 in abort () from /lib64/libc.so.6
> #2  0x009dc18a in ExceptionalCondition (conditionName=0xa905d0
> "!(TransactionIdPrecedesOrEquals(oldestXact,
> ShmemVariableCache->oldestXid))",
> errorType=0xa9044f "FailedAssertion", fileName=0xa90448 "clog.c",
> lineNumber=683) at assert.c:54
> #3  0x00524215 in TruncateCLOG (oldestXact=150036635,
> oldestxid_datoid=13164) at clog.c:682

In vac_truncate_clog, TruncateCLOG is called before
SetTransactionIdLimit, which advances
ShmemVariableCache->oldestXid. Given that the assertion in
TruncateCLOG is valid, they should be called in reverse order. I
suppose that CLOG files can be safely truncated after advancing
XID limits.

By the way, the attached patch was made with "git diff --patience".

filterdiff converts it into a somewhat wrong shape. Specifically,
the result is missing the addition part of the difference, as in the
second attached patch. I'm not sure which of git (2.9.2), filterdiff
(0.3.3), or me is doing something wrong.


> #4  0x006a6be8 in vac_truncate_clog (frozenXID=150036635,
> minMulti=1, lastSaneFrozenXid=200562449, lastSaneMinMulti=1) at
> vacuum.c:1197
> #5  0x006a6948 in vac_update_datfrozenxid () at vacuum.c:1063
> #6  0x007ce0a2 in do_autovacuum () at autovacuum.c:2625
> #7  0x007cc987 in AutoVacWorkerMain (argc=0, argv=0x0) at
> autovacuum.c:1715
> #8  0x007cc562 in StartAutoVacWorker () at autovacuum.c:1512
> #9  0x007e2acd in StartAutovacuumWorker () at postmaster.c:5414
> #10 0x007e257e in sigusr1_handler (postgres_signal_arg=10) at
> postmaster.c:5111
> #11 
> #12 0x7f4a714d3603 in __select_nocancel () from /lib64/libc.so.6
> #13 0x007dde88 in ServerLoop () at postmaster.c:1717
> #14 0x007dd67d in PostmasterMain (argc=3, argv=0x2eb8b00) at
> postmaster.c:1361
> #15 0x0071a218 in main (argc=3, argv=0x2eb8b00) at main.c:228
> (gdb) print ShmemVariableCache->oldestXid
> $3 = 548
> 
> 
> Regards,
> Neha Sharma
> 
> On Fri, Jul 21, 2017 at 11:01 AM, Thomas Munro <
> thomas.mu...@enterprisedb.com> wrote:
> 
> > On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
> >  wrote:
> > >
> > > Attached is the core dump file received on PG 10beta2 version.
> >
> > Thanks Neha.  It's be best to post the back trace and if possible
> > print oldestXact and ShmemVariableCache->oldestXid from the stack
> > frame for TruncateCLOG.
> >
> > The failing assertion in TruncateCLOG() has a comment that says
> > "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
> > calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
> > *after* it calls TruncateCLOG().  What am I missing here?
> >
> > What actually prevents ShmemVariableCache->oldestXid from going
> > backwards anyway?  Suppose there are two or more autovacuum processes
> > that reach vac_truncate_clog() concurrently.  They do a scan of
> > pg_database whose tuples they access without locking through a
> > pointer-to-volatile because they expect concurrent in-place writers,
> > come up with a value for frozenXID, and then arrive at
> > SetTransactionIdLimit() in whatever order and clobber
> > ShmemVariableCache->oldestXid.  What am I missing here?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com
> >

-- 
Kyotaro Horiguchi

NTT Open Source Software Center, Nippon Telegraph and Telephone Corporation
Phone: 03-5860-5115 / Fax: 03-5463-5490
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa1812..cd8be92 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1192,13 +1192,6 @@ vac_truncate_clog(TransactionId frozenXID,
 	AdvanceOldestCommitTsXid(frozenXID);
 
 	/*
-	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
-	 */
-	TruncateCLOG(frozenXID, oldestxid_datoid);
-	TruncateCommitTs(frozenXID);
-	TruncateMultiXact(minMulti, minmulti_datoid);
-
-	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
 	 * MultiXactIds.  Note: these functions will also signal the postmaster
 	 * for an(other) autovac cycle if needed.   XXX should we avoid possibly
@@ -1206,6 +1199,14 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+
+	/*
+	 * Truncate CLOG, multixact and CommitTs to the oldest computed value
+	 * after advancing xid limits.
+	 */
+	TruncateCLOG(frozenXID, oldestxid_datoid);
+	TruncateCommitTs(frozenXID);
+	TruncateMultiXact(minMulti, minmulti_datoid);
 }
 
 
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
***
*** 1192,1204  vac_truncate_clog(TransactionId frozenXID,
  	AdvanceOldestCommitTsXid(frozenXID);
  
  	/*
- 	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
- 

Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-20 Thread Kyotaro HORIGUCHI
At Thu, 20 Jul 2017 18:23:05 -0400, Alvaro Herrera  
wrote in <20170720222305.ij3pk7qw5im3wozr@alvherre.pgsql>
> Kyotaro HORIGUCHI wrote:
> 
> > Finally, I added a new TAP test library PsqlSession. It offers
> > interactive psql sessions. Then added a simple test to
> > postgres_fdw using it.
> 
> Hmm, I think this can be very useful for other things.  Let's keep this
> in mind to use in the future, even if we find another way to fix the
> issue at hand.  In fact, I had a problem a couple of weeks ago in which
> I needed two concurrent sessions and one of them disconnected in the
> middle of the test.  Can't do that with isolationtester ...

Thanks. I agree that it is still useful for writing more complex
tests. The most significant issue with this (PsqlSession.pm) comes
from the fact that I didn't find a way to detect the end of a
query's execution without attaching a bogus query. And this kind
of thing tends to be unstable in a high-load environment.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-20 Thread Kyotaro HORIGUCHI
At Thu, 20 Jul 2017 18:15:42 -0400, Tom Lane  wrote in 
<18927.1500588...@sss.pgh.pa.us>
> This seems like overkill.  We can test it reasonably easily within the
> existing framework, as shown in the attached patch.  I'm also fairly

It checks for a disconnection caused within a single session. I
thought that its inter-process characteristic was important
(since I had forgotten that in the previous version), but it is
reasonable enough if we can rely on the fact that it surely works
through the invalidation mechanism.

In short, I agree with the test in your patch.

> concerned that what you're showing here would be unstable in the buildfarm
> as a result of race conditions between the multiple sessions.

Sure. It is what I meant by 'fragile'.

> I made some cosmetic updates to the code patch, as well.

Thank you for leaving the hashvalue stuff and revising the comment.

By the way, I had mistakenly left the following code in the
previous patch.

+ /* hashvalue == 0 means a cache reset, must clear all state */
+ if (hashvalue == 0)
+   entry->invalidated = true;
+ else if ((cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue) ||
+  (cacheid == USERMAPPINGOID &&
+   entry->mapping_hashvalue == hashvalue))
+   entry->invalidated = true;

The reason for the redundancy was that the else block just before
used a switch-case. However, that is no longer the case, so the
redundancy is no longer reasonable. I'd like to change it as follows.

+ /* hashvalue == 0 means a cache reset, must clear all state */
+ if ((hashvalue == 0) ||
+ ((cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue) ||
+  (cacheid == USERMAPPINGOID &&
+   entry->mapping_hashvalue == hashvalue)))
+   entry->invalidated = true;

The attached patch differs only in this point.
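
For the record, the whole callback is then shaped roughly as follows.
This is just a sketch assuming a plain hash_seq_search() scan over
ConnectionHash; the attached patch is the authoritative version.

    static void
    pgfdw_inval_callback(Datum arg, int cacheid, uint32 hashvalue)
    {
        HASH_SEQ_STATUS scan;
        ConnCacheEntry *entry;

        Assert(cacheid == FOREIGNSERVEROID || cacheid == USERMAPPINGOID);

        /* ConnectionHash already exists, given where we register this */
        hash_seq_init(&scan, ConnectionHash);
        while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)) != NULL)
        {
            /* Ignore cache entries that have no live connection */
            if (entry->conn == NULL)
                continue;

            /* hashvalue == 0 means a cache reset, must clear all state */
            if (hashvalue == 0 ||
                (cacheid == FOREIGNSERVEROID &&
                 entry->server_hashvalue == hashvalue) ||
                (cacheid == USERMAPPINGOID &&
                 entry->mapping_hashvalue == hashvalue))
                entry->invalidated = true;
        }
    }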

> I think this is actually a bug fix, and should not wait for the next
> commitfest.

Agreed.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/contrib/postgres_fdw/connection.c
--- b/contrib/postgres_fdw/connection.c
***
*** 22,27 
--- 22,28 
  #include "pgstat.h"
  #include "storage/latch.h"
  #include "utils/hsearch.h"
+ #include "utils/inval.h"
  #include "utils/memutils.h"
  #include "utils/syscache.h"
  
***
*** 48,58  typedef struct ConnCacheEntry
--- 49,63 
  {
  	ConnCacheKey key;			/* hash key (must be first) */
  	PGconn	   *conn;			/* connection to foreign server, or NULL */
+ 	/* Remaining fields are invalid when conn is NULL: */
  	int			xact_depth;		/* 0 = no xact open, 1 = main xact open, 2 =
   * one level of subxact open, etc */
  	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
  	bool		have_error;		/* have any subxacts aborted in this xact? */
  	bool		changing_xact_state;	/* xact state change in process */
+ 	bool		invalidated;	/* true if reconnect is pending */
+ 	uint32		server_hashvalue;	/* hash value of foreign server OID */
+ 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
  } ConnCacheEntry;
  
  /*
***
*** 69,74  static bool xact_got_connection = false;
--- 74,80 
  
  /* prototypes of private functions */
  static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+ static void disconnect_pg_server(ConnCacheEntry *entry);
  static void check_conn_params(const char **keywords, const char **values);
  static void configure_remote_session(PGconn *conn);
  static void do_sql_command(PGconn *conn, const char *sql);
***
*** 78,83  static void pgfdw_subxact_callback(SubXactEvent event,
--- 84,90 
  	   SubTransactionId mySubid,
  	   SubTransactionId parentSubid,
  	   void *arg);
+ static void pgfdw_inval_callback(Datum arg, int cacheid, uint32 hashvalue);
  static void pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry);
  static bool pgfdw_cancel_query(PGconn *conn);
  static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
***
*** 130,135  GetConnection(UserMapping *user, bool will_prep_stmt)
--- 137,146 
  		 */
  		RegisterXactCallback(pgfdw_xact_callback, NULL);
  		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
+ 		CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
+ 	  pgfdw_inval_callback, (Datum) 0);
+ 		CacheRegisterSyscacheCallback(USERMAPPINGOID,
+ 	  pgfdw_inval_callback, (Datum) 0);
  	}
  
  	/* Set flag that we did GetConnection during the current transaction */
***
*** 144,161  GetConnection(UserMapping *user, bool will_prep_stmt)
  	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
  	if (!found)
  	{
! 		/* initialize new hashtable entry (key is already filled in) */
  		entry->conn = NULL;
- 		entry-

Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-19 Thread Kyotaro HORIGUCHI
Finally, I added a new TAP test library, PsqlSession.

At Tue, 18 Jul 2017 18:12:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170718.181213.206979369.horiguchi.kyot...@lab.ntt.co.jp>
> > * How about some regression test cases?  You couldn't really exercise
> > cross-session invalidation easily, but I don't think you need to.
> 
> Ha Ha. You got me. I will add some test cases for this in the
> next version. Thanks.

Here it is. First I tried this using the ordinary regression
framework, but the effect of this patch shows up only in the log,
which contains variable parts, so I gave it up before trying a
more complex way.

Next I tried the existing TAP framework, but this test needs a
continuous session to achieve alternating operations on two
sessions, and PostgresNode::psql doesn't offer such functionality.

Finally, I added a new TAP test library PsqlSession. It offers
interactive psql sessions. Then added a simple test to
postgres_fdw using it.

The first patch is the PsqlSession.pm and the second is the new
test for postgres_fdw.

- The current PsqlSession is quite fragile but seems to work well
  enough for this usage at present.

- I'm afraid this might not work on Windows, according to the
  manpage of IPC::Run, but I haven't confirmed that yet.

  http://search.cpan.org/~toddr/IPC-Run-0.96/lib/IPC/Run.pm#Win32_LIMITATIONS


Any comment or suggestions are welcome.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From fdb5cbab3375d9d2e4da078cf6ee7eaf7de5c8fd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 20 Jul 2017 14:56:51 +0900
Subject: [PATCH 1/2] Add new TAP test library PsqlSession.pm

PostgresNode::psql makes a temporary session to run commands, so it is
not usable when more interactive operation on a continued session is
needed. This library offers a continuous-session feature that can
execute multiple SQL commands separately.
---
 src/test/perl/PsqlSession.pm | 341 +++
 1 file changed, 341 insertions(+)
 create mode 100644 src/test/perl/PsqlSession.pm

diff --git a/src/test/perl/PsqlSession.pm b/src/test/perl/PsqlSession.pm
new file mode 100644
index 000..d69fd14
--- /dev/null
+++ b/src/test/perl/PsqlSession.pm
@@ -0,0 +1,341 @@
+
+=pod
+
+=head1 NAME
+
+PsqlSession - class representing PostgreSQL psql instance
+
+=head1 SYNOPSIS
+
+  use PsqlSession;
+
+  my $session = get_new_session('session1', $server);
+
+  to connect to a PostgresNode as $server, or
+
+  my $session = get_new_session('session1', 'localhost', '5432', 'postgres');
+
+  to specify each options explicitly.
+
+  # Connect to the server
+  $session->open();
+
+  # Execute an SQL query
+  $ret = $session->execsql('SELECT now();');
+
+  In array context, returns a pair of the stdout and stderr output.
+
+  ($out, $err) = $session->execsql('SELECT now();');
+
+  $session->execsql_multi('SELECT 1;', 'SELECT now();');
+
+  is just a shortcut for writing many execsqls.
+
+  # End the session
+  $session->close();
+
+=head1 DESCRIPTION
+
+PsqlSession contains a set of routines for working with a psql session,
+allowing one to connect, send a command, receive the result, and close.
+
+The IPC::Run module is required.
+
+=cut
+
+package PsqlSession;
+
+use strict;
+use warnings;
+
+use Exporter 'import';
+use Test::More;
+use TestLib ();
+use Scalar::Util qw(blessed);
+
+our @EXPORT = qw(
+  get_new_session
+);
+
+
+=pod
+
+=head1 METHODS
+
+=over
+
+=item PsqlSession::new($class, $name, $pghost, $pgport, $dbname)
+
+Create a new PsqlSession instance. Does not connect.
+
+You should generally prefer to use get_new_session() instead, since it
+takes care of finding the host name, port number and database name.
+
+=cut
+
+sub new
+{
+	my ($class, $name, $pghost, $pgport, $dbname) = @_;
+
+	my $self = {
+		_name => $name,
+		_host => $pghost,
+		_port => $pgport,
+		_dbname => $dbname };
+
+	bless $self, $class;
+
+#	$self->dump_info;
+
+	return $self;
+}
+
+=pod
+
+=item $session->name()
+
+The name assigned to the session at creation time.
+
+=cut
+
+sub name
+{
+	return $_[0]->{_name};
+}
+
+=pod
+
+=item $session->host()
+
+Return the host (like PGHOST) for this instance. May be a UNIX socket path.
+
+=cut
+
+sub host
+{
+	return $_[0]->{_host};
+}
+
+=pod
+
+=item $session->port()
+
+Get the port number this session connects to. This won't necessarily
+be a TCP port open on the local host, since we prefer to use Unix
+sockets if possible.
+
+=cut
+
+sub port
+{
+	return $_[0]->{_port};
+}
+
+=pod
+
+=item $session->dbname()
+
+Get the database name this session connects to.
+
+=cut
+
+sub dbname
+{
+	return $_[0]->{_dbname};
+}
+
+=pod
+
+=item $session->errstate()
+
+Get the error state of this session. 0 means no error and 1 means
+error. This value is reset at the start of every execution of a

Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-18 Thread Kyotaro HORIGUCHI
Thank you for the comments.

At Mon, 17 Jul 2017 16:09:04 -0400, Tom Lane  wrote in 
<9897.1500322...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI  writes:
> > This is the revased and revised version of the previous patch.
> 
> A few more comments:
> 
> * Per the spec for CacheRegisterSyscacheCallback, a zero hash value means
> to flush all associated state.  This isn't handling that case correctly.

Right, fixed.

> Even when you are given a specific hash value, I think exiting after
> finding one match is incorrect: there could be multiple connections
> to the same server with different user mappings, or vice versa.

Sure. I was confused, thinking that a key hash value identifies a
single entry in "the connection cache". Thank you for pointing that out.

> * I'm not really sure that it's worth the trouble to pay attention
> to the hash value at all.  Very few other syscache callbacks do,
> and the pg_server/pg_user_mapping catalogs don't seem likely to
> have higher than average traffic.

Agreed on those points. But there is another consideration: reconnection
is far more expensive than re-looking-up a system catalog, or maybe
even than replanning. For now I chose to avoid the possibility of
causing a massive number of simultaneous reconnections.

> * Personally I'd be inclined to register the callbacks at the same
> time the hashtables are created, which is a pattern we use elsewhere
> ... in fact, postgres_fdw itself does it that way for the transaction
> related callbacks, so it seems a bit weird that you are going in a
> different direction for these callbacks.  That way avoids the need to
> depend on a _PG_init function and means that the callbacks don't have to
> worry about the hashtables not being valid.

Sure, that seems more reasonable than the current arrangement. I changed
the way of registering the callbacks in the attached.

>  Also, it seems a bit
> pointless to have separate layers of postgresMappingSysCallback and
> InvalidateConnectionForMapping functions.

It used to be one function, but it seemed a bit weird that the
function was called from two places (two catalogs) and then branched
according to the caller. I don't have a firm opinion on this, so I
changed it.

As a result, the changes in postgres_fdw.c have disappeared.

> * How about some regression test cases?  You couldn't really exercise
> cross-session invalidation easily, but I don't think you need to.

Ha Ha. You got me. I will add some test cases for this in the
next version. Thanks.


Ashutosh, I'll register this to the next CF after providing a
regression test, thanks.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/contrib/postgres_fdw/connection.c
--- b/contrib/postgres_fdw/connection.c
***
*** 22,27 
--- 22,28 
  #include "pgstat.h"
  #include "storage/latch.h"
  #include "utils/hsearch.h"
+ #include "utils/inval.h"
  #include "utils/memutils.h"
  #include "utils/syscache.h"
  
***
*** 53,58  typedef struct ConnCacheEntry
--- 54,62 
  	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
  	bool		have_error;		/* have any subxacts aborted in this xact? */
  	bool		changing_xact_state;	/* xact state change in process */
+ 	bool		invalidated;	/* true if reconnect is required */
+ 	uint32		server_hashvalue;	/* hash value of foreign server oid */
+ 	uint32		mapping_hashvalue;  /* hash value of user mapping oid */
  } ConnCacheEntry;
  
  /*
***
*** 69,74  static bool xact_got_connection = false;
--- 73,79 
  
  /* prototypes of private functions */
  static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+ static void disconnect_pg_server(ConnCacheEntry *entry);
  static void check_conn_params(const char **keywords, const char **values);
  static void configure_remote_session(PGconn *conn);
  static void do_sql_command(PGconn *conn, const char *sql);
***
*** 78,83  static void pgfdw_subxact_callback(SubXactEvent event,
--- 83,89 
  	   SubTransactionId mySubid,
  	   SubTransactionId parentSubid,
  	   void *arg);
+ static void pgfdw_inval_callback(Datum arg, int cacheid, uint32 hashvalue);
  static void pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry);
  static bool pgfdw_cancel_query(PGconn *conn);
  static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
***
*** 130,135  GetConnection(UserMapping *user, bool will_prep_stmt)
--- 136,145 
  		 */
  		RegisterXactCallback(pgfdw_xact_callback, NULL);
  		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
+ 		CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
+ 	  pgfdw_inval_callback, (Datum)0);
+ 		CacheRegisterSyscacheCallback(USERMAPPINGOID,
+ 	  pgfdw_inval_callback, (Datum)0);
  	}
  
  	/* Set flag that we did Ge

Re: [HACKERS] asynchronous execution

2017-07-18 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 11 Jul 2017 10:28:51 +0200, Antonin Houska  wrote in 
<6448.1499761731@localhost>
> Kyotaro HORIGUCHI  wrote:
> > Effectively it is a waiting-queue followed by a
> > completed-list. The point of the compaction is keeping the order
> > of waiting or not-yet-completed requests, which is crucial to
> > avoid a kind of precedence inversion. We cannot keep the order by
> > using bitmapset in such way.
> 
> > The current code waits all waiters at once and processes all
> > fired events at once. The order in the waiting-queue is
> > inessential in the case. On the other hand I suppoese waiting on
> > several-tens to near-hundred remote hosts is in a realistic
> > target range. Keeping the order could be crucial if we process a
> > part of the queue at once in the case.
> > 
> > Putting significance on the deviation of response time of
> > remotes, process-all-at-once is effective. In turn we should
> > consider the effectiveness of the lifecycle of the larger wait
> > event set.
> 
> ok, I missed the fact that the order of es_pending_async entries is
> important. I think this is worth adding a comment.

I'll put an upper limit on the number of waiters processed at
once, then add a comment like that.

> Actually the reason I thought of simplification was that I noticed a small
> inefficiency in the way you do the compaction. In particular, I think it's not
> always necessary to swap the tail and head entries. Would something like this
> make sense?

I'm not sure, but I suppose it is rare for all of the first several
elements in the array to be not-COMPLETE. In most cases the first
element gets a response first.

> 
>   /* If any node completed, compact the array. */
>   if (any_node_done)
>   {
...
>   for (tidx = 0; tidx < estate->es_num_pending_async; 
> ++tidx)
>   {
...
>   if (tail->state == ASYNCREQ_COMPLETE)
>   continue;
> 
>   /*
>* If the array starts with one or more 
> incomplete requests,
>* both head and tail point at the same item, 
> so there's no
>* point in swapping.
>*/
>   if (tidx > hidx)
>   {

This works to skip over the leading run of not-yet-completed
elements, which are already in place and need no swapping. I think
it makes sense as long as it doesn't harm the loop. The optimization
is more effective when hoisted out of the loop like this:

|  for (tidx = 0; tidx < estate->es_num_pending_async &&
|       estate->es_pending_async[tidx]->state != ASYNCREQ_COMPLETE;
|       ++tidx)
|      ;
|  for (; tidx < estate->es_num_pending_async; ++tidx)
...


> And besides that, I think it'd be more intuitive if the meaning of "head" and
> "tail" was reversed: if the array is iterated from lower to higher positions,
> then I'd consider head to be at higher position, not tail.

Yeah, but maybe the "head" is still confusing even if reversed,
because it is still not the head of anything. It might be less
confusing to rewrite it in a more verbose-but-straightforward way:


|  int npending = 0;
|
|  /* Skip over not-completed items at the beginning */
|  while (npending < estate->es_num_pending_async &&
|         estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
|      npending++;
|
|  /* Scan over the rest for not-completed items */
|  for (i = npending + 1; i < estate->es_num_pending_async; ++i)
|  {
|      PendingAsyncRequest *tmp;
|      PendingAsyncRequest *curr = estate->es_pending_async[i];
|
|      if (curr->state == ASYNCREQ_COMPLETE)
|          continue;
|
|      /* Swap the not-completed item into the tail of the first chunk */
|      tmp = estate->es_pending_async[npending];
|      estate->es_pending_async[npending] = curr;
|      estate->es_pending_async[i] = tmp;
|      ++npending;
|  }
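
Just to illustrate the loop above: starting from the states
[C, P, C, P] (C = COMPLETE, P = still pending), it leaves
[P, P, C, C] with npending = 2, preserving the relative order of
the pending items.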


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-14 Thread Kyotaro HORIGUCHI
Thank you for the comments.

At Thu, 13 Jul 2017 16:54:42 +0530, Ashutosh Bapat 
 wrote in 

> On Thu, Jul 13, 2017 at 2:53 PM, Kyotaro HORIGUCHI
>  wrote:
> > Hello, moved to pgsql-hackers.
> >
> > This is the revased and revised version of the previous patch.
> >
> > At Thu, 13 Jul 2017 13:42:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> >  wrote in 
> > <20170713.134249.97825982.horiguchi.kyot...@lab.ntt.co.jp>
> > This patch is postgres_fdw-private but it's annoying that hash
> > value of syscache is handled in connection.c. If we allow to add
> > something to the core for this feature, I could add a new member
> > in FdwRoutine to notify invalidation of mapping and server by
> > oid. (Of course it is not back-patcheable, though)
> >
> > Does anyone have opinitons or suggestions?
> >
> 
> The patch and the idea looks good to me. I haven't reviewed it
> thoroughly though.
> 
> I noticed a typo "suporious", I think you meant "spurious"? Probably

Right, it was a bad typo, but I fixed it as "unnecessary", which
would be more appropriate here.

> that comment should be part of the function which marks the connection
> as invalid e.g. InvalidateConnectionForMapping().

Agreed. It had been there, but somehow I moved it away. I have
moved it back to the place it used to be.

> pgfdw_xact_callback() reports the reason for disconnection while
> closing a connection. Maybe we want to report the reason for
> disconnection here as well. Also, maybe we want to create a function

Agreed. Also, I had placed a LOG message there but removed it. Now it
emits a DEBUG3 message as shown below.

| DEBUG: closing connection 0 for option changes to take effect
| DEBUG: new postgres_fdw connection 0 for server ".." (user mapping oid 

> to discard connection from an entry so that we consistently do the
> same things while discarding a connection.

Sure. Now there are two places where a connection is closed intentionally.

I'm a bit uneasy that many members of the entry were getting reset
in so many places. Since the validity of an entry is checked only by
conn, it is enough to clear the flags of ConnCacheEntry only at the
time of connection creation. Instead,
pgfdw_reject_incomplete_xact_state_change no longer complains about
an inactive (conn == NULL) entry. I think this is safe, but I'm a
bit unconfident.
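
For reference, the disconnection path added to GetConnection() amounts
to roughly the sketch below; the condition is the one visible in the
attached patch, and disconnect_pg_server() is the new helper declared
there.

    /*
     * If the connection has been invalidated and no transaction is
     * using it, close it now so that the next use reconnects with
     * the new options.
     */
    if (entry->conn != NULL && entry->invalidated && entry->xact_depth == 0)
        disconnect_pg_server(entry);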

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/contrib/postgres_fdw/connection.c
--- b/contrib/postgres_fdw/connection.c
***
*** 53,58  typedef struct ConnCacheEntry
--- 53,61 
  	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
  	bool		have_error;		/* have any subxacts aborted in this xact? */
  	bool		changing_xact_state;	/* xact state change in process */
+ 	bool		invalidated;	/* true if reconnect is required */
+ 	uint32		server_hashvalue;	/* hash value of foreign server oid */
+ 	uint32		mapping_hashvalue;  /* hash value of user mapping oid */
  } ConnCacheEntry;
  
  /*
***
*** 69,74  static bool xact_got_connection = false;
--- 72,78 
  
  /* prototypes of private functions */
  static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+ static void disconnect_pg_server(ConnCacheEntry *entry);
  static void check_conn_params(const char **keywords, const char **values);
  static void configure_remote_session(PGconn *conn);
  static void do_sql_command(PGconn *conn, const char *sql);
***
*** 144,160  GetConnection(UserMapping *user, bool will_prep_stmt)
  	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
  	if (!found)
  	{
! 		/* initialize new hashtable entry (key is already filled in) */
  		entry->conn = NULL;
- 		entry->xact_depth = 0;
- 		entry->have_prep_stmt = false;
- 		entry->have_error = false;
- 		entry->changing_xact_state = false;
  	}
  
  	/* Reject further use of connections which failed abort cleanup. */
  	pgfdw_reject_incomplete_xact_state_change(entry);
  
  	/*
  	 * We don't check the health of cached connection here, because it would
  	 * require some overhead.  Broken connection will be detected when the
--- 148,176 
  	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
  	if (!found)
  	{
! 		/*
! 		 * key is already filled in, flags will be initialized at the time of
! 		 * making a new connection, so just clear conn here.
! 		 */
  		entry->conn = NULL;
  	}
  
  	/* Reject further use of connections which failed abort cleanup. */
  	pgfdw_reject_incomplete_xact_state_change(entry);
  
+ 
+ 	/*
+ 	 * This connection is no longer valid. Disconnect such connections if no
+ 	 * transaction is running.
+ 	 */
+ 	if (entry->conn != NULL && entry->invalidated && entry->xact_depth == 0)
+ 

  1   2   3   4   5   6   7   8   9   10   >