Re: Tab completion for ALTER INDEX|TABLE ALTER COLUMN SET STATISTICS

2018-12-20 Thread Michael Paquier
On Fri, Dec 21, 2018 at 02:51:51PM +0900, Tatsuro Yamada wrote:
> Attached file is a WIP patch.

Before sorting out the exotic part of the feature, why not first fix
the simple cases where we know that tab completion is broken and can be
improved?  This is what I meant in this email:
https://www.postgresql.org/message-id/20181219092255.gc...@paquier.xyz

The attached patch implements the idea.  What do you think?
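
For reference, a rough sketch of the statements this aims at completing; the
relation and column names are made up, and for an index the column number has
to refer to an expression column for SET STATISTICS to be accepted:

    ALTER INDEX some_idx ALTER COLUMN 2 SET STATISTICS 500;
    ALTER INDEX some_idx ALTER 2 SET STATISTICS 500;
    ALTER TABLE some_tab ALTER COLUMN some_col SET STATISTICS 500;
    ALTER TABLE some_tab ALTER some_col SET STATISTICS 500;

With the patch, the ALTER INDEX forms get "SET STATISTICS" offered whether or
not the COLUMN keyword is typed, and in all forms psql deliberately offers
nothing after STATISTICS, since an integer has to be specified.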
--
Michael
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5ba6ffba8c..7b603508db 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1566,9 +1566,20 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("PARTITION");
 	else if (Matches("ALTER", "INDEX", MatchAny, "ATTACH", "PARTITION"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes, NULL);
-	/* ALTER INDEX  ALTER COLUMN  */
-	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLUMN", MatchAny))
+	/* ALTER INDEX  ALTER [COLUMN]  */
+	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", MatchAny) ||
+			 Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLUMN", MatchAny))
 		COMPLETE_WITH("SET STATISTICS");
+	/* ALTER INDEX  ALTER [COLUMN]  SET */
+	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", MatchAny, "SET") ||
+			 Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLUMN", MatchAny, "SET"))
+		COMPLETE_WITH("STATISTICS");
+	/* ALTER INDEX  ALTER [COLUMN]  SET STATISTICS */
+	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", MatchAny, "SET", "STATISTICS") ||
+			 Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLUMN", MatchAny, "SET", "STATISTICS"))
+	{
+		/* Enforce no completion here, as an integer has to be specified */
+	}
 	/* ALTER INDEX  SET */
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET"))
 		COMPLETE_WITH("(", "TABLESPACE");
@@ -1874,6 +1885,12 @@ psql_completion(const char *text, int start, int end)
 	else if (Matches("ALTER", "TABLE", MatchAny, "ALTER", "COLUMN", MatchAny, "SET", "STORAGE") ||
 			 Matches("ALTER", "TABLE", MatchAny, "ALTER", MatchAny, "SET", "STORAGE"))
 		COMPLETE_WITH("PLAIN", "EXTERNAL", "EXTENDED", "MAIN");
+	/* ALTER TABLE ALTER [COLUMN]  SET STATISTICS */
+	else if (Matches("ALTER", "TABLE", MatchAny, "ALTER", MatchAny, "SET", "STATISTICS") ||
+			 Matches("ALTER", "TABLE", MatchAny, "ALTER", "COLUMN", MatchAny, "SET", "STATISTICS"))
+	{
+		/* Enforce no completion here, as an integer has to be specified */
+	}
 	/* ALTER TABLE ALTER [COLUMN]  DROP */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ALTER", "COLUMN", MatchAny, "DROP") ||
 			 Matches("ALTER", "TABLE", MatchAny, "ALTER", MatchAny, "DROP"))




RE: Timeout parameters

2018-12-20 Thread Nagaura, Ryohei
Hi, Fabien.

The next CF will start soon, so I want to restart the discussion.

> About "socket_timeout"
> If you face the following situation, this parameter will be needed.
If you feel that this situation can't happen or the use case is too limited,
please point that out.

> > I think that there is some kind of a misnomer: this is not a
> > socket-level timeout, but a client-side query timeout, so it should be
> named differently?
> Yes, I think so.
> 
> > I'm not sure how to name it, though.
> Me too.
Since I want the parameter name to reflect what it monitors, let's decide
on the name as part of the design.

> > I think that the way it works is a little extreme, basically the
> > connection is aborted from within pqWait, and then restarted from scratch.
Which behavior seems problematic to you: aborting the connection, or
restarting it from scratch?  Or both?

> > There is no clear way to know about the value of the setting (SHOW,
> > \set, \pset...).
That is a nice idea!
If this parameter implementation is decided on, I'll also add these features.

> About "TCP_USER_TIMEOUT"
> I introduce the test methods of TCP_USER_TIMEOUT.
I could only come up with a test method that uses "iptables".
Since this command can be used only by root, I didn't create a script.

Best regards,
-
Ryohei Nagaura





Speeding up creating UPDATE/DELETE generic plan for partitioned table into a lot

2018-12-20 Thread Kato, Sho
Hi,
I want to speed up the creation of UPDATE/DELETE generic plans for tables
partitioned into many partitions.

Currently, when creating a generic plan for UPDATE/DELETE on such a table, the
planner creates a plan that scans all partitions.
So it takes a very long time.
I tried with a table partitioned into 8192 partitions, and it took 12 seconds.

*setup*

postgres=# create table t(aid int, abalance int) partition by range(aid);
CREATE TABLE
postgres=# \o /dev/null
postgres=# select 'create table t_' || x || ' partition of t for values from (' 
|| x || ') to (' || x+1 || ')' from generate_series(1, 8192) x;
postgres=# \gexec
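
(The prepared statement itself is not shown here; a minimal sketch of what
update_stmt presumably looks like, so that the EXPLAIN below can be read in
context.  The SET clause is a guess:)

postgres=# prepare update_stmt(int) as update t set abalance = abalance + 1 where aid = $1;
PREPARE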

*explain analyze*

I use master(commit 71a05b2232 Wed Dec 5) + v8 patch[1] + v1 patch[2]

postgres=# explain analyze execute update_stmt(999);
   QUERY PLAN
-
 Update on t  (cost=0.00..313569.28 rows=90112 width=14) (actual 
time=42.805..42.805 rows=0 loops=1)
   Update on t_1
   Update on t_2
   Update on t_3
   ...
   ->  Seq Scan on t_1  (cost=0.00..38.28 rows=11 width=14) (actual 
time=0.021..0.022 rows=0 loops=1)
 Filter: (aid = $1)
   ->  Seq Scan on t_2  (cost=0.00..38.28 rows=11 width=14) (actual 
time=0.004..0.005 rows=0 loops=1)
 Filter: (aid = $1)
   ->  Seq Scan on t_3  (cost=0.00..38.28 rows=11 width=14) (actual 
time=0.004..0.004 rows=0 loops=1)
 Filter: (aid = $1)
   ->  Seq Scan on t_4  (cost=0.00..38.28 rows=11 width=14) (actual 
time=0.004..0.005 rows=0 loops=1)
   ...
 Planning Time: 12367.833 ms
 Execution Time: 490.082 ms
(24579 rows)

In most cases only some of the partitions are actually accessed, so I think the
planner does not need to create a Scan path for every partition.
Is there any better way? For example, could the planner create generic plans from
the parameters specified for EXECUTE?

[1]:https://www.postgresql.org/message-id/9d7c5112-cb99-6a47-d3be-cf1ee6862...@lab.ntt.co.jp
[2]:https://www.postgresql.org/message-id/CAKJS1f-=fnmqmqp6qitkd+xeddxw22yslp-0xfk3jaqux2y...@mail.gmail.com

regards,
Sho Kato




Re: Remove Deprecated Exclusive Backup Mode

2018-12-20 Thread David Steele

On 12/21/18 2:01 AM, Michael Paquier wrote:

On Thu, Dec 20, 2018 at 12:29:48PM +0200, David Steele wrote:

Cannot move patch to the same commitfest, and multiple future commitfests
exist!


I am not sure what it means either.  What if you just mark the existing
entry as returned with feedback, and create a new one ahead?


Yeah, I can do that, but it would be nice to know why this is not 
working.  It would also be nice to preserve the history.


Magnus?

--
-David
da...@pgmasters.net



Re: Change pgarch_readyXlog() to return .history files first

2018-12-20 Thread David Steele

On 12/21/18 6:49 AM, Kyotaro HORIGUCHI wrote:

"else if (!historyFound || ishistory)"




strcpy(xlog, newxlog);


The caller prepares sufficient memory for basename, and we no
longer copy ".ready" into newxlog. Douldn't we work directly on
xlog instead of allocating newxlog?


I thought about doing that, but wanted to focus on the task at hand.  It 
does save a strcpy and a bit of stack space, so seems like a win.


Overall, the patch looks good to me.  I think breaking up the if does 
make the code more readable.


Regards,
--
-David
da...@pgmasters.net



Re: Add timeline to partial WAL segments

2018-12-20 Thread David Steele

On 12/20/18 10:56 PM, Robert Haas wrote:

On Fri, Dec 14, 2018 at 6:05 PM David Steele  wrote:

The question in my mind: is it safe to back-patch?


I cannot imagine it being a good idea to stick a behavioral change
like this into a minor release.  Yeah, it lets people get out from
under this problem a lot sooner, but it potentially breaks backup
scripts even for people who were not suffering in the first place.  I
don't think solving this problem for the people who have it is worth
inflicting that kind of breakage on everybody.


Fair enough -- figured I would make an argument and see what people thought.

It's easy enough to solve on my end, but I'll wait and see what naming 
scheme we come up with.


--
-David
da...@pgmasters.net



Re: Tab completion for ALTER INDEX|TABLE ALTER COLUMN SET STATISTICS

2018-12-20 Thread Tatsuro Yamada

On 2018/12/20 10:47, Tatsuro Yamada wrote:

On 2018/12/20 10:38, Michael Paquier wrote:

On Thu, Dec 20, 2018 at 10:05:30AM +0900, Tatsuro Yamada wrote:

Alright, I'll create new patches including these:

   - No completion after "ALTER TABLE/INDEX SET STATISTICS" instead of schema 
names
   - Complete "ALTER INDEX foo ALTER COLUMN SET" with STATISTICS by
   using *column_numbers*


Thanks for considering it!


My pleasure, Neo. :)
Please wait for new WIP patches.


Attached file is a WIP patch.

*Example of after patching

create table hoge (a integer, b integer, c integer);
create index ind_hoge on hoge(a, b, c, (c*1), (c*2), (c*3), (c*4), (c*5), 
(c*6), (c*7), (c*8), (c*9));

# \d+ ind_hoge
Index "public.ind_hoge"
 Column |  Type   | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
 a  | integer | yes  | a  | plain   |
 b  | integer | yes  | b  | plain   |
 c  | integer | yes  | c  | plain   |
 expr   | integer | yes  | (c * 1)| plain   |
 expr1  | integer | yes  | (c * 2)| plain   |
 expr2  | integer | yes  | (c * 3)| plain   |
 expr3  | integer | yes  | (c * 4)| plain   |
 expr4  | integer | yes  | (c * 5)| plain   |
 expr5  | integer | yes  | (c * 6)| plain   |
 expr6  | integer | yes  | (c * 7)| plain   |
 expr7  | integer | yes  | (c * 8)| plain   |
 expr8  | integer | yes  | (c * 9)| plain   |
btree, for table "public.hoge"

# alter index ind_hoge alter column 
1   10  11  12  2   3   4   5   6   7   8   9

# alter index ind_hoge alter column 1 
1   10  11  12

# alter index ind_hoge alter column 10 SET STATISTICS 


# alter index ind_hoge alter column 10 SET STATISTICS 100;
ALTER INDEX

# \d+ ind_hoge
Index "public.ind_hoge"
 Column |  Type   | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
 a  | integer | yes  | a  | plain   |
 b  | integer | yes  | b  | plain   |
 c  | integer | yes  | c  | plain   |
 expr   | integer | yes  | (c * 1)| plain   |
 expr1  | integer | yes  | (c * 2)| plain   |
 expr2  | integer | yes  | (c * 3)| plain   |
 expr3  | integer | yes  | (c * 4)| plain   |
 expr4  | integer | yes  | (c * 5)| plain   |
 expr5  | integer | yes  | (c * 6)| plain   |
 expr6  | integer | yes  | (c * 7)| plain   | 100
 expr7  | integer | yes  | (c * 8)| plain   |
 expr8  | integer | yes  | (c * 9)| plain   |
btree, for table "public.hoge"


As you can see above, the completed numbers 1, 2 and 3 are not expression columns,
so it might be better to remove them from the completion.
However, I didn't do that because a query that gets only the more suitable
attnums of the index becomes complicated.

Then, the patch includes a new query to get attribute_numbers, like this:


+#define Query_for_list_of_attribute_numbers \
+"SELECT attnum "\
+"  FROM pg_catalog.pg_attribute a, "\
+"   pg_catalog.pg_class c "\
+" WHERE c.oid = a.attrelid "\
+"   AND a.attnum > 0 "\
+"   AND NOT a.attisdropped "\
+"   /* %d %s */" \
+"   AND a.attrelid = (select oid from pg_catalog.pg_class where relname = '%s') 
"\
+"   AND pg_catalog.pg_table_is_visible(c.oid) "\
+"order by a.attnum asc "


I have a question.
I read the following comment for _complete_from_query(), but I'm not sure whether
the "%d" is needed in the above query. Any advice is welcome!


 * 1. A simple query which must contain a %d and a %s, which will be replaced
 * by the string length of the text and the text itself. The query may also
 * have up to four more %s in it; the first two such will be replaced by the
 * value of completion_info_charp, the next two by the value of
 * completion_info_charp2.
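
To make that rule concrete, here is a sketch (not part of the patch) of roughly
what the query above would expand to once _complete_from_query() has done its
substitutions; this assumes TAB is hit on an empty string after
"alter index ind_hoge alter column ", and that the index name is passed in
through completion_info_charp:

SELECT attnum
  FROM pg_catalog.pg_attribute a,
       pg_catalog.pg_class c
 WHERE c.oid = a.attrelid
   AND a.attnum > 0
   AND NOT a.attisdropped
   /* 0  */
   AND a.attrelid = (select oid from pg_catalog.pg_class where relname = 'ind_hoge')
   AND pg_catalog.pg_table_is_visible(c.oid)
 order by a.attnum asc

Since the mandatory %d and %s only appear inside a comment here, they do not
influence the result; they seem to be present just to satisfy the requirement
described in the quoted comment.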


Thanks,
Tatsuro Yamada


diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5ba6ffba8c..28f2e9a0f9 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -583,6 +583,18 @@ static const SchemaQuery Query_for_list_of_statistics = {
 "OR '\"' || relname || '\"'='%s') "\
 "   AND pg_catalog.pg_table_is_visible(c.oid)"
 
+#define Query_for_list_of_attribute_numbers \
+"SELECT attnum "\
+"  FROM pg_catalog.pg_attribute a, "\
+"   pg_catalog.pg_class c "\
+" WHERE c.oid = a.attrelid "\
+"   AND a.attnum > 0 "\
+"   AND NOT a.attisdropped "\
+"   /* %d %s */" \
+"   AND a.attrelid = (select oid from pg_catalog.pg_class where relname = '%s') "\
+"   AND pg_catalog.pg_table_is_visible(c.oid) "\
+"order by a.attnum asc "
+
 #define Query_for_list_of_attributes_with_schema \
 "SELECT 

Re: Change pgarch_readyXlog() to return .history files first

2018-12-20 Thread Kyotaro HORIGUCHI
At Fri, 21 Dec 2018 14:17:25 +0900, Michael Paquier  wrote 
in <20181221051724.gg1...@paquier.xyz>
> On Fri, Dec 21, 2018 at 01:49:18PM +0900, Kyotaro HORIGUCHI wrote:
> > FWIW it seems to me a bug that we create an inconsistent set of
> > files in the archive directory.
> 
> Okay, point taken!  FWIW, I have no actual objections in not
> back-patching that.

I maybe(?) know.

> > At Fri, 21 Dec 2018 12:19:21 +0900, Michael Paquier  
> > wrote in <20181221031921.ge1...@paquier.xyz>
> >> On Thu, Dec 20, 2018 at 01:57:30PM +0200, David Steele wrote:
> >> > Good point.  The new patch uses IsTLHistoryFileName().
> >> 
> >> OK, I have been reviewing the patch and the logic is correct, still I
> >> could not resist reducing the number of inner if's for readability.  I
> > 
> > +1 basically. But I think that tail(name, 6) != ".ready" can
> > happen with a certain frequency but strspn(name) < basenamelen
> > rarely in the normal case.
> 
> So that +0.5 if "basically" means a partial agreement? :p

Mmm. No, +0.9.

> > In the else if condition, ishistory must be false in the right
> > hand of ||. What we do here is ignoring non-history files once a
> > history file is found. (Just a logic condensation, and it would be
> > done by the compiler anyway, though)
> 
> Yes, this can be simplified.  So let's do so.
> 
> > The caller prepares sufficient memory for basename, and we no
> > longer copy ".ready" into newxlog. Wouldn't we work directly on

Sorry for silly typo, but the 'W' was 'C', I meant:p

> > xlog instead of allocating newxlog?
> 
> Okay, let's simplify that as you suggest.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Change pgarch_readyXlog() to return .history files first

2018-12-20 Thread Michael Paquier
On Fri, Dec 21, 2018 at 01:49:18PM +0900, Kyotaro HORIGUCHI wrote:
> FWIW it seems to me a bug that we create an inconsistent set of
> files in the archive directory.

Okay, point taken!  FWIW, I have no actual objections in not
back-patching that.

> At Fri, 21 Dec 2018 12:19:21 +0900, Michael Paquier  
> wrote in <20181221031921.ge1...@paquier.xyz>
>> On Thu, Dec 20, 2018 at 01:57:30PM +0200, David Steele wrote:
>> > Good point.  The new patch uses IsTLHistoryFileName().
>> 
>> OK, I have been reviewing the patch and the logic is correct, still I
>> could not resist reducing the number of inner if's for readability.  I
> 
> +1 basically. But I think that tail(name, 6) != ".ready" can
> happen with a certain frequency but strspn(name) < basenamelen
> rarely in the normal case.

So that +0.5 if "basically" means a partial agreement? :p

> In the else if condition, ishistory must be false in the right
> hand of ||. What we do here is ignoring non-history files once a
> history file is found. (Just a logic condensation, and it would be
> done by the compiler anyway, though)

Yes, this can be simplified.  So let's do so.

> The caller prepares sufficient memory for basename, and we no
> longer copy ".ready" into newxlog. Wouldn't we work directly on
> xlog instead of allocating newxlog?

Okay, let's simplify that as you suggest.
--
Michael




Re: Change pgarch_readyXlog() to return .history files first

2018-12-20 Thread Kyotaro HORIGUCHI
Hello.

FWIW it seems to me a bug that we create an inconsistent set of
files in the archive directory.

At Fri, 21 Dec 2018 12:19:21 +0900, Michael Paquier  wrote 
in <20181221031921.ge1...@paquier.xyz>
> On Thu, Dec 20, 2018 at 01:57:30PM +0200, David Steele wrote:
> > Good point.  The new patch uses IsTLHistoryFileName().
> 
> OK, I have been reviewing the patch and the logic is correct, still I
> could not resist reducing the number of inner if's for readability.  I

+1 basically. But I think that tail(name, 6) != ".ready" can
happen with a certain frequency but strspn(name) < basenamelen
rarely in the normal case.

> also did not like the high-jacking of rlde->d_name so instead let's use
> an intermediate variable to store the basename of a scanned entry.  The
> format of the if/elif with comments in-between was not really consistent
> with the common practice as well.  pg_indent has also been applied.
> 
> > Ah, I see.  Yes, that's exactly how I tested it, in addition to doing real
> > promotions.
> 
> OK, so am I doing.
> 
> Attached is an updated patch.  Does that look fine to you?  The base
> logic is unchanged, and after a promotion history files get archived
> before the last partial segment.

Renaming history to ishistory looks good.

if (!found || (ishistory && !historyFound))
{
  /* init */
  found = true;
  historyFound = ishistory;
}
else if (ishistory || (!ishistory && !historyFound))
 /* compare/replace */

In the else if condition, ishistory must be false in the right
hand of ||. What we do here is ignoring non-history files once a
history file is found. (Just a logic condensation, and it would be
done by the compiler anyway, though)

"else if (!historyFound || ishistory)"



>   strcpy(xlog, newxlog);

The caller prepares sufficient memory for basename, and we no
longer copy ".ready" into newxlog. Douldn't we work directly on
xlog instead of allocating newxlog?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Clean up some elog messages and comments for do_pg_stop_backup and do_pg_start_backup

2018-12-20 Thread Michael Paquier
Hi all,

Alvaro has recently cleaned up a couple of error messages so that they do
not include the function name in what gets translated, as per 68f6f2b7.

While looking in the code for similar patterns, I have been reminded
that pg_stop_backup() is included in some messages when waiting for
segments to be archived.  This has resulted in an exchange between Tom
and me here:
https://www.postgresql.org/message-id/e1gzg2w-0008lq...@gemulon.postgresql.org

The thing is that the current messages are actually misleading, because
for base backups taken via the replication protocol pg_stop_backup is
never called, which I think is confusing.  While looking around I have
also noticed that the top comments of do_pg_start_backup and
do_pg_stop_backup fail to mention that they are also used with BASE_BACKUP.

Attached is a patch to reduce the confusion and improve the related
comments and messages.

Thoughts?
--
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c80b14ed97..57b5a91f96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10163,9 +10163,10 @@ XLogFileNameP(TimeLineID tli, XLogSegNo segno)
 }
 
 /*
- * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()
- * function. It creates the necessary starting checkpoint and constructs the
- * backup label file.
+ * do_pg_start_backup
+ *
+ * Utility function called at the start of an online backup. It creates the
+ * necessary starting checkpoint and constructs the backup label file.
  *
  * There are two kind of backups: exclusive and non-exclusive. An exclusive
  * backup is started with pg_start_backup(), and there can be only one active
@@ -10705,8 +10706,10 @@ get_backup_status(void)
 }
 
 /*
- * do_pg_stop_backup is the workhorse of the user-visible pg_stop_backup()
- * function.
+ * do_pg_stop_backup
+ *
+ * Utility function called at the end of an online backup. It cleans up the
+ * backup state and can optionally wait for WAL segments to be archived.
  *
  * If labelfile is NULL, this stops an exclusive backup. Otherwise this stops
  * the non-exclusive backup specified by 'labelfile'.
@@ -11077,7 +11080,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 			if (!reported_waiting && waits > 5)
 			{
 ereport(NOTICE,
-		(errmsg("pg_stop_backup cleanup done, waiting for required WAL segments to be archived")));
+		(errmsg("waiting for required WAL segments to be archived")));
 reported_waiting = true;
 			}
 
@@ -11087,16 +11090,16 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 			{
 seconds_before_warning *= 2;	/* This wraps in >10 years... */
 ereport(WARNING,
-		(errmsg("pg_stop_backup still waiting for all required WAL segments to be archived (%d seconds elapsed)",
+		(errmsg("still waiting for all required WAL segments to be archived (%d seconds elapsed)",
 waits),
 		 errhint("Check that your archive_command is executing properly.  "
- "pg_stop_backup can be canceled safely, "
+ "Backup can be canceled safely, "
  "but the database backup will not be usable without all the WAL segments.")));
 			}
 		}
 
 		ereport(NOTICE,
-(errmsg("pg_stop_backup complete, all required WAL segments have been archived")));
+(errmsg("all required WAL segments have been archived")));
 	}
 	else if (waitforarchive)
 		ereport(NOTICE,




Re: Tid scan improvements

2018-12-20 Thread Tom Lane
Edmund Horner  writes:
> For the forward scan, I seem to recall, from your merge join example,
> that it's useful to set the pathkeys even when there are no
> query_pathkeys.  We just have to unconditionally set them so that the
> larger plan can make use of them.

No.  Look at indxpath.c: it does not worry about pathkeys unless
has_useful_pathkeys is true, and it definitely does not generate
pathkeys that don't get past truncate_useless_pathkeys.  Those
functions are responsible for worrying about whether mergejoin
can use the pathkeys.  It's not tidpath.c's job to outthink them.

regards, tom lane



Re: Improve selectivity estimate for range queries

2018-12-20 Thread Kyotaro HORIGUCHI
Hello.

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" 
 wrote in 
<008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> In my environment, the selectivity for id > 0 was 0.1,
> and the selectivity for id < 1 was 0.1. Then, the
> value of rqlist->hibound and rqlist->lobound are set respectively.
> On the other hand, DEFAULT_INEQ_SEL is 0..  As a result,
> the final selectivity is computed according to DEFAULT_RANGE_INEQ_SEL,
> since the condition of the second if statement was satisfied. However,
> these selectivities were computed according to the statistics, so the
> final selectivity had to be calculated from rqlist->hibound + rqlist->lobound 
> - 1.0.
> My test case might be uncommon, but I think it would cause some problems.

Yeah, it's known behavior, as the comment just above says. If, in your
example query, the table were very large and had an index, it
could wrongly run an index scan over a fair number of tuples. But
I'm not sure how much it matters, and your example doesn't show
that.

The behavior discussed here was introduced by this commit.

| commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
| Author: Tom Lane 
| Date:   Tue Nov 9 00:34:46 2004 +
| 
| Use a hopefully-more-reliable method of detecting default selectivity
| estimates when combining the estimates for a range query.  As pointed out
| by Miquel van Smoorenburg, the existing check for an impossible combined
| result would quite possibly fail to detect one default and one non-default
| input.  It seems better to use the default range query estimate in such
| cases.  To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
| This is a bit ugly because it introduces additional coupling between
| clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
| there wasn't plenty already...

> To handle such cases I've thought up of an idea based on a previous commit[1]
> which solved a similar problem related to DEFAULT_NUM_DISTINCT.  Just like
> this modification, I added a flag which shows whether the selectivity

The commit [1] added almost no complexity, but this one does. It affects
many functions and introduces tighter coupling between
operator-selectivity functions and clause-selectivity functions.
isdefault travels alone too far just to tie together remote
functions. We might need a more principled way to handle that.

> was calculated according to the statistics or not to clauselist_selectivity,
> and used it as a condition of the if statement instead of 
> rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> I am afraid that changes were more than I had expected, but we can estimate
> selectivity correctly.
> 
> Patch attached.  Do you have any thoughts?

I've just run over the patch quickly, but I have some comments.

+   if (isdefault)
+   rqelem->lobound_isdefault = true;

This takes the disjunction of the isdefault flags of every lobound ever added;
I suppose it should be the isdefault of the last lobound.

As you can see, four other instances of "selec ==/!= DEFAULT_*" exist
in mergejoinsel. Don't they lead to a similar kind of problem?


 Selectivity lobound;        /* Selectivity of a var > something clause */
 Selectivity hibound;        /* Selectivity of a var < something clause */
+bool        lobound_isdefault;  /* lobound is a default selectivity? */
+bool        hibound_isdefault;  /* hibound is a default selectivity? */
 } RangeQueryClause;

Around [1], there was a discussion about introducing the
notion of uncertainty to selectivity. isdefault is a kind of
uncertainty value indicating '0% or 100% uncertain'. So my feeling is
that it would be better for Selectivity to have a certainty component,
and then to build selectivity arithmetic involving uncertainty,
though I don't have a clear picture.


> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git=commitdiff=4c2777d0b733220d9029f78817af8ce

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Macros bundling RELKIND_* conditions

2018-12-20 Thread Kyotaro HORIGUCHI
Mmm. My mail on this topic seems to have been sent to nowhere...

At Fri, 21 Dec 2018 07:50:04 +0530, Ashutosh Bapat 
 wrote in 

> On Wed, Dec 19, 2018 at 11:37 PM Alvaro Herrera 
> wrote:
> 
> > On 2017-Aug-02, Tom Lane wrote:
> >
> > > I think Peter's got the error and the detail backwards.  It should be
> > > more like
> > >
> > > ERROR: "someview" cannot have constraints
> > > DETAIL: "someview" is a view.
> > >
> > > If we do it like that, we need one ERROR message per error reason,
> > > and one DETAIL per relkind, which should be manageable.
> >
> > I support this idea.  Here's a proof-of-concept patch that corresponds
> > to one of the cases that Ashutosh was on about (specifically, the one
> > that uses the RELKIND_CAN_HAVE_STORAGE macro I just added).  If there
> > are no objections to this approach, I'm going to complete it along these
> > lines.
> >
> > I put the new function at the bottom of heapam.c but I think it probably
> > needs a better place.
> >
> > BTW are there other opinions on the RELKIND_HAS_STORAGE vs.
> > RELKIND_CAN_HAVE_STORAGE debate?  I'm inclined to change it to the
> > former.
> >
> 
> +1 I liked the idea.

+1, as I posted in another thread [1].

[1] 
https://www.postgresql.org/message-id/20181218.145600.172055615.horiguchi.kyot...@lab.ntt.co.jp

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Tid scan improvements

2018-12-20 Thread Edmund Horner
On Fri, 21 Dec 2018 at 13:25, David Rowley  wrote:
> On Fri, 21 Dec 2018 at 13:09, Edmund Horner  wrote:
> > On Fri, 21 Dec 2018 at 11:21, Tom Lane  wrote:
> > > I'm having a hard time wrapping my mind around why you'd bother with
> > > backwards TID scans.  The amount of code needed versus the amount of
> > > usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> > > see that there might be value in knowing that a regular scan has
> > > "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> > > on TID without an explicit sort).  It does not, however, follow that
> > > there's any additional value in supporting the DESC case.
> >
> > I have occasionally found myself running "SELECT MAX(ctid) FROM t"
> > when I was curious about why a table is so big after vacuuming.
> >
> > Perhaps that's not a common enough use case to justify the amount of
> > code, especially the changes to heapam.c and explain.c.
> >
> > We'd still need the pathkeys to make good use of forward scans.  (And
> > I think the executor still needs to support seeking backward for
> > cursors.)
>
> I think the best thing to do here is separate out all the additional
> backwards scan code into a separate patch to allow it to be more easily
> considered and approved, or rejected. I think if there's any hint of
> this blocking the main patch then it should be a separate patch to
> allow its worth to be considered independently.

Yeah I think you're right.  I'll separate those parts into the basic
forward scan, and then the optional backward scan support.  I think
we'll still only generate a backward scan if the query_pathkeys makes
use of it.

For the forward scan, I seem to recall, from your merge join example,
that it's useful to set the pathkeys even when there are no
query_pathkeys.  We just have to unconditionally set them so that the
larger plan can make use of them.

> Also, my primary interest in this patch is to find tuples that are
> stopping the heap being truncated during a vacuum. Generally, when I'm
> looking for that I have a good idea of what size I expect the relation
> should be, (otherwise I'd not think it was bloated), in which case I'd
> be doing WHERE ctid >= '(N,1)'. However, it might be easier to write
> some auto-bloat-removal script if we could have an ORDER BY ctid DESC
> LIMIT n.



Re: Change pgarch_readyXlog() to return .history files first

2018-12-20 Thread Michael Paquier
On Thu, Dec 20, 2018 at 01:57:30PM +0200, David Steele wrote:
> Good point.  The new patch uses IsTLHistoryFileName().

OK, I have been reviewing the patch and the logic is correct, still I
could not resist reducing the number of inner if's for readability.  I
also did not like the high-jacking of rlde->d_name so instead let's use
an intermediate variable to store the basename of a scanned entry.  The
format of the if/elif with comments in-between was not really consistent
with the common practice as well.  pg_indent has also been applied.

> Ah, I see.  Yes, that's exactly how I tested it, in addition to doing real
> promotions.

OK, so am I doing.

Attached is an updated patch.  Does that look fine to you?  The base
logic is unchanged, and after a promotion history files get archived
before the last partial segment.
--
Michael
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e88d545d65..bbd6311b35 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -695,11 +695,12 @@ pgarch_archiveXlog(char *xlog)
  * 2) because the oldest ones will sooner become candidates for
  * recycling at time of checkpoint
  *
- * NOTE: the "oldest" comparison will presently consider all segments of
- * a timeline with a smaller ID to be older than all segments of a timeline
- * with a larger ID; the net result being that past timelines are given
- * higher priority for archiving.  This seems okay, or at least not
- * obviously worth changing.
+ * NOTE: the "oldest" comparison will consider any .history file to be older
+ * than any other file except another .history file.  Segments on a timeline
+ * with a smaller ID will be older than all segments on a timeline with a
+ * larger ID; the net result being that past timelines are given higher
+ * priority for archiving.  This seems okay, or at least not obviously worth
+ * changing.
  */
 static bool
 pgarch_readyXlog(char *xlog)
@@ -715,6 +716,7 @@ pgarch_readyXlog(char *xlog)
 	DIR		   *rldir;
 	struct dirent *rlde;
 	bool		found = false;
+	bool		historyFound = false;
 
 	snprintf(XLogArchiveStatusDir, MAXPGPATH, XLOGDIR "/archive_status");
 	rldir = AllocateDir(XLogArchiveStatusDir);
@@ -722,32 +724,54 @@ pgarch_readyXlog(char *xlog)
 	while ((rlde = ReadDir(rldir, XLogArchiveStatusDir)) != NULL)
 	{
 		int			basenamelen = (int) strlen(rlde->d_name) - 6;
+		char		basename[MAX_XFN_CHARS + 1];
+		bool		ishistory;
 
-		if (basenamelen >= MIN_XFN_CHARS &&
-			basenamelen <= MAX_XFN_CHARS &&
-			strspn(rlde->d_name, VALID_XFN_CHARS) >= basenamelen &&
-			strcmp(rlde->d_name + basenamelen, ".ready") == 0)
+		/* Ignore entries with unexpected number of characters */
+		if (basenamelen < MIN_XFN_CHARS ||
+			basenamelen > MAX_XFN_CHARS)
+			continue;
+
+		/* Ignore entries with unexpected characters */
+		if (strspn(rlde->d_name, VALID_XFN_CHARS) < basenamelen)
+			continue;
+
+		/* Ignore anything not suffixed with .ready */
+		if (strcmp(rlde->d_name + basenamelen, ".ready") != 0)
+			continue;
+
+		/* Truncate off the .ready */
+		memcpy(basename, rlde->d_name, basenamelen);
+		basename[basenamelen] = '\0';
+
+		/* Is this a history file? */
+		ishistory = IsTLHistoryFileName(basename);
+
+		/*
+		 * Consume the file to archive.  History files have the highest
+		 * priority.  If this is the first file or the first history file
+		 * ever, copy it.  In the presence of a history file already chosen as
+		 * target, ignore all other files except history files which have been
+		 * generated for an older timeline than what is already chosen as
+		 * target to archive.
+		 */
+		if (!found || (ishistory && !historyFound))
 		{
-			if (!found)
-			{
-strcpy(newxlog, rlde->d_name);
-found = true;
-			}
-			else
-			{
-if (strcmp(rlde->d_name, newxlog) < 0)
-	strcpy(newxlog, rlde->d_name);
-			}
+			strcpy(newxlog, basename);
+			found = true;
+			historyFound = ishistory;
+		}
+		else if (ishistory || (!ishistory && !historyFound))
+		{
+			if (strcmp(basename, newxlog) < 0)
+strcpy(newxlog, basename);
 		}
 	}
 	FreeDir(rldir);
 
 	if (found)
-	{
-		/* truncate off the .ready */
-		newxlog[strlen(newxlog) - 6] = '\0';
 		strcpy(xlog, newxlog);
-	}
+
 	return found;
 }
 




Re: [HACKERS] Macros bundling RELKIND_* conditions

2018-12-20 Thread Ashutosh Bapat
On Wed, Dec 19, 2018 at 11:37 PM Alvaro Herrera 
wrote:

> On 2017-Aug-02, Tom Lane wrote:
>
> > I think Peter's got the error and the detail backwards.  It should be
> > more like
> >
> > ERROR: "someview" cannot have constraints
> > DETAIL: "someview" is a view.
> >
> > If we do it like that, we need one ERROR message per error reason,
> > and one DETAIL per relkind, which should be manageable.
>
> I support this idea.  Here's a proof-of-concept patch that corresponds
> to one of the cases that Ashutosh was on about (specifically, the one
> that uses the RELKIND_CAN_HAVE_STORAGE macro I just added).  If there
> are no objections to this approach, I'm going to complete it along these
> lines.
>
> I put the new function at the bottom of heapam.c but I think it probably
> needs a better place.
>
> BTW are there other opinions on the RELKIND_HAS_STORAGE vs.
> RELKIND_CAN_HAVE_STORAGE debate?  I'm inclined to change it to the
> former.
>

+1 I liked the idea.

--
Best Wishes,
Ashutosh Bapat


RE: [Proposal] Add accumulated statistics

2018-12-20 Thread Yotsunaga, Naoki
On Wed, Nov 21, 2018 at 9:27 PM, Bruce Momjian wrote:

Hi, thank you for the information.
I understand that sampling is effective for investigating wait events.

By the way, you can see the number of wait events with "LWLOCK_STATS", right?
Was that facility implemented because knowing the number of wait events is
necessary for investigation?
If so, isn't the number of wait events useful information in itself?
Right now I need to rebuild the server to get this information, which is
inconvenient.

So, how about exposing the number of wait events in a view?
I also think it would be useful to know the accumulated wait time.
It is easier to investigate when it is clear how much of the processing time
is taken up by waiting.
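
(For reference, the sampling approach mentioned above can be approximated
with plain SQL against pg_stat_activity; a rough sketch, to be run repeatedly
so that counts accumulate across samples:)

SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE wait_event IS NOT NULL
 GROUP BY wait_event_type, wait_event
 ORDER BY count(*) DESC;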

-- 
Naoki Yotsunaga




Re: A few new options for vacuumdb

2018-12-20 Thread Michael Paquier
On Thu, Dec 20, 2018 at 04:48:11PM +, Bossart, Nathan wrote:
> The --skip-locked option in vacuumdb is part of 0002, so I don't think
> there's much precedent here.

It looks like I was not looking at the master branch here ;)

> We do currently fall back to the
> unparenthesized syntax for VACUUM for versions before 9.0, so there is
> some version handling already.  What do you think about removing
> version checks for unsupported major versions?

I am not exactly sure down to which version this needs to be supported
and what's the policy about that (if others have an opinion to share,
please feel free) but surely it does not hurt to keep those code paths
either from a maintenance point of view.

> If we went that route, these new patches could add version checks only
> for options that were added in a supported major version (e.g.
> mxid_age() was added in 9.5). Either way, we'll still have to decide
> whether to fail or to silently skip the option if you do something
> like specify --min-mxid-age for a 9.4 server.

There are downsides and upsides to each approach.  Reporting a failure
makes it clear, with an explicit error message, that an option is not
available, however it complicates the error handling for parallel runs
a bit.  And vacuum_one_database should be the code path doing that check,
as that's where all the connections are taken even when all databases are
processed.

Ignoring the option, as your patch set does, has the disadvantage of
potentially confusing users; say, we may see complaints like "why is VACUUM
locking even though I specified --skip-locked?".  Still, this keeps the code
minimalistic and simple.

Just passing down the options and letting the generated query fail is
also, I guess, a confusing experience for not-so-advanced users, even if
it may simplify the error handling.

Issuing a proper error feels more natural to me, as users can
react to it properly.  Opinions from others are welcome.
--
Michael




Re: lock level for DETACH PARTITION looks sketchy

2018-12-20 Thread Amit Langote
On 2018/12/21 6:07, Alvaro Herrera wrote:
> On 2018-Dec-20, Robert Haas wrote:
> 
>> On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera  
>> wrote:
>>> I think what prompted the lock to be AccessShareLock for the child rel
>>> in the first place is the fact that ATExecDropInherit() (ALTER TABLE NO
>>> INHERIT) uses AccessShare for the *parent* relation.
>>
>> Seems like apples and oranges, 
> 
> Yes, but I can understand someone looking at ATExecDropInherit while
> writing ATExecDetachRelation and thinking "ah, I have to grab
> AccessShareLock on the other relation" without stopping to think in what
> direction the parenthood between the rels goes.

That was definitely wrong.  A partition's schema changes when it's detached,
whereas a (regular) inheritance parent's doesn't when one of its children
is removed.

>> and also maybe not that safe.
> 
> I think it's strange, but I'm not interested in analyzing that at this
> time.  Its comment does say that DROP TABLE (of the child, I surmise) does
> not acquire *any* lock on the parent, which is also strange.

Per the comments in ATExecDropInherit, the reason we lock the parent with
AccessShareLock in the DROP INHERIT case is that RemoveInheritance
inspects the parent's schema to see which of the child's columns to mark as
having one less parent.  DROP TABLE of the child doesn't need to do that,
for reasons you can obviously imagine.

Thank you both for working on this.

Thanks,
Amit




Re: Tid scan improvements

2018-12-20 Thread David Rowley
On Fri, 21 Dec 2018 at 13:09, Edmund Horner  wrote:
> On Fri, 21 Dec 2018 at 11:21, Tom Lane  wrote:
> > I'm having a hard time wrapping my mind around why you'd bother with
> > backwards TID scans.  The amount of code needed versus the amount of
> > usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> > see that there might be value in knowing that a regular scan has
> > "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> > on TID without an explicit sort).  It does not, however, follow that
> > there's any additional value in supporting the DESC case.
>
> I have occasionally found myself running "SELECT MAX(ctid) FROM t"
> when I was curious about why a table is so big after vacuuming.
>
> Perhaps that's not a common enough use case to justify the amount of
> code, especially the changes to heapam.c and explain.c.
>
> We'd still need the pathkeys to make good use of forward scans.  (And
> I think the executor still needs to support seeking backward for
> cursors.)

I think the best thing to do here is separate out all the additional
backwards scan code into a separate patch to allow it to be more easily
considered and approved, or rejected. I think if there's any hint of
this blocking the main patch then it should be a separate patch to
allow its worth to be considered independently.

Also, my primary interest in this patch is to find tuples that are
stopping the heap being truncated during a vacuum. Generally, when I'm
looking for that I have a good idea of what size I expect the relation
should be, (otherwise I'd not think it was bloated), in which case I'd
be doing WHERE ctid >= '(N,1)'. However, it might be easier to write
some auto-bloat-removal script if we could have an ORDER BY ctid DESC
LIMIT n.
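
(For illustration, a rough sketch of what the core of such a script could
look like; the table name, updated column and LIMIT are made up, and whether
the new row versions actually land on earlier pages still depends on free
space and on the update not being HOT:)

UPDATE t SET some_col = some_col
 WHERE ctid = ANY (ARRAY(SELECT ctid FROM t ORDER BY ctid DESC LIMIT 100));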

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Remove Deprecated Exclusive Backup Mode

2018-12-20 Thread Michael Paquier
On Thu, Dec 20, 2018 at 12:29:48PM +0200, David Steele wrote:
> Cannot move patch to the same commitfest, and multiple future commitfests
> exist!

I am not sure what it means either.  What if you just mark the existing
entry as returned with feedback, and create a new one ahead?
--
Michael




Re: gist microvacuum doesn't appear to care about hot standby?

2018-12-20 Thread Alexander Korotkov
On Thu, Dec 20, 2018 at 1:41 AM Alexander Korotkov
 wrote:
> Please, find attached two patches I'm going to commit: for master and
> for backpatching.

So, pushed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Tid scan improvements

2018-12-20 Thread Andres Freund
Hi,

On 2018-12-20 18:06:41 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
> >> I'm having a hard time wrapping my mind around why you'd bother with
> >> backwards TID scans.
> 
> > I've not followed this thread, but wouldn't that be quite useful to be
> > able to move old tuples to free space earlier in the table?
> > I've written multiple scripts that update the later pages in a table, to
> > force reuse of earlier free pages (in my case by generating ctid = ANY()
> > style queries with all possible tids for the last few pages, the most
> > efficient way I could think of).
> 
> Sure, but wouldn't you now write those using something on the order of
> 
>   WHERE ctid >= '(cutoff_page_here, 1)'
> 
> ?  I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n"
> because you wouldn't know what value of n to use to get all the
> tuples on some-number-of-ending-pages.

I think you'd want both, to make sure there are not more tuples than
estimated, with the limit calculated to ensure there's enough free space
for them to actually fit.

Greetings,

Andres Freund



Re: Tid scan improvements

2018-12-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
>> I'm having a hard time wrapping my mind around why you'd bother with
>> backwards TID scans.

> I've not followed this thread, but wouldn't that be quite useful to be
> able to move old tuples to free space earlier in the table?
> I've written multiple scripts that update the later pages in a table, to
> force reuse of earlier free pages (in my case by generating ctid = ANY()
> style queries with all possible tids for the last few pages, the most
> efficient way I could think of).

Sure, but wouldn't you now write those using something on the order of

  WHERE ctid >= '(cutoff_page_here, 1)'

?  I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n"
because you wouldn't know what value of n to use to get all the
tuples on some-number-of-ending-pages.

regards, tom lane



Re: Why are we PageInit'ing buffers in RelationAddExtraBlocks()?

2018-12-20 Thread Andres Freund
Hi,

On 2018-12-19 16:56:36 -0800, Andres Freund wrote:
> On 2018-12-19 19:39:33 -0500, Tom Lane wrote:
> > Robert Haas  writes:
> > > On Wed, Dec 19, 2018 at 5:37 PM Andres Freund  wrote:
> > >> What's gained by the logic of emitting that warning in VACUUM after a
> > >> crash? I don't really see any robustness advantages in it.
> > 
> > > I don't know.  I am just normally reluctant to change things
> > > precipitously that are of long tenure.
> > 
> > Me too, but I think Andres has a point here.  Those warnings in VACUUM
> > are ancient, probably predating the introduction of WAL :-(.  At the
> > time there was good reason to be suspicious of zeroed pages in tables.
> > Now, though, we have (what we think is) a bulletproof crash recovery
> > procedure in which possibly-zeroed pages are to be expected; so we're
> > just causing users unnecessary alarm by warning about them.  I think
> > it's reasonable to, if not remove the messages entirely, at least
> > downgrade them to a low DEBUG level.
> 
> Yea, I think that'd be reasonable.
> 
> I think we ought to add them, as new/zero pages, to the FSM. If we did
> so both during vacuum and RelationAddExtraBlocks() we'd avoid the
> redundant writes, and avoid depending on heap layout in
> RelationAddExtraBlocks().
> 
> I wonder if it'd make sense to only log a DEBUG if there's no
> corresponding FSM entry. That'd obviously not be bulletproof, but it'd
> reduce the spamminess considerably.

Here's a prototype patch implementing this change.  I'm not sure the new
DEBUG message is really worth it?


Looking at the surrounding code made me wonder about the wisdom of
entering empty pages as all-visible and all-frozen into the VM. That'll
mean we'll never re-discover them on a primary, after promotion. There's
no mechanism to add such pages to the FSM on a standby (in contrast to
e.g. pages where tuples are modified), as vacuum will never visit that
page again.  Now obviously it has the advantage of avoiding
re-processing the page in the next vacuum, but is that really an
important goal? If there are frequent vacuums, there's got to be a fair
amount of modifications, which ought to lead to re-use of such pages at
a not-too-far-away point?

Greetings,

Andres Freund
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index b8b5871559b..aa0bc4d8f9c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -216,18 +216,15 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
  BufferGetBlockNumber(buffer),
  RelationGetRelationName(relation));
 
-		PageInit(page, BufferGetPageSize(buffer), 0);
-
 		/*
-		 * We mark all the new buffers dirty, but do nothing to write them
-		 * out; they'll probably get used soon, and even if they are not, a
-		 * crash will leave an okay all-zeroes page on disk.
+		 * Add the page to the FSM without initializing. If we were to
+		 * initialize here the page would potentially get flushed out to disk
+		 * before we add any useful content. There's no guarantee that that'd
+		 * happen however, so we need to deal with unitialized pages anyway,
+		 * thus let's avoid the potential for unnecessary writes.
 		 */
-		MarkBufferDirty(buffer);
-
-		/* we'll need this info below */
 		blockNum = BufferGetBlockNumber(buffer);
-		freespace = PageGetHeapFreeSpace(page);
+		freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
 
 		UnlockReleaseBuffer(buffer);
 
@@ -479,6 +476,18 @@ loop:
 		 * we're done.
 		 */
 		page = BufferGetPage(buffer);
+
+		/*
+		 * Initialize page, it'll be used soon.  We could avoid dirtying the
+		 * buffer here, and rely on the caller to do so whenever it puts a
+		 * tuple onto the page, but there seems not much benefit in doing so.
+		 */
+		if (PageIsNew(page))
+		{
+			PageInit(page, BufferGetPageSize(buffer), 0);
+			MarkBufferDirty(buffer);
+		}
+
 		pageFreeSpace = PageGetHeapFreeSpace(page);
 		if (len + saveFreeSpace <= pageFreeSpace)
 		{
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8134c52253e..2d8d3569372 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -862,42 +862,42 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 		if (PageIsNew(page))
 		{
 			/*
-			 * An all-zeroes page could be left over if a backend extends the
-			 * relation but crashes before initializing the page. Reclaim such
-			 * pages for use.
+			 * All-zeroes pages can be left over if either a backend extends
+			 * the relation by a single page, but crashes before initializing
+			 * the page, or when bulk-extending the relation (which creates a
+			 * number of empty pages at the tail end of the relation, but
+			 * enters them into the FSM).
 			 *
-			 * We have to be careful here because we could be looking at a
-			 * page that someone has just added to the relation and not yet
-			 * been able to initialize (see RelationGetBufferForTuple). To
-			 * 

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2018-12-20 Thread Tom Lane
John Naylor  writes:
> On 12/18/18, Tom Lane  wrote:
>> I'd be kind of inclined to convert all uses of ScanKeyword to the new way,
>> if only for consistency's sake.  On the other hand, I'm not the one
>> volunteering to do the work.

> That's reasonable, as long as the design is nailed down first. Along
> those lines, attached is a heavily WIP patch that only touches plpgsql
> unreserved keywords, to test out the new methodology in a limited
> area. After settling APIs and name/directory bikeshedding, I'll move
> on to the other four keyword types.

Let the bikeshedding begin ...

> There's a new Perl script, src/common/gen_keywords.pl,

I'd be inclined to put the script in src/tools, I think.  IMO src/common
is for code that actually gets built into our executables.

> which takes
> pl_unreserved_kwlist.h as input and outputs
> pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h.

I wonder whether we'd not be better off producing just one output
file, in which we have the offsets emitted as PG_KEYWORD macros
and then the giant string emitted as a macro definition, ie
something like

#define PG_KEYWORD_STRING \
"absolute\0" \
"alias\0" \
...

That simplifies the Makefile-hacking, at least, and it possibly gives
callers more flexibility about what they actually want to do with the
string.

> The
> output headers are not installed or symlinked anywhere. Since the
> input keyword lists will never be #included directly, they might be
> better as .txt files, like errcodes.txt. If we went that far, we might
> also remove the PG_KEYWORD macros (they'd still be in the output
> files) and rename parser/kwlist.h to common/core_kwlist.txt. There's
> also a case for not changing things unnecessarily, especially if
> there's ever a new reason to include the base keyword list directly.

I'm for "not change things unnecessarily".  People might well be
scraping the keyword list out of parser/kwlist.h for other purposes
right now --- indeed, it's defined the way it is exactly to let
people do that.  I don't see a good reason to force them to redo
whatever tooling they have that depends on that.  So let's build
kwlist_offsets.h alongside that, but not change kwlist.h itself.

> To keep the other keyword types functional, I had to add a separate
> new struct ScanKeywordOffset and new function
> ScanKeywordLookupOffset(), so the patch is a bit messier than the
> final will be.

Check.

> I used the global .gitignore, but maybe that's an abuse of it.

Yeah, I'd say it is.

> +# TODO: Error out if the keyword names are not in ASCII order.

+many for including such a check.

Also note that we don't require people to have Perl installed when
building from a tarball.  Therefore, these derived headers must get
built during "make distprep" and removed by maintainer-clean but
not distclean.  I think this also has some implications for VPATH
builds, but as long as you follow the pattern used for other
derived header files (e.g. fmgroids.h), you should be fine.

regards, tom lane



Re: Tid scan improvements

2018-12-20 Thread Andres Freund
Hi,

On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
> Edmund Horner  writes:
> > [ tid scan patches ]
> 
> I'm having a hard time wrapping my mind around why you'd bother with
> backwards TID scans.  The amount of code needed versus the amount of
> usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> see that there might be value in knowing that a regular scan has
> "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> on TID without an explicit sort).  It does not, however, follow that
> there's any additional value in supporting the DESC case.

I've not followed this thread, but wouldn't that be quite useful to be
able to move old tuples to free space earlier in the table?

I've written multiple scripts that update the later pages in a table, to
force reuse of earlier free pages (in my case by generating ctid = ANY()
style queries with all possible tids for the last few pages, the most
efficient way I could think of).
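
(A rough sketch of that ctid = ANY() approach; the table name, updated column,
page range and assumed maximum of 200 tuples per page are all made up:)

UPDATE t SET some_col = some_col
 WHERE ctid = ANY (ARRAY(
       SELECT format('(%s,%s)', blk, off)::tid
         FROM generate_series(5000, 5099) AS blk,
              generate_series(1, 200) AS off));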

Greetings,

Andres Freund



Re: Tid scan improvements

2018-12-20 Thread Tom Lane
Edmund Horner  writes:
> [ tid scan patches ]

I'm having a hard time wrapping my mind around why you'd bother with
backwards TID scans.  The amount of code needed versus the amount of
usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
see that there might be value in knowing that a regular scan has
"ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
on TID without an explicit sort).  It does not, however, follow that
there's any additional value in supporting the DESC case.

regards, tom lane



monitoring CREATE INDEX [CONCURRENTLY]

2018-12-20 Thread Alvaro Herrera
Monitoring progress of CREATE INDEX [CONCURRENTLY] is sure to be welcome,
so here's a proposal.

There are three distinct interesting cases.  One is straight CREATE
INDEX of a standalone table; then we have CREATE INDEX CONCURRENTLY;
finally, CREATE INDEX on a partitioned table.  Note that there's no
CONCURRENTLY for partitioned tables.

A non-concurrent build is very straightforward: we call create_index,
which does index_build, done.  See below for how to report for
index_build, which is the interesting part.  I propose not to report
anything other than that for a non-concurrent build.  There's some
preparatory work that's identical to that for CIC (see below).  Like
VACUUM, it seems a bit pointless to report an initial phase that's
almost immediate, so I propose we just don't report anything until the
actual index building starts.

CREATE INDEX CONCURRENTLY does these things first, which we would not
report (this is just like VACUUM, which only starts reporting once it
starts scanning blocks):

a. lock rel.  No metrics to report.
b. other prep; includes lots of catalog access.  Unlikely to lock, but
   not impossible.  No metrics to report.
c. create_index.  CIC skips index_build here, so there's no reason to
   report it either.

We would start reporting at this point, with these phases:

1. WaitForLockers 1.  Report how many xacts do we need to wait for, how
   many are done.
2. index_build.  See below.
3. WaitForLockers 2.  Report how many xacts do we need to wait for, how
   many are done.
4. validate_index.  Scans the whole rel again.  Report number of blocks
   scanned.
5. wait for virtual XIDs.  Like WaitForLockers: report how many xacts we
   need to wait for, how many are done.

We're done.

(Alternatively, we could have an initial "prep" phase for a/b/c for the
concurrent case and a/b for non-concurrent.  I'm just not sure it's
useful.)

index_build
---

The actual index building is an AM-specific undertaking, and we report
its progress separately from the AM-agnostic code.  That is, each AM has
freedom to define its own list of phases and counters, separate from the
generic code.  This avoids the need to provide a new AM method or invoke
callbacks.  So when you see that CREATE_INDEX_PHASE is either "index
build" you'll have a separate BTREE_CREATE_PHASE value set to either
"scanning heap" or "sorting" or "building upper layers"; equivalently
for other AMs.

Partitioned indexes
---

For partitioned indexes, we only have the index build phase, but we
repeat it for each partition.  In addition to the index_build metrics
described above, we should report how many partitions we need to handle
in total and how many partitions are already done.  (I'm avoiding
getting in the trouble of reporting *which* partition we're currently
handling and have already handled.)
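
(For illustration only, polling such progress from another session could
look something like the query below; the view and column names are
invented placeholders, not a settled interface:

SELECT pid, relid::regclass, phase,
       blocks_done, blocks_total,
       partitions_done, partitions_total
  FROM pg_stat_progress_create_index;

The AM-specific phases discussed above would be exposed alongside these
generic columns.)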

Thoughts?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-12-20 Thread Robert Haas
On Thu, Dec 20, 2018 at 4:11 PM Alvaro Herrera  wrote:
> Oh, so maybe this case is already handled by plan invalidation -- I
> mean, if we run DDL, the stored plan is thrown away and a new one
> recomputed.  IOW this was already a solved problem and I didn't need to
> spend effort on it. /me slaps own forehead

I'm kinda saying the opposite - I'm not sure that it's safe even with
the higher lock levels.  If the plan is relying on the same partition
descriptor being in effect at plan time as at execution time, that
sounds kinda dangerous to me.

Lowering the lock level might also make something that was previously
safe into something unsafe, because now there's no longer a guarantee
that invalidation messages are received soon enough. With
AccessExclusiveLock, we'll send invalidation messages before releasing
the lock, and other processes will acquire the lock and then
AcceptInvalidationMessages().  But with ShareUpdateExclusiveLock the
locks can coexist, so now there might be trouble.  I think this is an
area where we need to do some more investigation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-12-20 Thread Alvaro Herrera
On 2018-Dec-20, Robert Haas wrote:

> I didn't handle that.  If partition pruning relies on nothing changing
> between planning and execution, isn't that broken regardless of any of
> this?  It's true that with the simple query protocol we'll hold locks
> continuously from planning into execution, and therefore with the
> current locking regime we couldn't really have a problem.  But unless
> I'm confused, with the extended query protocol it's quite possible to
> generate a plan, release locks, and then reacquire locks at execution
> time.  Unless we have some guarantee that a new plan will always be
> generated if any DDL has happened in the middle, I think we've got
> trouble, and I don't think that is guaranteed in all cases.

Oh, so maybe this case is already handled by plan invalidation -- I
mean, if we run DDL, the stored plan is thrown away and a new one
recomputed.  IOW this was already a solved problem and I didn't need to
spend effort on it. /me slaps own forehead

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: lock level for DETACH PARTITION looks sketchy

2018-12-20 Thread Alvaro Herrera
On 2018-Dec-20, Robert Haas wrote:

> On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera  
> wrote:
> > I think what prompted the lock to be AccessShareLock for the child rel
> > in the first place is the fact that ATExecDropInherit() (ALTER TABLE NO
> > INHERIT) uses AccessShare for the *parent* relation.
> 
> Seems like apples and oranges, 

Yes, but I can understand someone looking at ATExecDropInherit while
writing ATExecDetachRelation and thinking "ah, I have to grab
AccessShareLock on the other relation" without stopping to think about
which direction the parenthood between the rels goes.

> and also maybe not that safe.

I think it's strange, but I'm not interested in analyzing that at this
time.  Its comment does say that DROP TABLE (of the child, I surmise) does
not acquire *any* lock on the parent, which is also strange.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-12-20 Thread Robert Haas
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera  wrote:
> > Introduce the concept of a partition directory.
> >
> > Teach the optimizer and executor to use it, so that a single planning
> > cycle or query execution gets the same PartitionDesc for the same table
> > every time it looks it up.  This does not prevent changes between
> > planning and execution, nor does it guarantee that all tables are
> > expanded according to the same snapshot.
>
> Namely: how does this handle the case of partition pruning structure
> being passed from planner to executor, if an attach happens in the
> middle of it and puts a partition in between existing partitions?  Array
> indexes of any partitions that appear later in the partition descriptor
> will change.
>
> This is the reason I used the query snapshot rather than EState.

I didn't handle that.  If partition pruning relies on nothing changing
between planning and execution, isn't that broken regardless of any of
this?  It's true that with the simple query protocol we'll hold locks
continuously from planning into execution, and therefore with the
current locking regime we couldn't really have a problem.  But unless
I'm confused, with the extended query protocol it's quite possible to
generate a plan, release locks, and then reacquire locks at execution
time.  Unless we have some guarantee that a new plan will always be
generated if any DDL has happened in the middle, I think we've got
trouble, and I don't think that is guaranteed in all cases.

Maybe I'm all wet, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lock level for DETACH PARTITION looks sketchy

2018-12-20 Thread Robert Haas
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera  wrote:
> I think what prompted the lock to be AccessShareLock for the child rel
> in the first place is the fact that ATExecDropInherit() (ALTER TABLE NO
> INHERIT) uses AccessShare for the *parent* relation.

Seems like apples and oranges, and also maybe not that safe.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-12-20 Thread Alvaro Herrera
Thanks for this work!  I like the name "partition directory".

On 2018-Dec-20, Robert Haas wrote:

> 0002 introduces the concept of a partition directory.  The idea is
> that the planner will create a partition directory, and so will the
> executor, and all calls which occur in those places to
> RelationGetPartitionDesc() will instead call
> PartitionDirectoryLookup().  Every lookup for the same relation in the
> same partition directory is guaranteed to produce the same answer.  I
> believe this patch still has a number of weaknesses.  More on that
> below.

The commit message for this one also points out another potential
problem,

> Introduce the concept of a partition directory.
>
> Teach the optimizer and executor to use it, so that a single planning
> cycle or query execution gets the same PartitionDesc for the same table
> every time it looks it up.  This does not prevent changes between
> planning and execution, nor does it guarantee that all tables are
> expanded according to the same snapshot.

Namely: how does this handle the case of partition pruning structure
being passed from planner to executor, if an attach happens in the
middle of it and puts a partition in between existing partitions?  Array
indexes of any partitions that appear later in the partition descriptor
will change.

This is the reason I used the query snapshot rather than EState.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: lock level for DETACH PARTITION looks sketchy

2018-12-20 Thread Alvaro Herrera
I think what prompted the lock to be AccessShareLock for the child rel
in the first place is the fact that ATExecDropInherit() (ALTER TABLE NO
INHERIT) uses AccessShare for the *parent* relation.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add timeline to partial WAL segments

2018-12-20 Thread Robert Haas
On Fri, Dec 14, 2018 at 6:05 PM David Steele  wrote:
> The question in my mind: is it safe to back-patch?

I cannot imagine it being a good idea to stick a behavioral change
like this into a minor release.  Yeah, it lets people get out from
under this problem a lot sooner, but it potentially breaks backup
scripts even for people who were not suffering in the first place.  I
don't think solving this problem for the people who have it is worth
inflicting that kind of breakage on everybody.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-12-20 Thread Robert Haas
On Tue, Dec 18, 2018 at 8:04 PM Michael Paquier  wrote:
> On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote:
> > OK.  I'll post what I have by the end of the week.
>
> Thanks, Robert.

OK, so I got slightly delayed here by utterly destroying my laptop,
but I've mostly reconstructed what I did.  I think there are some
remaining problems, but this seems like a good time to share what I've
got so far.  I'm attaching three patches.

0001 is one which I posted before.  It attempts to fix up
RelationBuildPartitionDesc() so that this function will always return
a partition descriptor based on a consistent snapshot of the catalogs.
Without this, I think there's nothing to prevent a commit which
happens while the function is running from causing the function to
fail or produce nonsense answers.

0002 introduces the concept of a partition directory.  The idea is
that the planner will create a partition directory, and so will the
executor, and all calls which occur in those places to
RelationGetPartitionDesc() will instead call
PartitionDirectoryLookup().  Every lookup for the same relation in the
same partition directory is guaranteed to produce the same answer.  I
believe this patch still has a number of weaknesses.  More on that
below.

0003 actually lowers the lock level.  The comment here might need some
more work.

Here is a list of possible or definite problems that are known to me:

- I think we need a way to make partition directory lookups consistent
across backends in the case of parallel query.  I believe this can be
done with a dshash and some serialization and deserialization logic,
but I haven't attempted that yet.

- I refactored expand_inherited_rtentry() to drive partition expansion
entirely off of PartitionDescs. The reason why this is necessary is
that it clearly will not work to have find_all_inheritors() use a
current snapshot to decide what children we have and lock them, and
then consult a different source of truth to decide which relations to
open with NoLock.  There's nothing to keep the lists of partitions
from being different in the two cases, and that demonstrably causes
assertion failures if you SELECT with an ATTACH/DETACH loop running in
the background. However, it also changes the order in which tables get
locked.  Possibly that could be fixed by teaching
expand_partitioned_rtentry() to qsort() the OIDs the way
find_inheritance_children() does.  It also loses the infinite-loop
protection which find_all_inheritors() has.  Not sure what to do about
that.

- In order for the new PartitionDirectory machinery to avoid
use-after-free bugs, we have to either copy the PartitionDesc out of
the relcache into the partition directory or avoid freeing it while it
is still in use.  Copying it seems unappealing for performance
reasons, so I took the latter approach.  However, what I did here in
terms of reclaiming memory is just about the least aggressive strategy
short of leaking it altogether - it just keeps it around until the
next rebuild that occurs while the relcache entry is not in use.  We
might want to do better, e.g. freeing old copies immediately as soon
as the relcache reference count drops to 0. I just did it this way
because it was simple to code.

- I tried this with Alvaro's isolation tests and it fails some tests.
That's because Alvaro's tests expect that the list of accessible
partitions is based on the query snapshot.  For reasons I attempted to
explain in the comments in 0003, I think the idea that we can choose
the set of accessible partitions based on the query snapshot is very
wrong.  For example, suppose transaction 1 begins, reads an unrelated
table to establish a snapshot, and then goes idle.  Then transaction 2
comes along, detaches a partition from an important table, and then
does crazy stuff with that table -- changes the column list, drops it,
reattaches it with different bounds, whatever.  Then it commits.  If
transaction 1 now comes along and uses the query snapshot to decide
that the detached partition ought to still be seen as a partition of
that partitioned table, disaster will ensue.

- I don't have any tests here, but I think it would be good to add
some, perhaps modeled on Alvaro's, and also some that involve multiple
ATTACH and DETACH operations mixed with other interesting kinds of
DDL.  I also didn't make any attempt to update the documentation,
which is another thing that will probably have to be done at some
point. Not sure how much documentation we have of any of this now.

- I am uncertain whether the logic that builds up the partition
constraint is really safe in the face of concurrent DDL.  I kinda
suspect there are some problems there, but maybe not.  Once you hold
ANY lock on a partition, the partition constraint can't concurrently
become stricter, because no ATTACH can be done without
AccessExclusiveLock on the partition being attached; but it could
concurrently become less strict, because you only need a lesser lock
for a detach.  Not 

Re: Switching to 64-bit Bitmapsets

2018-12-20 Thread David Rowley
On Fri, 21 Dec 2018 at 06:26, Tom Lane  wrote:
> Pushed with some fiddling with the comment.

Great. Thanks!

> I wasn't excited about the test case you offered --- on HEAD, it pretty
> much all devolves to file access operations (probably, checking the
> current length of all the child relations).  Instead I experimented
> with an old test case I had for a complex-to-plan query, and got these
> results:

oh meh. I forgot to mention I was running with plan_cache_mode =
force_generic_plan and max_parallel_workers_per_gather = 0. If you
tried without those then you'd have seen a massively different result.

Your test seems better, regardless.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: lock level for DETACH PARTITION looks sketchy

2018-12-20 Thread Alvaro Herrera
On 2018-Dec-20, Robert Haas wrote:

> One problem about which I thought is the partition check constraint.
> When we attach, we need to validate that the check constraint holds of
> every row in the partition, which means that we need to keep new rows
> from being added while we're attaching.  But when we detach, there's
> no corresponding problem.  If we detach a leaf partition, the check
> constraint just disappears; if we detach an intermediate partitioned
> table, the check constraint loses some clauses.  Either way, there's
> no way for a row that was previously acceptable to become unacceptable
> after the detach operation.  There is, however, the possibility that
> the invalidation messages generated by the detach operation might not
> get processed immediately by other backends, so those backends might
> continue to enforce the partition constraint for some period of time
> after it changes.  That seems OK.

Check.

> There would be trouble, though, if
> we freed the old copy of the partition constraint during a relcache
> rebuild while somebody still had a pointer to it.  I'm not sure that
> can happen, though.

(Reading the pg10 source) AFAICS the partition constraint is stored in
resultRelInfo->ri_PartitionCheck, not affected by relcache invals, so it
seems fine.  We also read a copy of it at plan time for constraint
exclusion purposes, similarly not affected by relcache inval.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Performance issue in foreign-key-aware join estimation

2018-12-20 Thread Tom Lane
In connection with David Rowley's proposal to change bitmapset.c to use
64-bit words, I dug out an old test case I had for a complex-to-plan query
(attached).  Andres Freund posted this to the lists perhaps ten years ago,
though I can't locate the original posting right now.

I was distressed to discover via perf that 69% of the runtime of this
test now goes into match_eclasses_to_foreign_key_col().  That seems
clearly unacceptable.  There aren't an unreasonable number of foreign key
constraints in the underlying schema --- mostly one per table, and there's
only one table with as many as 3.  However, poking around in the planner
data structures, it turns out there are:

888 base relations

1005 EquivalenceClasses

167815 fkey_list entries initially

690 fkey_list entries after match_foreign_keys_to_quals trims them

So the reason match_eclasses_to_foreign_key_col is so dominant in the
runtime is it's invoked 167815 times and has to scan a thousand
EquivalenceClasses (unsuccessfully) on most of those calls.

How did the fkey_list get that big?  I think the issue is that the
query touches the same tables many many times (888 baserels, but
there are only 20 distinct tables in the schema) and we get an
O(N^2) growth in the apparent number of FKs.

Clearly, we ought to rethink that data structure.  I'm not sure
offhand how to make it better, but this is pretty awful.

Perhaps there'd also be some use in having better indexing for
the EquivalenceClass list, but again I'm not sure what that'd
look like.

regards, tom lane

\set ON_ERROR_STOP on

DROP SCHEMA IF EXISTS test_data CASCADE;
DROP SCHEMA IF EXISTS test_view CASCADE;
CREATE SCHEMA test_data;
CREATE SCHEMA test_view;
SET SEARCH_PATH = test_data, test_view;

CREATE TABLE proband (
   proband_id bigserial PRIMARY KEY
);

CREATE TABLE sample (
   sample_id bigserial PRIMARY KEY
);


CREATE TABLE proband__sample (
   proband_id bigint NOT NULL REFERENCES proband,
   sample_id bigint NOT NULL REFERENCES sample,

   UNIQUE(proband_id, sample_id)
);

CREATE TABLE project (
  project_id bigserial PRIMARY KEY,
  name text NOT NULL UNIQUE
);

/*
 * Stuff like:
 * -Double Entry
 * -Single Entry
 * -Machine Read
 * - ...
 */
CREATE TABLE data_quality (
   data_quality_id bigserial PRIMARY KEY,
   name text NOT NULL UNIQUE,
   description text
);


CREATE TABLE information_set (
   information_set_id bigserial PRIMARY KEY,
   name text NOT NULL UNIQUE,
   description text
);

CREATE TABLE information (
   information_id bigserial PRIMARY KEY,
   information_set_id bigint NOT NULL REFERENCES information_set,
   name text NOT NULL UNIQUE,
   description text
);
CREATE INDEX information__information_set_id ON information 
(information_set_id);



CREATE TABLE information_set_instance (
   information_set_instance_id bigserial PRIMARY KEY,
   information_set_id bigint NOT NULL REFERENCES information_set
);
CREATE INDEX information_set_instance__information_set_id ON 
information_set_instance (information_set_id);

CREATE TABLE information_set_instance__proband (
   information_set_instance_id bigint NOT NULL REFERENCES 
information_set_instance,
   proband_id bigint NOT NULL REFERENCES proband,
   UNIQUE (information_set_instance_id, proband_id)
);
CREATE INDEX information_set_instance__proband__information_set_id ON 
information_set_instance__proband (information_set_instance_id);
CREATE INDEX information_set_instance__proband__proband_id ON 
information_set_instance__proband (proband_id);

CREATE TABLE information_set_instance__sample (
   information_set_instance_id bigint NOT NULL REFERENCES 
information_set_instance,
   sample_id bigint NOT NULL REFERENCES sample,
   UNIQUE (information_set_instance_id, sample_id)
);

CREATE INDEX information_set_instance__sample__information_set_id ON 
information_set_instance__sample (information_set_instance_id);
CREATE INDEX information_set_instance__sample__sample_id ON 
information_set_instance__sample (sample_id);

CREATE TABLE information_instance (
   information_instance_id bigserial PRIMARY KEY,
   information_set_instance_id bigint NOT NULL REFERENCES 
information_set_instance,
   information_id bigint NOT NULL REFERENCES information,
   data_quality_id int REFERENCES data_quality
);
CREATE INDEX information_instance__information_set_instance_id ON 
information_set_instance(information_set_instance_id);
CREATE INDEX information_instance__information_id ON 
information(information_id);

CREATE TABLE information_about_tnm (
   information_about_tnm bigserial PRIMARY KEY,
   information_instance_id bigint REFERENCES information_instance,
   t int,
   n int,
   m int
);
CREATE INDEX information_about_tnm__information_instance_id ON 
information_about_tnm(information_instance_id);


CREATE TABLE information_about_sequenced_data (
   

Re: lock level for DETACH PARTITION looks sketchy

2018-12-20 Thread Robert Haas
On Thu, Dec 20, 2018 at 9:29 AM Alvaro Herrera  wrote:
> I can patch that one too, but it's separate -- it goes back to pg10 I
> think (the other to pg11) -- and let's think about the lock mode for a
> bit: as far as I can see, ShareUpdateExclusive is enough; the difference
> with AccessExclusive is that it lets through modes AccessShare, RowShare
> and RowExclusive.  According to mvcc.sgml, that means we're letting
> SELECT, SELECT FOR SHARE/UPDATE and INS/UPD/DEL in the partition being
> detached, respectively.  All those seem acceptable to me.  All DDL is
> blocked, of course.

One problem about which I thought is the partition check constraint.
When we attach, we need to validate that the check constraint holds of
every row in the partition, which means that we need to keep new rows
from being added while we're attaching.  But when we detach, there's
no corresponding problem.  If we detach a leaf partition, the check
constraint just disappears; if we detach an intermediate partitioned
table, the check constraint loses some clauses.  Either way, there's
no way for a row that was previously acceptable to become unacceptable
after the detach operation.  There is, however, the possibility that
the invalidation messages generated by the detach operation might not
get processed immediately by other backends, so those backends might
continue to enforce the partition constraint for some period of time
after it changes.  That seems OK.  There would be trouble, though, if
we freed the old copy of the partition constraint during a relcache
rebuild while somebody still had a pointer to it.  I'm not sure that
can happen, though.

> So right now if you insert via the parent table, detaching the partition
> would be blocked because we also acquire lock on the parent (and detach
> acquires AccessExclusive on the parent).  But after DETACH CONCURRENTLY,
> inserting via the parent should let rows to the partition through,
> because there's no conflicting lock to stop them.  Inserting rows to the
> partition directly would be let through both before and after DETACH
> CONCURRENTLY.

Yeah, that's worrisome.  Initially, I thought it was flat-out insane,
because if somebody modifies that table in some way that makes it
unsuitable as an insertion target for the existing stream of tuples --
adding constraints, changing column definitions, attaching to a new
parent, dropping -- then we'd be in big trouble.  Then I thought that
maybe it's OK, because all of those operations take
AccessExclusiveLock and therefore even if the detach goes through
without a lock, the subsequent change that is needed to create a
problem can't. I can't say that I'm 100% convinced that's bulletproof,
though. Suppose for example that a single session opens a cursor, then
does the detach, then does something else, then tries to read more
from the cursor.  Now the locks don't conflict, but we're still OK (I
think) because those operations should all CheckTableNotInUse().  But
if we try to lower the lock level for some other things in the future,
we might open up other problems here, and it's unclear to me what
guiding principles we can rely on here to keep us out of trouble.

> One note about DETACH CONCURRENTLY: detaching the index from parent now
> uses AccessExclusiveLock (the block I just patched).  For concurrent
> detach that won't work, so we'll need to reduce that lock level too. Not
> sure offhand if there are any problems with that.

Me neither.  I'm pretty sure that you need at least
ShareUpdateExclusiveLock, but I don't know whether you need
AccessExclusiveLock.  Here again, the biggest risk I can see is the
sinval race - nothing forces the invalidation messages to be read
before opening the relation for read or write.  I think it would be
nice if it ended up that the lock level is the same for the table and
the indexes, because otherwise the legal sequences of operations vary
depending on whether you've got a table with inherited indexes or one
without, and that multiplies the difficulty of testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Switching to 64-bit Bitmapsets

2018-12-20 Thread Tom Lane
David Rowley  writes:
> On Thu, 20 Dec 2018 at 17:50, Tom Lane  wrote:
>> Hm, are you thinking of making BITS_PER_BITMAPWORD match sizeof(Pointer)
>> or something like that?  That seems like a good compromise from here.

> Yeah, something along those lines.  I've implemented that in the attached.

Pushed with some fiddling with the comment.

I wasn't excited about the test case you offered --- on HEAD, it pretty
much all devolves to file access operations (probably, checking the
current length of all the child relations).  Instead I experimented
with an old test case I had for a complex-to-plan query, and got these
results:

HEAD, asserts off:

pgbench reports:
latency average = 11704.992 ms
tps = 0.085434 (including connections establishing)
tps = 0.085437 (excluding connections establishing)

Top hits according to perf:
+   65.42%  65.16%  158499  postmaster  postgres  [.] match_eclasses_to_foreign_key_col
+    6.85%   6.82%   16634  postmaster  postgres  [.] bms_overlap
+    6.25%   6.22%   15173  postmaster  postgres  [.] generate_join_implied_equalities_for_ecs
+    3.91%   3.88%    9465  postmaster  postgres  [.] make_canonical_pathkey
+    3.04%   3.01%    7351  postmaster  postgres  [.] have_relevant_eclass_joinclause
+    2.06%   2.05%    4992  postmaster  postgres  [.] bms_is_subset
+    1.19%   1.18%    2887  postmaster  postgres  [.] equal
+    0.95%   0.95%    2309  postmaster  postgres  [.] get_eclass_for_sort_expr
+    0.90%   0.90%    2189  postmaster  postgres  [.] add_paths_to_joinrel

With patch:

latency average = 10741.595 ms
tps = 0.093096 (including connections establishing)
tps = 0.093099 (excluding connections establishing)

+   69.03%  68.76%  178278  postmaster  postgres  [.] match_eclasses_to_foreign_key_col
+    5.85%   5.82%   15138  postmaster  postgres  [.] generate_join_implied_equalities_for_ecs
+    4.58%   4.55%   11842  postmaster  postgres  [.] bms_overlap
+    4.38%   4.36%   11338  postmaster  postgres  [.] make_canonical_pathkey
+    2.77%   2.75%    7155  postmaster  postgres  [.] have_relevant_eclass_joinclause
+    1.30%   1.29%    3364  postmaster  postgres  [.] equal
+    1.26%   1.25%    3261  postmaster  postgres  [.] bms_is_subset
+    1.06%   1.05%    2722  postmaster  postgres  [.] get_eclass_for_sort_expr
+    0.70%   0.70%    1813  postmaster  postgres  [.] add_paths_to_joinrel

Ignoring the, um, elephant in the room, this shows a pretty clear win
for the performance of bms_overlap and bms_is_subset, confirming
the thesis that this change is worthwhile.

I'll start a separate thread about match_eclasses_to_foreign_key_col.

regards, tom lane



Re: A case for UPDATE DISTINCT attribute

2018-12-20 Thread Alexey Bashtanov

Hello Gajus,

I have observed that the following pattern is repeating in our data 
management programs:


UPDATE
  event
SET
  fuid = ${fuid},
  venue_id = ${venueId},
  url = ${url}
WHERE
  id = ${id} AND
  fuid IS != ${fuid} AND
  venue_id IS != ${venueId} AND
  url IS DISTINCT FROM ${url};


...
Meanwhile, a WHERE condition that excludes rows with matching values 
makes this into a noop in case of matching target column values.


For this to hold, you need your conditions in WHERE to be ORed, not ANDed.
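
For example, a version of the original statement that really is a no-op
when nothing changes could be written as follows (a sketch reusing the
placeholders from the quoted query):

UPDATE event
   SET fuid = ${fuid},
       venue_id = ${venueId},
       url = ${url}
 WHERE id = ${id}
   AND (fuid IS DISTINCT FROM ${fuid} OR
        venue_id IS DISTINCT FROM ${venueId} OR
        url IS DISTINCT FROM ${url});

That is, the row is updated if at least one of the target columns would
actually change.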



UPDATE DISTINCT
  event
SET
  fuid = ${fuid},
  venue_id = ${venueId},
  url = ${url}
WHERE
  id = ${id}

would encourage greater adoption of such pattern.

Is there a technical reason this does not exist already?



At least a bunch of questions and concerns arise, like these:

1) We cannot treat it as syntactic sugar only and just expand it at the
parsing stage, as the expression generating the assigned value may be
volatile, like UPDATE ... SET ... = random();
2) How should this interact with triggers? E.g. when NEW and OLD were
the same before BEFORE UPDATE trigger execution, but would be different
after, or vice versa.  Should such rows be included in transition
tables?
3) Should the RETURNING clause return the non-updated rows?
4) It must not be easy to guarantee anything if there is a FROM clause
and a target row is present in the join more than once.
5) We need to fail correctly if one of the column data types doesn't
have an equality operator.


Best regards,
  Alexey


Re: A few new options for vacuumdb

2018-12-20 Thread Bossart, Nathan
Hi Michael,

Thanks for taking a look.

On 12/19/18, 8:05 PM, "Michael Paquier"  wrote:
> On Wed, Dec 19, 2018 at 08:50:10PM +, Bossart, Nathan wrote:
>> If an option is specified for a server version that is not supported,
>> the option is silently ignored.  For example, SKIP_LOCKED was only
>> added to VACUUM and ANALYZE for v12.  Alternatively, I think we could
>> fail in vacuum_one_database() if an unsupported option is specified.
>> Some of these options will work on all currently supported versions,
>> so I am curious what others think about skipping some of these version
>> checks altogether.
>
> prepare_vacuum_command() already handles that by ignoring silently
> unsupported options (see the case of SKIP_LOCKED).  So why not doing the
> same?

The --skip-locked option in vacuumdb is part of 0002, so I don't think
there's much precedent here.  We do currently fall back to the
unparenthesized syntax for VACUUM for versions before 9.0, so there is
some version handling already.  What do you think about removing
version checks for unsupported major versions?  If we went that route,
these new patches could add version checks only for options that were
added in a supported major version (e.g. mxid_age() was added in 9.5).
Either way, we'll still have to decide whether to fail or to silently
skip the option if you do something like specify --min-mxid-age for a
9.4 server.

>> It does not seem clear whether the user wants us to process mytable
>> only if it is at least 1 GB, or if we should process mytable in
>> addition to any other relations over 1 GB.  Either way, I think trying
>> to support these combinations of options adds more complexity than it
>> is worth.
>
> It seems to me that a combination of both options means that the listed
> table should be processed only if its minimum size is 1GB.  If multiple
> tables are specified with --table, then only those reaching 1GB would be
> processed.  So this restriction can go away.  The same applies for the
> proposed --min-xid-age and --min-mxid-age.

Okay.  This should be pretty easy to do using individual catalog
lookups before we call prepare_vacuum_command().  Alternatively, I
could add a clause for each specified table in the existing query in
vacuum_one_database().  At that point, it may be cleaner to just use
that catalog query in all cases instead of leaving paths where tables
can remain NULL.
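
For what it's worth, a rough sketch of the kind of per-database catalog
query that could drive these filters; the parameter placeholders and the
exact set of predicates are illustrative only, not what the patches do
today:

SELECT c.oid::regclass
  FROM pg_class c
 WHERE c.relkind IN ('r', 'm')
   AND pg_total_relation_size(c.oid) >= :min_relation_size
   AND age(c.relfrozenxid) >= :min_xid_age
   AND mxid_age(c.relminmxid) >= :min_mxid_age;

Any --table targets would then just become extra restrictions on top of
the same query.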

> +   <para>
> +    Only execute the vacuum or analyze commands on tables with a multixact
> +    ID age of at least <replaceable class="parameter">mxid_age</replaceable>.
> +   </para>
> Adding a link to explain the multixact business may be helpful, say
> vacuum-for-multixact-wraparound.  Same comment for XID.

Good idea.

> There is an argument about also adding DISABLE_PAGE_SKIPPING.

I'll add this in the next set of patches.  I need to add the
parenthesized syntax and --skip-locked for ANALYZE in
prepare_vacuum_command(), too.

Nathan



Re: insensitive collations

2018-12-20 Thread Daniel Verite
Tom Lane wrote:

> I don't really find it "natural" for equality to consider obviously
> distinct values to be equal.

According to https://www.merriam-webster.com/dictionary/natural
"natural" has no less than 15 meanings. The first in the list is
  "based on an inherent sense of right and wrong"
which I admit is not what we want to imply in this context.

The meaning that I was thinking about was close to definitions
4:  "following from the nature of the one in question "
or 7:   "having a specified character by nature "
or 13:  "closely resembling an original : true to nature"

When postgres uses the comparison from a collation
with no modification whatsoever, it's true to that collation.
When it changes the result from equal to non-equal, it's not.
If a collation says that "ABC" = "abc" and postgres says, mmh, OK
thanks but I'll go with "ABC" != "abc", then that denatures the
collation, in the sense of:
  "to deprive of natural qualities : change the nature of"
(https://www.merriam-webster.com/dictionary/denature)

Aside from that, I'd be +1 for "linguistic" as the opposite of
"bytewise", I think it tends to be easily understood when expressing
that a strcoll()-like function is used as opposed to a strcmp()-like
function.

I'm -1 for "deterministic" as a replacement for "bytewise". Even
if Unicode has choosen that term for exactly the behavior we're talking
about, it's heavily used in the more general sense of:
   "given a particular input, will always produce the same output"
 (quoted from https://en.wikipedia.org/wiki/Deterministic_algorithm)
which we very much expect from all our string comparisons no matter the
flags we may put on the collations. "bytewise" might be less academic
but it has less potential for wrong interpretations.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: lock level for DETACH PARTITION looks sketchy

2018-12-20 Thread Alvaro Herrera
On 2018-Dec-19, Robert Haas wrote:

> On Wed, Dec 19, 2018 at 2:44 PM Alvaro Herrera  
> wrote:
> > Oh, I remember eyeing that suspiciously, but was too distracted on
> > making the other thing work to realize it was actually wrong :-(
> > I agree that it's wrong.
> 
> OK, cool.  If you're going to push a fix for the other changes, do you
> want to do this one too, or should I fix it separately?

I can patch that one too, but it's separate -- it goes back to pg10 I
think (the other to pg11) -- and let's think about the lock mode for a
bit: as far as I can see, ShareUpdateExclusive is enough; the difference
with AccessExclusive is that it lets through modes AccessShare, RowShare
and RowExclusive.  According to mvcc.sgml, that means we're letting
SELECT, SELECT FOR SHARE/UPDATE and INS/UPD/DEL in the partition being
detached, respectively.  All those seem acceptable to me.  All DDL is
blocked, of course.

So right now if you insert via the parent table, detaching the partition
would be blocked because we also acquire lock on the parent (and detach
acquires AccessExclusive on the parent).  But after DETACH CONCURRENTLY,
inserting via the parent should let rows to the partition through,
because there's no conflicting lock to stop them.  Inserting rows to the
partition directly would be let through both before and after DETACH
CONCURRENTLY.

One note about DETACH CONCURRENTLY: detaching the index from parent now
uses AccessExclusiveLock (the block I just patched).  For concurrent
detach that won't work, so we'll need to reduce that lock level too. Not
sure offhand if there are any problems with that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Index Skip Scan

2018-12-20 Thread Dmitry Dolgov
> On Wed, Nov 21, 2018 at 9:56 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov 
> >  wrote:
> >
> > On 11/18/18 02:27, Dmitry Dolgov wrote:
> > >
> > > [0001-Index-skip-scan-v4.patch]
> >
> > I ran a couple of tests on this, please see the cases below. As before,
> > I'm setting total_cost = 1 for index skip scan so that it is chosen.
> > Case 1 breaks because we determine the high key incorrectly, it is the
> > second tuple on page or something like that, not the last tuple.
>
> From what I see it wasn't about the high key, just a regular off by one error.
> But anyway, thanks for noticing - for some reason it wasn't always
> reproducible for me, so I missed this issue. Please find fixed patch 
> attached.
> Also I think it invalidates my previous performance tests, so I would
> appreciate if you can check it out too.

I've performed some testing, and on my environment with a dataset of 10^7
records:

* everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip
  scan.

* above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique
  records it was about 20% slower than the regular index scan.

For me these numbers sound good, since even in the quite extreme case of
approximately 10 records per group the performance of index skip scan is
close to that of the regular index-only scan.
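
For reference, the shape of query being measured is the classic skip
scan target: a DISTINCT over the leading column of an index with
relatively few groups, roughly like this (a sketch; table and column
names are made up):

CREATE TABLE t AS
    SELECT i % 1000 AS a, i AS b
      FROM generate_series(1, 10000000) i;
CREATE INDEX ON t (a, b);
VACUUM ANALYZE t;

SELECT DISTINCT a FROM t;

With the patch the DISTINCT can be answered by skipping between distinct
values of "a" in the index instead of reading every index entry.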

> On Tue, Dec 4, 2018 at 4:26 AM Peter Geoghegan  wrote:
>
> On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > Well, no, it's callled with ScanDirectionIsForward(dir). But as far as I
> > remember from the previous discussions the entire topic of backward scan is
> > questionable for this patch, so I'll try to invest some time in it.
>
> Another thing that I think is related to skip scans that you should be
> aware of is dynamic prefix truncation, which I started a thread on
> just now [1]. While I see one big problem with the POC patch I came up
> with, I think that that optimization is going to be something that
> ends up happening at some point. Repeatedly descending a B-Tree when
> the leading column is very low cardinality can be made quite a lot
> less expensive by dynamic prefix truncation. Actually, it's almost a
> perfect case for it.

Thanks, sounds cool. I'll try it out as soon as I'll have some spare time.



START/END line number for COPY FROM

2018-12-20 Thread Surafel Temesgen
Hi,

Currently we can skip the header line on COPY FROM, but the ability to
skip to and stop copying at any line could be used to split a long copy
operation, to copy only a subset of the file, and to skip a footer.
Attached is a patch for it.
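
For example, with the proposed options something like the following
would load only lines 1001 through 2000 of the input file (syntax as
implemented in the attached patch; names and details may still change):

COPY measurements FROM '/tmp/measurements.csv'
    WITH (FORMAT csv, START 1001, END 2000);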


Regards

Surafel
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 411941ed31..86f9a6a905 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -43,6 +43,8 @@ COPY { table_name [ ( column_name [, ...] )
 FORCE_NULL ( column_name [, ...] )
 ENCODING 'encoding_name'
+START starting_line_number
+END ending_line_number
 
  
 
@@ -353,6 +355,24 @@ COPY { table_name [ ( 

 
+   
+START
+
+ 
+  Specifies the line number to begin copying.
+ 
+
+   
+
+   
+END
+
+ 
+  Specifies the line number to end copying.
+ 
+
+   
+
   
  
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4311e16007..4d07716a9f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -121,6 +121,8 @@ typedef struct CopyStateData
 	int			file_encoding;	/* file or remote side's character encoding */
 	bool		need_transcoding;	/* file encoding diff from server? */
 	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte? */
+	int		start_postion;	/* copying start line */
+	int		end_postion;	/* copying end line */
 
 	/* parameters from the COPY command */
 	Relation	rel;			/* relation to copy to or from */
@@ -347,6 +349,7 @@ static void CopySendInt32(CopyState cstate, int32 val);
 static bool CopyGetInt32(CopyState cstate, int32 *val);
 static void CopySendInt16(CopyState cstate, int16 val);
 static bool CopyGetInt16(CopyState cstate, int16 *val);
+static void skipLines(CopyState cstate);
 
 
 /*
@@ -1223,6 +1226,34 @@ ProcessCopyOptions(ParseState *pstate,
 defel->defname),
 		 parser_errposition(pstate, defel->location)));
 		}
+		else if (strcmp(defel->defname, "start") == 0)
+		{
+			if (cstate->start_postion)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("conflicting or redundant options"),
+		 parser_errposition(pstate, defel->location)));
+			cstate->start_postion = defGetInt64(defel);
+			if (cstate->start_postion < 0)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+		 errmsg("invalid START line number"),
+		 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "end") == 0)
+		{
+			if (cstate->end_postion)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("conflicting or redundant options"),
+		 parser_errposition(pstate, defel->location)));
+			cstate->end_postion = defGetInt64(defel);
+			if (cstate->end_postion < 0)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+		 errmsg("invalid END line number"),
+		 parser_errposition(pstate, defel->location)));
+		}
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_SYNTAX_ERROR),
@@ -1373,6 +1404,13 @@ ProcessCopyOptions(ParseState *pstate,
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("CSV quote character must not appear in the NULL specification")));
+
+	if (cstate->end_postion != 0 &&
+		cstate->start_postion > cstate->end_postion)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("START line number can not be greater then END line number")));
+
 }
 
 /*
@@ -2317,6 +2355,7 @@ CopyFrom(CopyState cstate)
 	uint64		lastPartitionSampleLineNo = 0;
 	uint64		nPartitionChanges = 0;
 	double		avgTuplesPerPartChange = 0;
+	uint64		end_line = 1;
 
 	Assert(cstate->rel);
 
@@ -2619,6 +2658,20 @@ CopyFrom(CopyState cstate)
 	has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
    resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
 
+	if (cstate->start_postion)
+	{
+		end_line = cstate->start_postion;
+		skipLines(cstate);
+	}
+
+	/* throw the header line away means start copying at second line */
+	if (cstate->start_postion == 0 && cstate->header_line)
+	{
+		cstate->start_postion = 2;
+		end_line = cstate->start_postion;
+		skipLines(cstate);
+	}
+
 	/*
 	 * Check BEFORE STATEMENT insertion triggers. It's debatable whether we
 	 * should do this for COPY, since it's not really an "INSERT" statement as
@@ -2644,6 +2697,9 @@ CopyFrom(CopyState cstate)
 		TupleTableSlot *slot;
 		bool		skip_tuple;
 
+		if (cstate->end_postion !=0 && cstate->end_postion < end_line++)
+			break;
+
 		CHECK_FOR_INTERRUPTS();
 
 		if (nBufferedTuples == 0)
@@ -3394,14 +3450,6 @@ NextCopyFromRawFields(CopyState cstate, char ***fields, int *nfields)
 	/* only available for text or csv input */
 	Assert(!cstate->binary);
 
-	/* on input just throw the header line away */
-	if (cstate->cur_lineno == 0 && cstate->header_line)
-	{
-		cstate->cur_lineno++;
-		if (CopyReadLine(cstate))
-			return false;		/* done */
-	}
-
 	cstate->cur_lineno++;
 
 	/* Actually read the line into memory here 

Re: Add timeline to partial WAL segments

2018-12-20 Thread David Steele

On 12/15/18 1:56 AM, Michael Paquier wrote:

On Fri, Dec 14, 2018 at 06:05:18PM -0500, David Steele wrote:

On 12/14/18 3:26 PM, Robert Haas wrote:

The new TLI is the only thing that is guaranteed to be unique with
each new promotion, and I would guess that it is therefore the right
thing to use to disambiguate them.


This is the conclusion we came to after a few months of diagnosing and
working on this problem.


Hm.  Okay.  From that point of view I think that we should also make
pg_receivewal able to generate the same kind of segment format when it
detects a timeline switch for the last partial segment of the previous
timeline.  Switching to a new timeline makes the streaming finish, so we
could use that to close properly the WAL segment with the new name.  At
the same time it would be nice to have a macro able to generate a
partial segment name and also check after it.


Or perhaps just always add the timeline to the .partial?  That way it 
doesn't need to be renamed later.  Also, there would be a consistent 
name, rather than sometimes .partial, sometimes ..partial.



The question in my mind: is it safe to back-patch?


That's unfortunately a visibility change, meaning that any existing
backup solution able to handle .partial segments would get broken in a
minor release.  From that point of view this should go only on HEAD.


That means every archive command needs to deal with this issue on its 
own.  It's definitely not a trivial thing to do.


It's obviously strong to call this a bug, but there's very clearly an 
issue here.  I wonder if there is anything else we can do that would work?



The other patch to reorder the priority of the .ready files could go
down without any problem though.


Glad to hear it.

--
-David
da...@pgmasters.net



Re: Change pgarch_readyXlog() to return .history files first

2018-12-20 Thread David Steele

On 12/15/18 2:10 AM, Michael Paquier wrote:

On Fri, Dec 14, 2018 at 08:43:20AM -0500, David Steele wrote:

On 12/13/18 7:15 PM, Michael Paquier wrote:



Or you could just use IsTLHistoryFileName here?


We'd have to truncate .ready off the string to make that work, which
seems easy enough.  Is that what you were thinking?


Yes, that's the idea.  pgarch_readyXlog returns the segment name which
should be archived, so you could just compute it after detecting a
.ready file.


One thing to consider is that the check above is more efficient than
IsTLHistoryFileName(), and it potentially gets run a lot.


This check misses strspn(), so any file finishing with .history would
get eaten even if that's unlikely to happen.


Good point.  The new patch uses IsTLHistoryFileName().


If one wants to simply check this code, you can just create dummy orphan
files in archive_status and see in which order they get removed.


Seems awfully racy.  Are there currently any tests like this for the
archiver that I can look at extending?


Sorry for the confusion, I was referring to manual testing here.


Ah, I see.  Yes, that's exactly how I tested it, in addition to doing 
real promotions.



Thinking about it, we could have an automatic test to check for the file
order pattern by creating dummy files, starting the server with archiver
enabled, and then parse the logs as orphan .ready files would get
removed in the order their archiving is attempted with one WARNING entry
generated for each.  I am not sure if that is worth a test though.


Yes, parsing the logs was the best thing I could think of, too.

--
-David
da...@pgmasters.net
From 1987a90b724506c7d9e453d9a976e3f982f625ec Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Thu, 20 Dec 2018 13:28:51 +0200
Subject: [PATCH] Change pgarch_readyXlog() to return .history files first.

The alphabetical ordering of pgarch_readyXlog() means that on promotion
000000010000000100000001.partial will be archived before 00000002.history.

This appears harmless, but the .history files are what other potential primaries
use to decide what timeline they should pick.  The additional latency of
compressing/transferring the much larger partial file means that archiving of
the .history file is delayed and greatly increases the chance that another
primary will promote to the same timeline.

Teach pgarch_readyXlog() to return .history files first (and in order) to reduce
the window where this can happen.  This won't prevent all conflicts, but it is a
simple change and should greatly reduce real-world occurrences.
---
 src/backend/postmaster/pgarch.c | 36 +++--
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e88d545d65..50dc2027b1 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -695,11 +695,11 @@ pgarch_archiveXlog(char *xlog)
  * 2) because the oldest ones will sooner become candidates for
  * recycling at time of checkpoint
  *
- * NOTE: the "oldest" comparison will presently consider all segments of
- * a timeline with a smaller ID to be older than all segments of a timeline
- * with a larger ID; the net result being that past timelines are given
- * higher priority for archiving.  This seems okay, or at least not
- * obviously worth changing.
+ * NOTE: the "oldest" comparison will consider any .history file to be older
+ * than any other file except another .history file.  Segments on a timeline
+ * with a smaller ID will be older than all segments on a timeline with a larger
+ * ID; the net result being that past timelines are given higher priority for
+ * archiving.  This seems okay, or at least not obviously worth changing.
  */
 static bool
 pgarch_readyXlog(char *xlog)
@@ -715,6 +715,7 @@ pgarch_readyXlog(char *xlog)
DIR*rldir;
struct dirent *rlde;
boolfound = false;
+   boolhistoryFound = false;
 
snprintf(XLogArchiveStatusDir, MAXPGPATH, XLOGDIR "/archive_status");
rldir = AllocateDir(XLogArchiveStatusDir);
@@ -728,12 +729,28 @@ pgarch_readyXlog(char *xlog)
strspn(rlde->d_name, VALID_XFN_CHARS) >= basenamelen &&
strcmp(rlde->d_name + basenamelen, ".ready") == 0)
{
-   if (!found)
+/* Truncate off the .ready */
+rlde->d_name[basenamelen] = '\0';
+
+   /* Is this a history file? */
+   bool history = IsTLHistoryFileName(rlde->d_name);
+
+   /*
+* If this is the first file or the first history file, copy it
+*/
+   if (!found || (history && !historyFound))
{
strcpy(newxlog, rlde->d_name);
found = true;
+   

Re: Tid scan improvements

2018-12-20 Thread David Rowley
Review of v5:

0001: looks good.

0002:

1. I don't think you need palloc0() here. palloc() looks like it would be fine.

if (tidRangeArray->ranges == NULL)
tidRangeArray->ranges = (TidRange *)
palloc0(tidRangeArray->numAllocated * sizeof(TidRange));

if that wasn't the case, then you'll need to also zero the additional
memory when you repalloc().

2. Can't the following code be moved into the correct
forwards/backwards if block inside the if inscan block above?

/* If we've finished iterating over the ranges, exit the loop. */
if (node->tss_CurrentTidRange >= numRanges ||
node->tss_CurrentTidRange < 0)
break;

Something like:

if (bBackward)
{
if (node->tss_CurrentTidRange < 0)
{
/* initialize for backward scan */
node->tss_CurrentTidRange = numRanges - 1;
}
else if (node->tss_CurrentTidRange == 0)
break;
else
node->tss_CurrentTidRange--;
 }
else
{
if (node->tss_CurrentTidRange < 0)
{
/* initialize for forward scan */
node->tss_CurrentTidRange = 0;
}
else if (node->tss_CurrentTidRange >= numRanges - 1)
break;
else
node->tss_CurrentTidRange++;
}

I think that's a few less lines and instructions and (I think) a bit neater too.

3. if (found_quals != NIL) (yeah, I know there's already lots of
places not doing this)

/* If we found any, make an AND clause out of them. */
if (found_quals)

likewise in:

/* Use a range qual if any were found. */
if (found_quals)

4. The new tests in tidscan.sql should drop the newly created tables.
(I see some get dropped in the 0004 patch, but not all. Best not to
rely on a later patch to do work that this patch should do)

0003: looks okay.

0004:

5. Please add a comment to scandir in:

typedef struct TidScan
{
Scan scan;
List*tidquals; /* qual(s) involving CTID = something */
ScanDirection scandir;
} TidScan;

/* forward or backward or don't care */ would do.

Likewise for struct TidPath. Likely IndexPath can be used for guidance.

6. Is it worth adding a Merge Join regression test for this patch?

Something like:

postgres=# explain select * from t1 inner join t1 t2 on t1.ctid =
t2.ctid order by t1.ctid desc;
 QUERY PLAN
-
 Merge Join  (cost=0.00..21.25 rows=300 width=14)
   Merge Cond: (t1.ctid = t2.ctid)
   ->  Tid Scan Backward on t1  (cost=0.00..8.00 rows=300 width=10)
   ->  Materialize  (cost=0.00..8.75 rows=300 width=10)
 ->  Tid Scan Backward on t1 t2  (cost=0.00..8.00 rows=300 width=10)
(5 rows)

0005:

7. I see the logic behind this new patch, but quite possibly the
majority of the time the relpages will be out of date and you'll
mistakenly apply this to a page that's not the final page. I'm neither
here nor there with it. I imagine you might feel the same since you didn't
merge it with 0001. Maybe we can leave it out for now and see what
others think.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Using POPCNT and other advanced bit manipulation instructions

2018-12-20 Thread Jose Luis Tallon

On 20/12/18 6:53, David Rowley wrote:

Back in 2016 [1] there was some discussion about using the POPCNT
instruction to improve the performance of counting the number of bits
set in a word.  Improving this helps various cases, such as
bms_num_members and also things like counting the allvisible and
frozen pages in the visibility map.

[snip]

I've put together a very rough patch to implement using POPCNT and the
leading and trailing 0-bit instructions to improve the performance of
bms_next_member() and bms_prev_member().  The correct function should
be determined on the first call to each function by way of setting a
function pointer variable to the most suitable supported
implementation.   I've not yet gone through and removed all the
number_of_ones[] arrays to replace with a pg_popcount*() call.


IMVHO: Please do not disregard potential optimization by the compiler 
around those calls.. o_0  That might explain the reduced performance 
improvement observed.


Not that I can see any obvious alternative to your implementation right 
away ...



That
seems to have mostly been done in Thomas' patch [3], part of which
I've used for the visibilitymap.c code changes.  If this patch proves
to be possible, then I'll look at including the other changes Thomas
made in his patch too.

What I'm really looking for by posting now are reasons why we can't do
this. I'm also interested in getting some testing done on older
machines, particularly machines with processors that are from before
2007, both AMD and Intel.


I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test 
on. Feel free to toss me some test code.


cpuinfo flags:    fpu de tsc msr pae mce cx8 apic mca cmov pat clflush 
mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow 
rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm 
cmp_legacy 3dnowprefetch vmmcall



  2007-2008 seems to be around the time both
AMD and Intel added support for POPCNT and LZCNT, going by [4].


Thanks





[patch] de.po REINDEX error

2018-12-20 Thread Christoph Berg
de.po's error message when you try to "REINDEX DATABASE otherdb" has
been the wrong way round since 2011:

diff --git a/src/backend/po/de.po b/src/backend/po/de.po
index ca52df6731..6aa4354d82 100644
--- a/src/backend/po/de.po
+++ b/src/backend/po/de.po
@@ -7540,7 +7540,7 @@ msgstr "Tabelle »%s« hat keine Indexe"
 #: commands/indexcmds.c:2329
 #, c-format
 msgid "can only reindex the currently open database"
-msgstr "aktuell geöffnete Datenbank kann nicht reindiziert werden"
+msgstr "nur die aktuell geöffnete Datenbank kann reindiziert werden"
 
 #: commands/indexcmds.c:2435
 #, c-format

Spotted by a participant during a PostgreSQL training.

Christoph



Re: Remove Deprecated Exclusive Backup Mode

2018-12-20 Thread David Steele

On 12/20/18 1:35 AM, Michael Paquier wrote:

On Wed, Dec 19, 2018 at 06:38:00PM +0200, David Steele wrote:

I'll push this to the first post-PG12 CF when one appears.  Can we just go
ahead and create a 2019-07 CF now?  I know we generally discuss this at
PGCon, but we can always rename it, right?


The schedule gets decided at the developer meeting, still the commit
fests created can have their name and start/end dates modified as needed
So, I think that it is not an issue to add one ahead now.  And here you
go:
https://commitfest.postgresql.org/23/

This request was going to pop out at some point anyway in the follow-up
months.


Thanks, Michael.  Getting this error, not sure what it means:

Cannot move patch to the same commitfest, and multiple future 
commitfests exist!


--
-David
da...@pgmasters.net



Re: [HACKERS] WAL logging problem in 9.4.3?

2018-12-20 Thread Kyotaro HORIGUCHI
Hello.

At Fri, 30 Nov 2018 18:27:05 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote 
in 
> > On Wed, Nov 14, 2018 at 4:48 AM Kyotaro HORIGUCHI 
> >  wrote:
> >
> > 0004 was shot by e9edc1ba0b. Rebased to the current HEAD.
> > Successfully built and passeed all regression/recovery tests
> > including additional recovery/t/016_wal_optimize.pl.
> 
> Thank you for working on this patch. Unfortunately, cfbot complains that
> v4-0004-Fix-WAL-skipping-feature.patch could not be applied without conflicts.
> Could you please post a rebased version one more time?

Thanks. Here's the rebased version. I found no other amendment
required other than the apparent conflict.


> > On Fri, Jul 27, 2018 at 9:26 PM Andrew Dunstan 
> >  wrote:
> >
> > On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> > > On 18/07/18 16:29, Robert Haas wrote:
> > >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier
> > >>  wrote:
> >  What's wrong with the approach proposed in
> >  http://postgr.es/m/55afc302.1060...@iki.fi ?
> > >>>
> > >>> For back-branches that's very invasive so that seems risky to me
> > >>> particularly seeing the low number of complaints on the matter.
> > >>
> > >> Hmm. I think that if you disable the optimization, you're betting that
> > >> people won't mind losing performance in this case in a maintenance
> > >> release.  If you back-patch Heikki's approach, you're betting that the
> > >> committed version doesn't have any bugs that are worse than the status
> > >> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
> > >> isn't all there yet, but that seems like something we can work
> > >> towards.  If we just give up and disable the optimization, we won't
> > >> know how many people we ticked off or how badly until after we've done
> > >> it.
> > >
> > > Yeah. I'm not happy about backpatching a big patch like what I
> > > proposed, and Kyotaro developed further. But I think it's the least
> > > bad option we have, the other options discussed seem even worse.
> > >
> > > One way to review the patch is to look at what it changes, when
> > > wal_level is *not* set to minimal, i.e. what risk or overhead does it
> > > pose to users who are not affected by this bug? It seems pretty safe
> > > to me.
> > >
> > > The other aspect is, how confident are we that this actually fixes the
> > > bug, with least impact to users using wal_level='minimal'? I think
> > > it's the best shot we have so far. All the other proposals either
> > > don't fully fix the bug, or hurt performance in some legit cases.
> > >
> > > I'd suggest that we continue based on the patch that Kyotaro posted at
> > > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> > >
> > I have just spent some time reviewing Kyotaro's patch. I'm a bit
> > nervous, too, given the size. But I'm also nervous about leaving things
> > as they are. I suspect the reason we haven't heard more about this is
> > that these days use of "wal_level = minimal" is relatively rare.
> 
> I'm totally out of context of this patch, but reading this makes me nervous
> too. Taking into account that the problem now is lack of review, do you have
> plans to spend more time reviewing this patch?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 120f3f1d4dc47eb74a6ad7fde3c116e31b8eab3e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/016_wal_optimize.pl | 192 
 1 file changed, 192 insertions(+)
 create mode 100644 src/test/recovery/t/016_wal_optimize.pl

diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 00..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging is optimized away in some cases for TRUNCATE and COPY, and
+# these optimizations can interact badly with each other depending on the
+# wal_level setting, particularly with "minimal" or "replica".  The
+# optimization may be enabled or disabled depending on the scenarios dealt
+# with here, and should never result in any type of failure or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Set up a node with the wal_level value being tested
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+	$node->start;
+
+	# Test direct truncation optimization.  No tuples
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test1 (id serial PRIMARY KEY);
+		TRUNCATE test1;
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+
+	my $result = 

Improve selectivity estimate for range queries

2018-12-20 Thread Yuzuko Hosoya
Hi,

I found the problem about selectivity estimate for range queries
on master as follows.  This is my test case:

postgres=# CREATE TABLE test(id int, val text);
CREATE TABLE
postgres=# INSERT INTO test SELECT i, 'testval' FROM generate_series(0,29999)i;
INSERT 0 30000
postgres=# VACUUM analyze test;
VACUUM
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE id > 0 and id < 10000;
                                              QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..613.00 rows=150 width=12) (actual time=0.044..21.371 rows=9999
loops=1)
   Filter: ((id > 0) AND (id < 10000))
   Rows Removed by Filter: 20001
 Planning Time: 0.099 ms
 Execution Time: 37.142 ms
(5 rows)

Although the actual number of rows was 9999, the estimated number
of rows was 150.
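
For reference, DEFAULT_RANGE_INEQ_SEL is 0.005, so 30000 * 0.005 = 150:
the estimate is exactly what the default range selectivity produces,
which suggests the statistics were not used for this estimate at all.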

So I dug to see what happened and thought that the following part
in clauselist_selectivity caused this problem.


/*
  * Now scan the rangequery pair list.
  */
 while (rqlist != NULL)
 {
 RangeQueryClause *rqnext;

 if (rqlist->have_lobound && rqlist->have_hibound)
 {
 /* Successfully matched a pair of range clauses */
 Selectivity s2;

 /*
  * Exact equality to the default value probably means the
  * selectivity function punted.  This is not airtight but should
  * be good enough.
  */
 if (rqlist->hibound == DEFAULT_INEQ_SEL ||
 rqlist->lobound == DEFAULT_INEQ_SEL)
 {
 s2 = DEFAULT_RANGE_INEQ_SEL;
 }
 else
 {
 s2 = rqlist->hibound + rqlist->lobound - 1.0;


In my environment, the selectivity for id > 0 was 0.99990000000000001,
and the selectivity for id < 10000 was 0.33333333333333331.  These values
were set as rqlist->lobound and rqlist->hibound respectively.  On the
other hand, DEFAULT_INEQ_SEL is 0.3333333333333333.  As a result, the
final selectivity was computed according to DEFAULT_RANGE_INEQ_SEL,
since the condition of the second if statement was satisfied.  However,
these selectivities were computed from the statistics, so the final
selectivity should have been calculated from rqlist->hibound +
rqlist->lobound - 1.0.
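
With the values above, that would have given roughly 0.9999 + 0.3333 - 1.0
= 0.3332, i.e. about 30000 * 0.3332 = 9997 rows, which is very close to
the actual 9999.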

My test case might be uncommon, but I think it could cause problems in
practice.  To handle such cases I came up with an idea based on a previous
commit [1] which solved a similar problem related to DEFAULT_NUM_DISTINCT.
Following that approach, I added a flag to clauselist_selectivity which
indicates whether the selectivity was calculated from the statistics or
not, and used it as the condition of the if statement instead of
rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
I am afraid the changes are larger than I had expected, but with them we
can estimate the selectivity correctly.
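
To illustrate the difference, here is a self-contained toy (not the
attached patch; apart from the two constants, all names are made up for
the example):

#include <stdio.h>
#include <stdbool.h>

#define DEFAULT_INEQ_SEL        0.3333333333333333
#define DEFAULT_RANGE_INEQ_SEL  0.005

/* Current logic: guess that the estimator punted by comparing values. */
static double
range_sel_by_sentinel(double lobound, double hibound)
{
    if (hibound == DEFAULT_INEQ_SEL || lobound == DEFAULT_INEQ_SEL)
        return DEFAULT_RANGE_INEQ_SEL;
    return hibound + lobound - 1.0;
}

/* Proposed logic: the caller states explicitly whether a bound is a default. */
static double
range_sel_by_flag(double lobound, double hibound,
                  bool lobound_is_default, bool hibound_is_default)
{
    if (hibound_is_default || lobound_is_default)
        return DEFAULT_RANGE_INEQ_SEL;
    return hibound + lobound - 1.0;
}

int
main(void)
{
    /* Values from the test case above: id > 0 and id < 10000 on 30000 rows. */
    double      lobound = 0.9999;       /* selectivity of id > 0, from stats */
    double      hibound = 1.0 / 3.0;    /* selectivity of id < 10000, from
                                         * stats; happens to equal
                                         * DEFAULT_INEQ_SEL exactly */

    printf("sentinel-based estimate: %.0f rows\n",
           30000 * range_sel_by_sentinel(lobound, hibound));
    printf("flag-based estimate:     %.0f rows\n",
           30000 * range_sel_by_flag(lobound, hibound, false, false));
    return 0;
}

The sentinel-based version reports 150 rows because the histogram-derived
hibound happens to collide with DEFAULT_INEQ_SEL, while the flag-based
version reports about 9997 rows.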

Patch attached.  Do you have any thoughts?

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4c2777d0b733220d9029f78817af8ce

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center


improve_selectivity_estimate_for_range_queries.patch
Description: Binary data


RE: Localization tools for Postgres

2018-12-20 Thread Ideriha, Takeshi
Hi
cc’ed pgsql-translators

I personally use ‘poedit’ when working on po files.


Takeshi Ideriha
Fujitsu Limited

From: Дмитрий Воронин [mailto:carriingfat...@yandex.ru]
Sent: Thursday, December 20, 2018 12:54 AM
To: pgsql-hack...@postgresql.org
Subject: Localization tools for Postgres

Hi, hackers.

Can you tell me which tools do you prefer when you update *.po translations?

Thanks!


Re: Using POPCNT and other advanced bit manipulation instructions

2018-12-20 Thread David Rowley
On Thu, 20 Dec 2018 at 20:17, Gavin Flower
 wrote:
> Looking at the normalized standard deviations, the patched results have
> a higher than 5% chance of being better simply by chance.  I suspect
> that you have made an improvement, but the statistics are not convincing.

Yeah, I'd hoped that I could have gotten a better signal to noise
ratio by running the test many times, but you're right.  That was on
my laptop.  I've run the test again on an AWS instance and the results
seem to be a bit more stable.  Same table with 1 int column and 100m
rows, statistics set to 10.

Unpatched

postgres=# analyze a;

Time: 38.248 ms
Time: 35.185 ms
Time: 35.067 ms
Time: 34.879 ms
Time: 34.816 ms
Time: 34.558 ms
Time: 34.722 ms
Time: 34.427 ms
Time: 34.214 ms
Time: 34.301 ms
Time: 35.751 ms
Time: 33.993 ms
Time: 33.880 ms
Time: 33.617 ms
Time: 33.381 ms
Time: 33.326 ms

Patched:

postgres=# analyze a;

Time: 34.421 ms
Time: 33.523 ms
Time: 33.230 ms
Time: 33.678 ms
Time: 32.987 ms
Time: 32.914 ms
Time: 33.165 ms
Time: 32.707 ms
Time: 32.645 ms
Time: 32.814 ms
Time: 32.082 ms
Time: 32.143 ms
Time: 32.310 ms
Time: 31.966 ms
Time: 31.702 ms
Time: 32.089 ms

Avg +5.72%, Median +5.29%
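
(For the record, that is roughly 34.65 ms average / 34.49 ms median
unpatched versus 32.77 ms average / 32.76 ms median patched.)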

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services