Re: Creating foreign key on partitioned table is too slow
David Rowley writes:
> In Ottawa this year, Andres and I briefly talked about the possibility
> of making a series of changes to how equalfuncs.c works. The idea was
> to make it easy by using some pre-processor magic to allow us to
> create another version of equalfuncs which would let us have an equal
> comparison function that returns -1 / 0 / +1 rather than just true or
> false. This would allow us to use Binary Search Trees of objects. I
> identified that EquivalenceClass.ec_members would be better written as
> a BST to allow much faster lookups in get_eclass_for_sort_expr().

This seems like a good idea, but why would we want to maintain two
versions? Just change equalfuncs.c into comparefuncs.c, full stop.
equal() would be a trivial wrapper for (compare_objects(a,b) == 0).

Andres' ideas about autogenerating all that boilerplate aren't bad,
but that's no justification for carrying two full sets of per-node
logic when one set would do.

			regards, tom lane
Re: tableam vs. TOAST
On Wed, Oct 30, 2019 at 9:46 PM Robert Haas wrote:
> On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu
> <prabhat.s...@enterprisedb.com> wrote:
>> While testing the Toast patch (PG + v7 patch) I found the below server
>> crash.
>> System configuration:
>> VCPUs: 4, RAM: 8GB, Storage: 320GB
>>
>> This issue is not frequently reproducible; we need to repeat the same
>> testcase multiple times.
>
> I wonder if this is an independent bug, because the backtrace doesn't
> look like it's related to the stuff this is changing. Your report
> doesn't specify whether you can also reproduce the problem without the
> patch, which is something that you should always check before reporting
> a bug in a particular patch.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi Robert,

My sincere apologies for not describing the issue in more detail. I
have run the same case against both PG HEAD and HEAD+Patch multiple
times (7, 10, 20 runs), and the issue does not occur on HEAD while the
same case is reproducible on HEAD+Patch. (Again, I am not sure from the
backtrace whether it is related to the patch or not.)

--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company
Re: [BUG] Partition creation fails after dropping a column and adding a partial index
On Tue, Oct 29, 2019 at 01:16:58PM +0900, Michael Paquier wrote:
> Yes, something looks wrong with that. I have not looked at it in
> detail yet though. I'll see about that tomorrow.

So.. When building the attribute map for a cloned index (with either
LIKE during the transformation or for partition indexes), we store each
attribute number, with 0 used for dropped columns. Unfortunately, if
you look at the way the attribute map is built in this case, the code
correctly generates the mapping with convert_tuples_by_name_map, but
the length of the mapping used is incorrect: it uses the number of
attributes of the newly-created child relation, and not that of the
parent, which includes the dropped columns in its count. So the answer
is simply to use the parent as reference for the mapping length.

The patch is rather simple, as per the attached, with extended
regression tests included. I have not checked on back-branches yet, but
that's visibly wrong since 8b08f7d down to v11 (will do that when
back-patching).

There could be a point in changing convert_tuples_by_name_map & co so
that they return the length of the map on top of the map itself, to
avoid such mistakes in the future. That's a more invasive patch not
really adapted for a back-patch, but we could do that on HEAD once this
bug is fixed. I have also checked the other callers of this API, and
their handling is done correctly.

Wyatt, what do you think?
--
Michael

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8d25d14772..5597be6e3d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1091,7 +1091,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 											RelationGetDescr(parent));
 			idxstmt =
 				generateClonedIndexStmt(NULL, idxRel,
-										attmap, RelationGetDescr(rel)->natts,
+										attmap, RelationGetDescr(parent)->natts,
 										);
 			DefineIndex(RelationGetRelid(rel),
 						idxstmt,
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index ce2484fd4e..f63016871c 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -1168,3 +1168,52 @@ insert into defcheck_def values (0, 0);
 create table defcheck_0 partition of defcheck for values in (0);
 ERROR:  updated partition constraint for default partition "defcheck_def" would be violated by some row
 drop table defcheck;
+-- tests of column drop with partition tables and indexes using
+-- predicates and expressions.
+create table part_column_drop (
+  useless_1 int,
+  id int,
+  useless_2 int,
+  d int,
+  b int,
+  useless_3 int
+) partition by range (id);
+alter table part_column_drop drop column useless_1;
+alter table part_column_drop drop column useless_2;
+alter table part_column_drop drop column useless_3;
+create index part_column_drop_b_pred on part_column_drop(b) where b = 1;
+create index part_column_drop_b_expr on part_column_drop((b = 1));
+create index part_column_drop_d_pred on part_column_drop(d) where d = 2;
+create index part_column_drop_d_expr on part_column_drop((d = 2));
+create table part_column_drop_1_10 partition of
+  part_column_drop for values from (1) to (10);
+\d part_column_drop
+      Partitioned table "public.part_column_drop"
+ Column |  Type   | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id     | integer |           |          |
+ d      | integer |           |          |
+ b      | integer |           |          |
+Partition key: RANGE (id)
+Indexes:
+    "part_column_drop_b_expr" btree ((b = 1))
+    "part_column_drop_b_pred" btree (b) WHERE b = 1
+    "part_column_drop_d_expr" btree ((d = 2))
+    "part_column_drop_d_pred" btree (d) WHERE d = 2
+Number of partitions: 1 (Use \d+ to list them.)
+
+\d part_column_drop_1_10
+       Table "public.part_column_drop_1_10"
+ Column |  Type   | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id     | integer |           |          |
+ d      | integer |           |          |
+ b      | integer |           |          |
+Partition of: part_column_drop FOR VALUES FROM (1) TO (10)
+Indexes:
+    "part_column_drop_1_10_b_idx" btree (b) WHERE b = 1
+    "part_column_drop_1_10_d_idx" btree (d) WHERE d = 2
+    "part_column_drop_1_10_expr_idx" btree ((b = 1))
+    "part_column_drop_1_10_expr_idx1" btree ((d = 2))
+
+drop table part_column_drop;
diff --git a/src/test/regress/sql/create_table.sql b/src/test/regress/sql/create_table.sql
index 144eeb480d..e835b65ac4 100644
--- a/src/test/regress/sql/create_table.sql
+++ b/src/test/regress/sql/create_table.sql
@@ -903,3 +903,26 @@ create table defcheck_1 partition of defcheck for values in (1, null);
 insert into defcheck_def values (0, 0);
 create table defcheck_0 partition of defcheck for values in (0);
 drop table defcheck;
+
+-- tests of column drop with partition tables and indexes using
+--
Re: Allow CREATE OR REPLACE VIEW to rename the columns
Fujii Masao writes:
> Currently CREATE OR REPLACE VIEW command fails if the column names
> are changed.

That is, I believe, intentional. It's an effective aid to catching
mistakes in view redefinitions, such as misaligning the new set of
columns relative to the old. That's particularly important given that
we allow you to add columns during CREATE OR REPLACE VIEW. Consider the
oversimplified case where you start with

	CREATE VIEW v AS SELECT 1 AS x, 2 AS y;

and you want to add a column z, and you get sloppy and write

	CREATE OR REPLACE VIEW v AS SELECT 1 AS x, 3 AS z, 2 AS y;

If we did not throw an error on this, references that formerly pointed
to column y would now point to z (as that's still attnum 2), which is
highly unlikely to be what you wanted.

The right way to handle a column rename in a view is to do a separate
ALTER VIEW RENAME COLUMN, making it totally clear what you intend to
happen. (Right now, we make you spell that "ALTER TABLE RENAME COLUMN",
but it'd be reasonable to add that syntax to ALTER VIEW too.)

I don't think this functionality should be folded into redefinition of
the content of the view. It'd add more opportunity for mistakes than
anything else.

			regards, tom lane
Re: Allow cluster_name in log_line_prefix
> Hi folks
>
> I was recently surprised to notice that log_line_prefix doesn't support
> a cluster_name placeholder. I suggest adding one. If I don't hear
> objections I'll send a patch.

I think it'd be a good thing for users.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: Add SQL function to show total block numbers in the relation
On Wed, Oct 30, 2019 at 10:09:47AM -0400, Tom Lane wrote:
> btkimurayuzk writes:
>> I propose new simple sql query, which shows total block numbers in the
>> relation.
>> ...
>> Of course, we can know this value such as
>> select (pg_relation_size('t') /
>>     current_setting('block_size')::bigint)::int;
>
> I don't really see why the existing solution isn't sufficient.

+1.
--
Michael

signature.asc
Description: PGP signature
Re: Allow cluster_name in log_line_prefix
On Mon, Oct 28, 2019 at 3:33 PM Craig Ringer wrote:
> I was recently surprised to notice that log_line_prefix doesn't support
> a cluster_name placeholder. I suggest adding one. If I don't hear
> objections I'll send a patch.

+1
Allow CREATE OR REPLACE VIEW to rename the columns
Hi,

Currently CREATE OR REPLACE VIEW command fails if the column names
are changed. For example,

    =# CREATE VIEW test AS SELECT 0 AS a;
    =# CREATE OR REPLACE VIEW test AS SELECT 0 AS x;
    ERROR:  cannot change name of view column "a" to "x"

I'd like to propose the attached patch that allows CREATE OR REPLACE
VIEW to rename the view columns. Thoughts?

Regards,

--
Fujii Masao

rename_view_column_v1.patch
Description: Binary data
Re: Problem with synchronous replication
On Wed, Oct 30, 2019 at 05:43:04PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 30 Oct 2019 17:21:17 +0900, Fujii Masao wrote in
>> This change causes every ending backend to always take the exclusive
>> lock even when it's not in the SyncRep queue. This may be problematic,
>> for example, when terminating multiple backends at the same time? If
>> yes, it might be better to check SHMQueueIsDetached() again after
>> taking the lock. That is,
>
> I'm not sure how much that harms but double-checked locking
> (releasing) is simple enough for reducing possible congestion here, I
> think.

FWIW, I could not measure any actual difference with pgbench -C, up to
500 sessions, an empty input file (just one meta-command) and -c 20.

I have added some comments in SyncRepCleanupAtProcExit(), and adjusted
the patch with the suggestion from Fujii-san. Any comments?
--
Michael

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 16aee1de4c..f72b8a75f3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -149,6 +149,13 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	const char *old_status;
 	int			mode;
 
+	/*
+	 * This should be called while holding interrupts during a transaction
+	 * commit to prevent the follow-up shared memory queue cleanups to be
+	 * influenced by external interruptions.
+	 */
+	Assert(InterruptHoldoffCount > 0);
+
 	/* Cap the level for anything other than commit to remote flush only. */
 	if (commit)
 		mode = SyncRepWaitMode;
@@ -361,10 +368,18 @@ SyncRepCancelWait(void)
 void
 SyncRepCleanupAtProcExit(void)
 {
+	/*
+	 * First check if we are removed from the queue without the lock to
+	 * not slow down backend exit.
+	 */
 	if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
 	{
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
-		SHMQueueDelete(&(MyProc->syncRepLinks));
+
+		/* maybe we have just been removed, so recheck */
+		if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
+			SHMQueueDelete(&(MyProc->syncRepLinks));
+
 		LWLockRelease(SyncRepLock);
 	}
 }
@@ -508,6 +523,8 @@ SyncRepReleaseWaiters(void)
 /*
  * Calculate the synced Write, Flush and Apply positions among sync standbys.
  *
+ * The caller must hold SyncRepLock.
+ *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
  * store the positions into *writePtr, *flushPtr and *applyPtr.
@@ -521,6 +538,8 @@ SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 {
 	List	   *sync_standbys;
 
+	Assert(LWLockHeldByMe(SyncRepLock));
+
 	*writePtr = InvalidXLogRecPtr;
 	*flushPtr = InvalidXLogRecPtr;
 	*applyPtr = InvalidXLogRecPtr;
@@ -680,6 +699,8 @@ cmp_lsn(const void *a, const void *b)
 List *
 SyncRepGetSyncStandbys(bool *am_sync)
 {
+	Assert(LWLockHeldByMe(SyncRepLock));
+
 	/* Set default result */
 	if (am_sync != NULL)
 		*am_sync = false;
@@ -984,7 +1005,7 @@ SyncRepGetStandbyPriority(void)
  * Pass all = true to wake whole queue; otherwise, just wake up to
  * the walsender's LSN.
  *
- * Must hold SyncRepLock.
+ * The caller must hold SyncRepLock in exclusive mode.
  */
 static int
 SyncRepWakeQueue(bool all, int mode)
@@ -995,6 +1016,7 @@ SyncRepWakeQueue(bool all, int mode)
 	int			numprocs = 0;
 
 	Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
+	Assert(LWLockHeldByMeInMode(SyncRepLock, LW_EXCLUSIVE));
 	Assert(SyncRepQueueIsOrderedByLSN(mode));
 
 	proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),

signature.asc
Description: PGP signature
Re: Problem with synchronous replication
On Wed, Oct 30, 2019 at 12:34:28PM +0900, Kyotaro Horiguchi wrote:
> If we do that strictly, other functions like
> SyncRepGetOldestSyncRecPtr need the same Assert()s. I think static
> functions don't need Assert() and caution in their comments would be
> enough.

Perhaps. I'd rather be careful though if we meddle again with this code
in the future, as it is shared across multiple places and callers.

> SyncRepReleaseWaiters reads MyWalSnd->sync_standby_priority without
> holding SyncRepLock, which could lead to a message with wrong
> priority. I'm not sure it matters, though.

The WAL sender is the only writer of its info in shared memory, so
there is no problem with having it read that data without holding its
spinlock.

> Seems reasonable for holdoffs. The same assertion would be needed in
> more places but it's another issue.

Sure.

> By the way, while I was looking at this, I found a typo. Please find
> the attached.

Thanks, applied that one.
--
Michael

signature.asc
Description: PGP signature
Re: Creating foreign key on partitioned table is too slow
Hi,

On 2019-10-31 11:19:05 +1300, David Rowley wrote:
> In Ottawa this year, Andres and I briefly talked about the possibility
> of making a series of changes to how equalfuncs.c works. The idea was
> to make it easy by using some pre-processor magic to allow us to
> create another version of equalfuncs which would let us have an equal
> comparison function that returns -1 / 0 / +1 rather than just true or
> false.

See also the thread at
https://www.postgresql.org/message-id/20190920051857.2fhnvhvx4qdddviz%40alap3.anarazel.de
which would make this fairly easy, without having to compile equalfuncs
twice or such.

Greetings,

Andres Freund
Re: Zedstore - compressed in-core columnar storage
Alex Wang and I have been doing some performance analysis of the most
recent version of the zedstore branch, and have some interesting
statistics to share. We specifically focused on TPC-DS query 2, because
it plays to what should be the strength of zedstore -- namely, it does
a full table scan of only a subset of columns. I've attached the
explain verbose output for reference. We scan two columns of
'catalog_sales', and two columns of 'web_sales'.

->  Parallel Append
      ->  Parallel Seq Scan on tpcds.catalog_sales
            Output: catalog_sales.cs_ext_sales_price, catalog_sales.cs_sold_date_sk
      ->  Parallel Seq Scan on tpcds.web_sales
            Output: web_sales.ws_ext_sales_price, web_sales.ws_sold_date_sk

For heap, it needs to do a full table scan of both tables, and we need
to read the entire table into memory. For our dataset, that totals
around 119GB of data.

***HEAP***
tpcds=# select pg_size_pretty(pg_relation_size('web_sales'));
 pg_size_pretty
----------------
 39 GB
(1 row)

tpcds=# select pg_size_pretty(pg_relation_size('catalog_sales'));
 pg_size_pretty
----------------
 80 GB
(1 row)
***/HEAP***

With Zedstore the total relation size is smaller because of
compression. When scanning the table, we only scan the blocks with data
we are interested in, and leave the rest alone. So the total size we
need to scan for these tables totals around 4GB.

***ZEDSTORE***
zedstore=# select pg_size_pretty(pg_relation_size('web_sales'));
 pg_size_pretty
----------------
 20 GB
(1 row)

zedstore=# select pg_size_pretty(pg_relation_size('catalog_sales'));
 pg_size_pretty
----------------
 40 GB
(1 row)

zedstore=# with zedstore_tables as (select d.oid, f.*
zedstore(#   from (select c.oid
zedstore(#           from pg_am am
zedstore(#           join pg_class c on (c.relam = am.oid)
zedstore(#          where am.amname = 'zedstore') d,
zedstore(#        pg_zs_btree_pages(d.oid) f)
zedstore-# select zs.attno, att.attname, zs.oid::regclass, count(zs.attno) as pages,
zedstore-#        pg_size_pretty(count(zs.attno) * 8 * 1024) from zedstore_tables zs
zedstore-# left join pg_attribute att on zs.attno = att.attnum
zedstore-#   and zs.oid = att.attrelid
zedstore-# where zs.oid in ('catalog_sales'::regclass, 'web_sales'::regclass)
zedstore-#   and (att.attname in ('cs_ext_sales_price','cs_sold_date_sk','ws_ext_sales_price','ws_sold_date_sk')
zedstore(#        or zs.attno = 0)
zedstore-# group by zs.attno, att.attname, zs.oid
zedstore-# order by zs.oid, zs.attno;
 attno |      attname       |      oid      | pages  | pg_size_pretty
-------+--------------------+---------------+--------+----------------
     0 |                    | catalog_sales |  39549 | 309 MB
     1 | cs_sold_date_sk    | catalog_sales |   2441 | 19 MB
    24 | cs_ext_sales_price | catalog_sales | 289158 | 2259 MB
     0 |                    | web_sales     |  20013 | 156 MB
     1 | ws_sold_date_sk    | web_sales     |  17578 | 137 MB
    24 | ws_ext_sales_price | web_sales     | 144860 | 1132 MB
***/ZEDSTORE***

On our test machine, our tables were stored on a single spinning disk,
so our read speed was pretty abysmal with this query. This query is I/O
bound for us, so read speed was the single largest factor. With heap,
the tables are scanned sequentially, and therefore we can scan around
150MB of table data per second:

***HEAP***
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.54    0.00    1.85   11.62    0.00   77.98

Device   r/s      w/s   rkB/s      wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdd      1685.33  0.00  157069.33  0.00   18.67   0.00    1.10   0.00   1.56     0.00     2.62    93.20     0.00      0.59   100.00

Device   r/s      w/s   rkB/s      wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdd      1655.33  0.00  154910.67  0.00   21.33   0.00    1.27   0.00   1.62     0.00     2.68    93.58     0.00      0.60   100.13

Device   r/s      w/s   rkB/s      wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdd      1746.33  0.00  155121.33  0.00   28.00   0.00    1.58   0.00   1.48     0.00     2.61    88.83     0.00      0.57   100.00
***/HEAP***

Because zedstore access resembled random I/O, the read speed was
significantly hindered on our single disk. As a result, we saw ~150x
slower read speeds.

***ZEDSTORE***
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.24    0.00    1.22    6.34    0.00   86.20

Device   r/s     w/s   rkB/s    wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdb      129.33  0.00  1034.67  0.00   0.00    0.00    0.00   0.00   15.89    0.00     2.05    8.00      0.00      7.67   99.20

Device   r/s     w/s   rkB/s    wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdb      120.67
Re: Creating foreign key on partitioned table is too slow
On Thu, 31 Oct 2019 at 07:30, Tomas Vondra wrote:
> On Thu, Oct 24, 2019 at 04:28:38PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2019-10-23 05:59:01 +0000, kato-...@fujitsu.com wrote:
>>> To benchmark with the tpcb model, I tried to create a foreign key on
>>> the partitioned history table, but the backend process was killed by
>>> OOM. The number of partitions is 8192. I tried on master (commit:
>>> ad4b7aeb84).
>>
>> Obviously this should be improved. But I think it's also worthwhile to
>> note that using 8k partitions is very unlikely to be a good choice for
>> anything. The metadata, partition pruning, etc. overhead is just going
>> to be very substantial.
>
> True. Especially with two partitioned tables, each with 8k partitions.

In Ottawa this year, Andres and I briefly talked about the possibility
of making a series of changes to how equalfuncs.c works. The idea was
to make it easy, by using some pre-processor magic, to create another
version of equalfuncs which would let us have an equal comparison
function that returns -1 / 0 / +1 rather than just true or false. This
would allow us to use Binary Search Trees of objects. I identified that
EquivalenceClass.ec_members would be better written as a BST to allow
much faster lookups in get_eclass_for_sort_expr().

The implementation I had in mind for the BST was a compact tree that,
instead of using pointers for the left and right children, just uses an
integer to reference the array element number. This would allow us to
maintain very fast insert-order traversals. Deletes would need to
decrement all child references greater than the deleted index. This is
sort of on par with how the new List implementation in master works,
i.e. deletes take additional effort, but inserts are fast if there's
enough space in the array for a new element, traversals are
cache-friendly, etc.
I think trees might be better than hash tables for this, as a hash
function needs to hash all fields, whereas a comparison function can
stop when it finds the first non-match.

This may also be able to help simplify the code in setrefs.c to get rid
of the complex code around indexed tlists. tlist_member() would become
O(log n) instead of O(n), so perhaps there'd be not much point in
having both search_indexed_tlist_for_var() and
search_indexed_tlist_for_non_var().
--
David Rowley            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Wed, Oct 30, 2019 at 1:25 PM Andres Freund wrote:
> I assume you mean that the index would dynamically recognize when it
> needs the wider tids ("for the higher portion")? If so, yea, that makes
> sense to me. Would that need to encode the 6/8byteness of a tid on a
> per-element basis? Or are you thinking of being able to upconvert in a
> general way?

I think that index access methods should continue to control the layout
of TIDs/item pointers, while mostly just treating them as fixed-width
integers. Maybe you just have 6 and 8 byte structs, or maybe you also
have a 16 byte struct, but it isn't really variable length to the index
AM (it's more like a union). It becomes the index AM's responsibility
to remember which TID width applies at a per-tuple level (or per
posting list level, etc.). In general, index AMs have to "supply their
own status bits", which should be fine (the alternative is that they
refuse to support anything other than the traditional 6 byte TID
format).

Table AMs don't get to supply their own operator class for sorting
these integers -- they had better be happy with the TID sort order that
is baked in (ascending int order) in the context of index scans that
return duplicates, and things like that. There is probably generic
infrastructure for up-converting, too, but the index AM is
fundamentally in the driving seat with this design.

> If we had variable width tids varying by more than 2 bytes, it might be
> reasonable for cases like this to store all tids padded to the width of
> the widest tid. I think that'd still be pretty OK, because you'd only
> pay the price if you actually have long tids, and only on pages where
> such tids are referenced. Obviously that means that such a posting list
> could grow more than by the size of the inserted tid upon insertion,
> but that might still be OK? That'd require storing the width of the
> posting list elements somewhere, unfortunately - not sure if that's a
> problem?

My solution is to require the index AM to look after itself. The index
AM is responsible for not mixing them up. For nbtree with
deduplication, this means that having different width TIDs makes it
impossible to merge together posting lists/tuples. For GIN, this means
that we can expand the width of the TIDs in the posting list to match
the widest TID. We can then make it into a posting tree if necessary --
GIN has the benefit of always being able to fall back on the option of
making a new posting tree (unlike nbtree with deduplication). GIN's
B-Tree is very primitive in some ways (no deletion of items in the
entry tree, no deletion of pages in the entry tree), which gives it the
ability to safely fall back on creating a new posting tree when it runs
out of space.

> ISTM if we had varint style tids, we'd probably still save space on
> average for today's heap that way. How important is it for you to be
> able to split out the "block offset" and "page offset" bits?

Pretty important. The nbtree deduplication patch is very compelling
because it almost offers the best of both worlds -- the concurrency
characteristics of today's nbtree, combined with very much improved
space efficiency. Keeping the space accounting as simple as possible
seems like a big part of why this is possible at all. There is only one
new type of WAL record required for deduplication, and they're pretty
small. (Existing WAL records are changed to support posting list
splits, but these are small, low-overhead changes.)

> I'm somewhat inclined to think that tids should just be a varint (or
> maybe two, if we want to make it slightly simpler to keep compat to how
> they look today), and that the AM internally makes sense of that.

I am opposed to adding anything that is truly varint. The index access
method ought to be able to have fine control over the layout, without
being burdened by an overly complicated TID representation.

>> I can support varwidth TIDs in the future pretty well if the TID
>> doesn't have to be *arbitrarily* wide.
>
> I think it'd be perfectly reasonable to have a relatively low upper
> limit for tid width. Not 8 bytes, but also not 200 bytes.

My point is that we should find a way to make TIDs continue to be an
array of fixed width integers in any given context. Lots of index
access method code can be ported in a relatively straightforward
fashion this way. This has some downsides, but I believe that they're
worth it.

>> Individual posting lists can themselves either use 6 byte or 8 byte
>> TIDs, preserving the ability to access a posting list entry at random
>> using simple pointer arithmetic. This makes converting over index AMs
>> a lot less painful -- it'll be pretty easy to avoid mixing together
>> the 6 byte and 8 byte structs.
>
> With varint style tids as I suggested, that ought to be fairly simple?

nbtree probably won't be able to tolerate having to widen every TID in
the posting list all at once when new tuples are inserted that have
TIDs that are one byte wider,
Re: v12.0: ERROR: could not find pathkey item to sort
On Thu, 31 Oct 2019 at 05:09, Tom Lane wrote:
> David --- much of the complexity here comes from the addition of
> the eclass_indexes infrastructure, so do you have any thoughts?

Hindsight makes me think I should have mentioned in the comment for
eclass_indexes that it's only used for simple rels and remains NULL for
anything else.

All the code in equivclass.c uses get_common_eclass_indexes() and
get_eclass_indexes_for_relids(), which go down to the simple-rel level
to obtain their eclass_indexes. When calling
get_eclass_indexes_for_relids() we'll build a union Bitmapset with the
indexes from each simple rel that the join rel is made from. We only
ever directly use the eclass_indexes field when we're certain we're
dealing with a simple rel already; get_eclass_indexes_for_relids()
would do the same job there, but using the field directly saves a bit
of needless effort and memory allocations.

So, in short, I don't really see why we need to set eclass_indexes for
anything other than simple rels.
--
David Rowley            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Hi,

On 2019-10-30 12:37:50 -0700, Peter Geoghegan wrote:
> On Wed, Oct 30, 2019 at 12:03 PM Andres Freund wrote:
>> I'd much rather not entrench this further, even leaving global indexes
>> aside. The 4 byte block number is a significant limitation for heap
>> tables too, and we should lift that at some point not too far away.
>> Then there's also other AMs that could really use a wider tid space.
>
> I agree that that limitation is a problem that should be fixed before
> too long. But the solution probably shouldn't be a radical departure
> from what we have today. The vast majority of tables are not affected
> by the TID space limitation. Those tables that are can tolerate
> supporting fixed width "long" TIDs (perhaps 8 bytes long?) that are
> used for the higher portion of the heap TID space alone.
>
> The idea here is that TID is varwidth, but actually uses the existing
> heap TID format most of the time. For larger tables it uses a wider
> fixed width struct that largely works the same as the old 6 byte
> struct.

I assume you mean that the index would dynamically recognize when it
needs the wider tids ("for the higher portion")? If so, yea, that makes
sense to me. Would that need to encode the 6/8byteness of a tid on a
per-element basis? Or are you thinking of being able to upconvert in a
general way?

>>> Though I suppose a posting list almost has to have fixed width TIDs
>>> to perform acceptably.
>>
>> Hm. It's not clear to me why that is?
>
> Random access matters for things like determining the correct offset
> to split a posting list at. This is needed in the event of an
> overlapping insertion of a new duplicate tuple whose heap TID falls
> within the range of the posting list. Also, I want to be able to scan
> posting lists backwards for backwards scans. In general, fixed-width
> TIDs make the page space accounting fairly simple, which matters a lot
> in nbtree.
If we had variable width tids varying by more than 2 bytes, it might be
reasonable for cases like this to store all tids padded to the width of
the widest tid. I think that'd still be pretty OK, because you'd only
pay the price if you actually have long tids, and only on pages where
such tids are referenced. Obviously that means that such a posting list
could grow more than by the size of the inserted tid upon insertion,
but that might still be OK? That'd require storing the width of the
posting list elements somewhere, unfortunately - not sure if that's a
problem?

ISTM if we had varint style tids, we'd probably still save space on
average for today's heap that way. How important is it for you to be
able to split out the "block offset" and "page offset" bits?

I'm somewhat inclined to think that tids should just be a varint (or
maybe two, if we want to make it slightly simpler to keep compat to how
they look today), and that the AM internally makes sense of that.

> I can support varwidth TIDs in the future pretty well if the TID
> doesn't have to be *arbitrarily* wide.

I think it'd be perfectly reasonable to have a relatively low upper
limit for tid width. Not 8 bytes, but also not 200 bytes.

> Individual posting lists can themselves either use 6 byte or 8 byte
> TIDs, preserving the ability to access a posting list entry at random
> using simple pointer arithmetic. This makes converting over index AMs
> a lot less painful -- it'll be pretty easy to avoid mixing together
> the 6 byte and 8 byte structs.

With varint style tids as I suggested, that ought to be fairly simple?

> I don't think "indirect" indexes are a realistic goal for
> Postgres. VACUUM is just too messy there (as is any other garbage
> collection mechanism). Zedstore and Zheap don't change this.

Hm, I don't think there's a fundamental problem, but let's leave that
aside, there's enough other reasons to improve this.

Greetings,

Andres Freund
pgstat.c has brittle response to transient problems
While fooling with the NetBSD-vs-libpython issue noted in a nearby thread, I observed that the core regression tests sometimes hang up in the "stats" test on this platform (NetBSD 8.1/amd64). Investigation found that the stats collector process was sometimes exiting like this:

2019-10-29 19:38:14.563 EDT [7018] FATAL:  could not read statistics message: No buffer space available
2019-10-29 19:38:14.563 EDT [7932] LOG:  statistics collector process (PID 7018) exited with exit code 1

The postmaster then restarts the collector, but possibly with a time delay (see the PGSTAT_RESTART_INTERVAL respawn throttling logic). This seems to interact badly with the wait-for-stats-collector logic in backend_read_statsfile, so that each cycle of the wait_for_stats() test function takes a long time ... and we will do 300 of those unconditionally. (Possibly wait_for_stats ought to be modified so that it pays attention to elapsed wall-clock time rather than iterating for a fixed number of times?)

NetBSD's recv() man page glosses ENOBUFS as "A message was not delivered because it would have overflowed the buffer", but I don't believe that's actually what's happening. (Just to be sure, I added an Assert on the sending side that no message exceeds sizeof(PgStat_Msg). I wonder why we didn't have one already.) Trawling the NetBSD kernel code, it seems like ENOBUFS could get returned as a result of transient shortages of kernel working memory --- most of the uses of that error code seem to be on the sending side, but I found some that seem to be in receiving code.

In short: it's evidently possible to get ENOBUFS as a transient failure condition on this platform, and having the stats collector die seems like an overreaction. I'm inclined to have it log the error and press on, instead. Looking at the POSIX spec for recv() suggests that ENOMEM is another plausible transient failure, so maybe we should do the same with that.
Roughly:

    if (len < 0)
    {
+       /* silently ignore these cases (no data available) */
        if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
            break;              /* out of inner loop */

+       /* noisily ignore these cases (soft errors) */
+       if (errno == ENOBUFS || errno == ENOMEM)
+       {
+           ereport(LOG,
+                   (errcode_for_socket_access(),
+                    errmsg("could not read statistics message: %m")));
+           break;              /* out of inner loop */
+       }
+
+       /* hard failure, but maybe hara-kiri will fix it */
        ereport(ERROR,
                (errcode_for_socket_access(),
                 errmsg("could not read statistics message: %m")));
    }

A variant idea is to treat all unexpected errnos as LOG-and-push-on, but maybe we ought to have a limit on how many times we'll do that before erroring out.

Thoughts?

regards, tom lane
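The capped LOG-and-push-on variant could amount to little more than a consecutive-failure counter that a successful read resets (a standalone sketch of the escalation logic only; the threshold and function names are invented, not actual pgstat.c code):

```c
#include <assert.h>
#include <stdbool.h>

#define PGSTAT_MAX_SOFT_ERRORS 100  /* invented threshold for consecutive soft errors */

static int soft_error_streak = 0;

/*
 * Called after a recoverable recv() failure.  Returns true if the caller
 * should LOG and keep going, false once too many failures occur in a row
 * (at which point the caller would escalate to ERROR).
 */
static bool
pgstat_tolerate_error(void)
{
    return ++soft_error_streak < PGSTAT_MAX_SOFT_ERRORS;
}

/* Called after any successful read: a working socket clears the streak. */
static void
pgstat_note_success(void)
{
    soft_error_streak = 0;
}
```

This keeps the collector alive through transient ENOBUFS/ENOMEM storms while still dying (and getting respawned) if the condition never clears.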
Re: MarkBufferDirtyHint() and LSN update
On Wed, Oct 30, 2019 at 02:44:18PM +0100, Antonin Houska wrote:
>Please consider this scenario (race conditions):
>
>1. FlushBuffer() has written the buffer but hasn't yet managed to clear
>   the BM_DIRTY flag (however BM_JUST_DIRTIED could be cleared by now).
>
>2. Another backend modified a hint bit and called MarkBufferDirtyHint().
>
>3. In MarkBufferDirtyHint(), if XLogHintBitIsNeeded() evaluates to true
>   (e.g. due to checksums enabled), a new LSN is computed, however it's
>   not assigned to the page because the buffer is still dirty:
>
>	if (!(buf_state & BM_DIRTY))
>	{
>		...
>		if (!XLogRecPtrIsInvalid(lsn))
>			PageSetLSN(page, lsn);
>	}
>
>4. MarkBufferDirtyHint() completes.
>
>5. In the first session, FlushBuffer()->TerminateBufferIO() will not
>   clear BM_DIRTY because MarkBufferDirtyHint() has eventually set
>   BM_JUST_DIRTIED. Thus the hint bit change itself will be written by
>   the next call of FlushBuffer(). However the page LSN hasn't been
>   updated, so the requirement that WAL must be flushed first is not met.
>
>I think that PageSetLSN() should be called regardless of BM_DIRTY. Do I
>miss any subtle detail?

Isn't this prevented by locking of the buffer header? Both FlushBuffer and MarkBufferDirtyHint do obtain that lock. I see MarkBufferDirtyHint does a bit of work before, but that's related to BM_PERMANENT.

If there really is a race condition, it shouldn't be that difficult to trigger it by adding a sleep or a breakpoint, I guess.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Wed, Oct 30, 2019 at 12:03 PM Andres Freund wrote: > I'd much rather not entrench this further, even leaving global indexes > aside. The 4 byte block number is a significant limitation for heap > tables too, and we should lift that at some point not too far away. > Then there's also other AMs that could really use a wider tid space. I agree that that limitation is a problem that should be fixed before too long. But the solution probably shouldn't be a radical departure from what we have today. The vast majority of tables are not affected by the TID space limitation. Those tables that are can tolerate supporting fixed width "long" TIDs (perhaps 8 bytes long?) that are used for the higher portion of the heap TID space alone. The idea here is that TID is varwidth, but actually uses the existing heap TID format most of the time. For larger tables it uses a wider fixed width struct that largely works the same as the old 6 byte struct. > > Though I suppose a posting list almost has to have fixed width TIDs to > > perform acceptably. > > Hm. It's not clear to me why that is? Random access matters for things like determining the correct offset to split a posting list at. This is needed in the event of an overlapping insertion of a new duplicate tuple whose heap TID falls within the range of the posting list. Also, I want to be able to scan posting lists backwards for backwards scans. In general, fixed-width TIDs make the page space accounting fairly simple, which matters a lot in nbtree. I can support varwidth TIDs in the future pretty well if the TID doesn't have to be *arbitrarily* wide. Individual posting lists can themselves either use 6 byte or 8 byte TIDs, preserving the ability to access a posting list entry at random using simple pointer arithmetic. This makes converting over index AMs a lot less painful -- it'll be pretty easy to avoid mixing together the 6 byte and 8 byte structs. 
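The two fixed widths described above might look like the following standalone sketch (invented struct names and field split, not the actual ItemPointerData definition): the wide form keeps the 16-bit offset and simply extends the block number, so a classic tid upconverts losslessly.

```c
#include <assert.h>
#include <stdint.h>

/* Today's 6-byte heap tid: 32-bit block number, 16-bit offset. */
typedef struct ItemPointer6
{
    uint16_t blk_hi;
    uint16_t blk_lo;
    uint16_t offset;
} ItemPointer6;

/* A hypothetical 8-byte "long" tid: 48-bit block number, 16-bit offset. */
typedef struct ItemPointer8
{
    uint16_t blk_hi;
    uint16_t blk_mid;
    uint16_t blk_lo;
    uint16_t offset;
} ItemPointer8;

/* Upconvert a classic tid into the wide format without losing anything. */
static ItemPointer8
tid_widen(ItemPointer6 t)
{
    ItemPointer8 w = {0, t.blk_hi, t.blk_lo, t.offset};

    return w;
}
```

Because both forms are fixed width, a posting list can commit to one of them up front and keep using plain pointer arithmetic for random access.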
> > Can we steal some bits that are currently used for offset number
> > instead? 16 bits is far more than we ever need to use for heap offset
> > numbers in practice.
>
> I think that's a terrible idea. For one, some AMs will have significantly
> higher limits, especially taking compression and larger block sizes into
> account. Also not all AMs need identifiers tied so closely to a disk
> position, e.g. zedstore does not. We shouldn't hack evermore
> information into the offset, given that background.

Fair enough, but somebody needs to cut some scope here.

> Having to walk through the index tuple might be acceptable - in all
> likelihood we'll have to do so anyway. It does however not *really*
> resolve the issue that we still need to pass something tid-like back
> from the indexam, so we can fetch the associated tuple from the heap,
> or add the tid to a bitmap. But that could be done separately from the
> index internal data structures.

I agree.

> > Generalizing the nbtree AM to be able to work with an arbitrary type
> > of table row identifier that isn't at all like a TID raises tricky
> > definitional questions.
>
> Hm. I don't see why different types of TID would imply them being
> stable?

It is unclear what it means. I would like to see a sketch of a design for varwidth TIDs that balances everybody's concerns.

I don't think "indirect" indexes are a realistic goal for Postgres. VACUUM is just too messy there (as is any other garbage collection mechanism). Zedstore and Zheap don't change this.

> > Frankly I am not very enthusiastic about working on a project that has
> > unclear scope and unclear benefits for users.
>
> Why would properly supporting AMs like zedstore, global indexes,
> "indirect" indexes etc benefit users?

Global indexes seem doable. I don't see how "indirect" indexes can ever work in Postgres. I don't know exactly what zedstore needs here, but maybe it can work well with a less ambitious design for varwidth TIDs along the lines I've sketched.
-- Peter Geoghegan
Re: Parallel leader process info in EXPLAIN
On Wed, Oct 30, 2019 at 10:39:04AM -0700, Peter Geoghegan wrote: On Wed, Oct 30, 2019 at 9:24 AM Melanie Plageman wrote: Checked out the patches a bit and noticed that the tuplesort instrumentation uses spaceUsed and I saw this comment in tuplesort_get_stats() might it be worth trying out the memory accounting API 5dd7fc1519461548eebf26c33eac6878ea3e8788 here? I made exactly the same suggestion several years back, not long after the API was first proposed by Jeff. However, tuplesort.c has changed a lot since that time, to the extent that that comment now seems kind of obsolete. These days, availMem accounting still isn't used at all for disk sorts. Rather, the slab allocator is used. Virtually all the memory used to merge is now managed by logtape.c, with only fixed per-tape memory buffers within tuplesort.c. This per-tape stuff is tiny anyway -- slightly more than 1KiB per tape. It would now be relatively straightforward to report the memory used by disk-based sorts, without needing to use the memory accounting API. I imagine that it would report the high watermark memory usage during the final merge. It's possible for this to be lower than the high watermark during initial run generation (e.g. because of the MaxAllocSize limit in buffer size within logtape.c), but that still seems like the most useful figure to users. There'd be a new "LogicalTapeSetMemory()" function to go along with the existing LogicalTapeSetBlocks() function, or something along those lines. Not planning to work on this now, but perhaps you have time for it. Another thing worth mentioning is that the memory accounting API does nothing about the pfree() calls, mentioned in the comment. The memory is tracked at the block level, so unless the pfree() frees the whole block (which only really happens for oversized chunks) the accounting info does not change. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
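The block-level behaviour Tomas describes can be shown with a toy model (a standalone illustration, not the actual mcxt.c logic): freeing a chunk inside a block leaves the accounted total unchanged, and only releasing a whole block back to the OS moves it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: the accounting counts blocks obtained from malloc, not chunks. */
typedef struct ToyContext
{
    size_t mem_allocated;   /* what a block-level accounting API would report */
} ToyContext;

static void
toy_alloc_block(ToyContext *cxt, size_t blksize)
{
    cxt->mem_allocated += blksize;  /* a newly malloc'd block is counted... */
}

static void
toy_pfree_chunk(ToyContext *cxt)
{
    /* ...but freeing a chunk merely returns it to the block's freelist */
    (void) cxt;
}

static void
toy_free_block(ToyContext *cxt, size_t blksize)
{
    cxt->mem_allocated -= blksize;  /* only releasing a whole block changes the count */
}
```

So the pfree()s mentioned in the tuplesort.c comment would indeed be invisible to the accounting unless they emptied an entire block (which in practice mostly happens for oversized chunks).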
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Hi,

On 2019-10-30 11:33:21 -0700, Peter Geoghegan wrote:
> On Mon, Apr 22, 2019 at 9:35 AM Andres Freund wrote:
> > On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote:
> > > Andres has suggested that I work on teaching nbtree to accommodate
> > > variable-width, logical table identifiers, such as those required for
> > > indirect indexes, or clustered indexes, where secondary indexes must
> > > use a logical primary key value instead of a heap TID.
>
> I'm revisiting this thread now because it may have relevance to the
> nbtree deduplication patch. If nothing else, the patch further commits
> us to the current heap TID format by making assumptions about the
> width of posting lists with 6 byte TIDs.

I'd much rather not entrench this further, even leaving global indexes aside. The 4 byte block number is a significant limitation for heap tables too, and we should lift that at some point not too far away. Then there's also other AMs that could really use a wider tid space.

> Though I suppose a posting list almost has to have fixed width TIDs to
> perform acceptably.

Hm. It's not clear to me why that is?

> > I think it's two more cases:
> >
> > - table AMs that want to support tables that are bigger than 32TB. That
> >   used to be unrealistic, but it's not anymore. Especially when the need
> >   to VACUUM etc is largely removed / reduced.
>
> Can we steal some bits that are currently used for offset number
> instead? 16 bits is far more than we ever need to use for heap offset
> numbers in practice.

I think that's a terrible idea. For one, some AMs will have significantly higher limits, especially taking compression and larger block sizes into account. Also not all AMs need identifiers tied so closely to a disk position, e.g. zedstore does not. We shouldn't hack evermore information into the offset, given that background.

> (I wonder if this would also have benefits for the representation of
> in-memory bitmaps?)

Hm. Not sure how?
> > - global indexes (for cross-partition unique constraints and such),
> >   which need a partition identifier as part of the tid (or as part of
> >   the index key, but I think that actually makes interaction with
> >   indexam from other layers more complicated - the inside of the index
> >   maybe may want to represent it as a column, but to the outside that
> >   ought not to be visible)
>
> Can we just use an implementation level attribute for this? Would it
> be so bad if we weren't able to jump straight to the partition number
> without walking through the tuple when the tuple has varwidth
> attributes? (If that isn't acceptable, then we can probably make it
> work for global indexes without having to generalize everything.)

Having to walk through the index tuple might be acceptable - in all likelihood we'll have to do so anyway. It does however not *really* resolve the issue that we still need to pass something tid-like back from the indexam, so we can fetch the associated tuple from the heap, or add the tid to a bitmap. But that could be done separately from the index internal data structures.

> Generalizing the nbtree AM to be able to work with an arbitrary type
> of table row identifier that isn't at all like a TID raises tricky
> definitional questions. It would have to work in a way that made the
> new variety of table row identifier stable, which is a significant new
> requirement (and one that zheap is clearly not interested in).

Hm. I don't see why different types of TID would imply them being stable?

> I am not suggesting that these issues are totally insurmountable. What
> I am saying is this: If we already had "stable logical" TIDs instead
> of "mostly physical TIDs", then generalizing nbtree index tuples to
> store arbitrary table row identifiers would more or less be all about
> the data structure managed by nbtree.
> But that isn't the case, and that strongly discourages me from working
> on this -- we shouldn't talk about the problem as if it is mostly just
> a matter of settling on the best index tuple format.
>
> Frankly I am not very enthusiastic about working on a project that has
> unclear scope and unclear benefits for users.

Why would properly supporting AMs like zedstore, global indexes, "indirect" indexes etc benefit users?

Greetings,

Andres Freund
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Mon, Apr 22, 2019 at 9:35 AM Andres Freund wrote: > On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote: > > Andres has suggested that I work on teaching nbtree to accommodate > > variable-width, logical table identifiers, such as those required for > > indirect indexes, or clustered indexes, where secondary indexes must > > use a logical primary key value instead of a heap TID. I'm revisiting this thread now because it may have relevance to the nbtree deduplication patch. If nothing else, the patch further commits us to the current heap TID format by making assumptions about the width of posting lists with 6 byte TIDs. Though I suppose a posting list almost has to have fixed width TIDs to perform acceptably. Database systems with a varwidth TID format probably don't support anything like posting lists. > I think it's two more cases: > > - table AMs that want to support tables that are bigger than 32TB. That > used to be unrealistic, but it's not anymore. Especially when the need > to VACUUM etc is largely removed / reduced. Can we steal some bits that are currently used for offset number instead? 16 bits is far more than we ever need to use for heap offset numbers in practice. (I wonder if this would also have benefits for the representation of in-memory bitmaps?) > - global indexes (for cross-partition unique constraints and such), > which need a partition identifier as part of the tid (or as part of > the index key, but I think that actually makes interaction with > indexam from other layers more complicated - the inside of the index > maybe may want to represent it as a column, but to the outside that > ought not to be visible) Can we just use an implementation level attribute for this? Would it be so bad if we weren't able to jump straight to the partition number without walking through the tuple when the tuple has varwidth attributes? (If that isn't acceptable, then we can probably make it work for global indexes without having to generalize everything.) 
In general, Postgres heap TIDs are not stable identifiers of rows (they only stably identify HOT chains). This is not the case in other database systems, which may go to great trouble to make it possible to assume that TIDs are stable over time (Andy Pavlo says that our TIDs are physical while Oracle's are logical -- I don't like that terminology, but know what he means).

Generalizing the nbtree AM to be able to work with an arbitrary type of table row identifier that isn't at all like a TID raises tricky definitional questions. It would have to work in a way that made the new variety of table row identifier stable, which is a significant new requirement (and one that zheap is clearly not interested in). If we're willing to support something like a clustered index in nbtree, I wonder what we need to do to make things like numeric display scale still work. For bonus points, describe how unique checking works with a secondary unique index.

I am not suggesting that these issues are totally insurmountable. What I am saying is this: If we already had "stable logical" TIDs instead of "mostly physical TIDs", then generalizing nbtree index tuples to store arbitrary table row identifiers would more or less be all about the data structure managed by nbtree. But that isn't the case, and that strongly discourages me from working on this -- we shouldn't talk about the problem as if it is mostly just a matter of settling on the best index tuple format.

Frankly I am not very enthusiastic about working on a project that has unclear scope and unclear benefits for users.

-- 
Peter Geoghegan
Re: Creating foreign key on partitioned table is too slow
On Thu, Oct 24, 2019 at 04:28:38PM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-10-23 05:59:01 +, kato-...@fujitsu.com wrote:
>> To benchmark with tpcb model, I tried to create a foreign key in the
>> partitioned history table, but backend process killed by OOM. the
>> number of partitions is 8192. I tried in master(commit: ad4b7aeb84).
>
>Obviously this should be improved. But I think it's also worthwhile to
>note that using 8k partitions is very unlikely to be a good choice for
>anything. The metadata, partition pruning, etc overhead is just going
>to be very substantial.

True. Especially with two partitioned tables, each with 8k partitions.

I do think it makes sense to reduce the memory usage, because just eating all available memory (in the extreme case) is not very nice. I've added that patch to the CF, although the patch I shared is very crude and I'm by no means suggesting it's how it should be done ultimately.

The other bit (speed of planning with 8k partitions) is probably a more general issue, and I suppose we'll improve that over time. I don't think there's a simple change magically improving that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel leader process info in EXPLAIN
On Wed, Oct 30, 2019 at 9:24 AM Melanie Plageman wrote: > Checked out the patches a bit and noticed that the tuplesort > instrumentation uses spaceUsed and I saw this comment in > tuplesort_get_stats() > might it be worth trying out the memory accounting API > 5dd7fc1519461548eebf26c33eac6878ea3e8788 here? I made exactly the same suggestion several years back, not long after the API was first proposed by Jeff. However, tuplesort.c has changed a lot since that time, to the extent that that comment now seems kind of obsolete. These days, availMem accounting still isn't used at all for disk sorts. Rather, the slab allocator is used. Virtually all the memory used to merge is now managed by logtape.c, with only fixed per-tape memory buffers within tuplesort.c. This per-tape stuff is tiny anyway -- slightly more than 1KiB per tape. It would now be relatively straightforward to report the memory used by disk-based sorts, without needing to use the memory accounting API. I imagine that it would report the high watermark memory usage during the final merge. It's possible for this to be lower than the high watermark during initial run generation (e.g. because of the MaxAllocSize limit in buffer size within logtape.c), but that still seems like the most useful figure to users. There'd be a new "LogicalTapeSetMemory()" function to go along with the existing LogicalTapeSetBlocks() function, or something along those lines. Not planning to work on this now, but perhaps you have time for it. -- Peter Geoghegan
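The proposed function could amount to summing the per-tape buffer sizes plus the tape set's own bookkeeping (a standalone model with simplified stand-in structs and an invented signature; the real logtape.c data structures differ):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for logtape.c structures. */
typedef struct LogicalTape
{
    size_t buffer_size;     /* read/write buffer held by this tape */
} LogicalTape;

typedef struct LogicalTapeSet
{
    int          ntapes;
    LogicalTape *tapes;
    size_t       overhead;  /* bookkeeping outside the per-tape buffers */
} LogicalTapeSet;

/*
 * Hypothetical LogicalTapeSetMemory(): total memory currently held by the
 * tape set, analogous to the existing LogicalTapeSetBlocks() for disk use.
 */
static size_t
LogicalTapeSetMemory(const LogicalTapeSet *lts)
{
    size_t total = lts->overhead;

    for (int i = 0; i < lts->ntapes; i++)
        total += lts->tapes[i].buffer_size;
    return total;
}
```

Sampling this at the start of the final merge would capture the high watermark figure discussed above, since that is when all tape buffers are allocated at once.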
Re: PL/Python fails on new NetBSD/PPC 8.0 install
I wrote: > Thomas Munro writes: >> From a quick look at the relevant trees, isn't the problem here that >> cpython thinks it can reserve pthread_t value -1 (or rather, that >> number cast to unsigned long, which is the type it uses for its own >> thread IDs): > Yeah, this. I shall now go rant at the Python people about that. Done at https://bugs.python.org/issue38646 regards, tom lane
Re: Proposal: Global Index
Hi,

On 2019-10-30 13:05:57 -0400, Tom Lane wrote:
> Peter Geoghegan writes:
> > On Wed, Oct 30, 2019 at 9:23 AM Tom Lane wrote:
> >> Well, the *effects* of the feature seem desirable, but that doesn't
> >> mean that we want an implementation that actually has a shared index.
> >> As soon as you do that, you've thrown away most of the benefits of
> >> having a partitioned data structure in the first place.
>
> > Right, but that's only the case for the global index. Global indexes
> > are useful when used judiciously.
>
> But ... why bother with partitioning then? To me, the main reasons
> why you might want a partitioned table are

Quite commonly there's a lot of *other* indexes, often on a lot wider data than the primary key, that don't need to be global. And whereas in a lot of cases the primary key in a partitioned table has pretty good locality (and thus will be mostly buffered IO), other indexes will often not have that property (i.e. not have much correlation with table position).

> * ability to cheaply add and remove partitions, primarily so that
>   you can cheaply do things like "delete the oldest month's data".

You can still do that to some degree with a global index. Imagine e.g. keeping a 'partition id' as a sort-of column in the global index. That allows you to drop the partition, without having to immediately rebuild the index, by checking the partition id against the live partitions during lookup. So sure, you're wasting space for a bit in the global index, but it'll also be space that is likely to be fairly efficiently reclaimed the next time vacuum touches the index. And if not, the global index can be rebuilt concurrently without blocking writes.

> * ability to scale past our limits on the physical size of one table
>   --- both the hard BlockNumber-based limit, and the performance
>   constraints of e.g. vacuuming a very large table.
For that to be a problem for a global index the global index (which will often be something like two int4 or int8 columns) itself would need to be above the block number based limit - which doesn't seem that close. WRT vacuuming - based on my observations the table itself isn't a performance problem for vacuuming all that commonly anymore, it's the associated index scans. So yea, that's a problem. Greetings, Andres Freund
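The partition-id filtering described earlier in this message could look roughly like this during an index scan (a self-contained sketch with invented names; a real implementation would consult the catalogs rather than an array): entries from dropped partitions are simply skipped, so DROP/DETACH need not rebuild the global index immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A global-index entry carries a partition id alongside the heap tid. */
typedef struct GlobalIndexTuple
{
    uint32_t partition_id;
    uint64_t heap_tid;      /* opaque here */
} GlobalIndexTuple;

/* Stand-in for "is this partition still attached?" */
static bool
partition_is_live(const uint32_t *live, int nlive, uint32_t pid)
{
    for (int i = 0; i < nlive; i++)
        if (live[i] == pid)
            return true;
    return false;
}

/*
 * Filter scan results: tuples whose partition is no longer attached are
 * dropped on the way out; vacuum would reclaim them from the index later.
 */
static int
filter_live_entries(const GlobalIndexTuple *in, int n,
                    const uint32_t *live, int nlive,
                    GlobalIndexTuple *out)
{
    int nout = 0;

    for (int i = 0; i < n; i++)
        if (partition_is_live(live, nlive, in[i].partition_id))
            out[nout++] = in[i];
    return nout;
}
```

The same partition id is what would let a scan be "pushed down" to a single partition, as mentioned elsewhere in this thread.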
Re: [Proposal] Add accumulated statistics
On Tue, Jan 15, 2019 at 2:14 AM Tsunakawa, Takayuki <tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Pavel Stehule [mailto:pavel.steh...@gmail.com]
> > the cumulated lock statistics maybe doesn't help with debugging - but it
> > is very good indicator of database (in production usage) health.
>
> I think it will help both. But I don't think the sampling will be as
> helpful as the precise lock statistics accumulation, because the sampling
> doesn't give us exactly how effective our improvements to PostgreSQL code
> are. I remember PG developers used LOCK_STATS to see how many (or ratio
> of) lwlock waits decreased by applying patches.
>
> We can use the cumulated lock stats like:
>
> 1. SELECT * FROM pg_session_waits;
> 2. Run a benchmark.
> 3. SELECT * FROM pg_session_waits;
> 4. Calculate the difference between 1 and 3.
>
> Or, reset the wait stats before the benchmark run and just use the stats
> as-is.
>
> I'd like to know why you thought the cumulated wait stats isn't helpful
> for debugging.

I don't remember my thoughts, and maybe I used the wrong sentences - usually lock times are very small, and very unstable if you use too detailed a level. But if you use aggregated values per some longer time window, then these values can be stable and very interesting. More - usually lock time has a correlation with database (application) health.

Like you, I don't think sampled values are too helpful.

Regards

Pavel

> Regards
> Takayuki Tsunakawa
Re: Proposal: Global Index
Peter Geoghegan writes:
> On Wed, Oct 30, 2019 at 9:23 AM Tom Lane wrote:
>> Well, the *effects* of the feature seem desirable, but that doesn't
>> mean that we want an implementation that actually has a shared index.
>> As soon as you do that, you've thrown away most of the benefits of
>> having a partitioned data structure in the first place.

> Right, but that's only the case for the global index. Global indexes
> are useful when used judiciously.

But ... why bother with partitioning then? To me, the main reasons why you might want a partitioned table are

* ability to cheaply add and remove partitions, primarily so that you can cheaply do things like "delete the oldest month's data".

* ability to scale past our limits on the physical size of one table --- both the hard BlockNumber-based limit, and the performance constraints of e.g. vacuuming a very large table.

Both of those go out the window with a global index. So you might as well just have one table and forget all the overhead.

regards, tom lane
Re: Proposal: Global Index
On Wed, Oct 30, 2019 at 9:23 AM Tom Lane wrote: > Well, the *effects* of the feature seem desirable, but that doesn't > mean that we want an implementation that actually has a shared index. > As soon as you do that, you've thrown away most of the benefits of > having a partitioned data structure in the first place. Right, but that's only the case for the global index. Global indexes are useful when used judiciously. They enable the use of partitioning for use cases where not being able to enforce uniqueness across all partitions happens to be a deal breaker. I bet that this is fairly common. > No, I don't have an idea how we might support, eg, uniqueness of > non-partition-key columns without that. But we need to spend our > effort on figuring that out, not on building a complicated mechanism > whose performance is never going to not suck. I don't think that there is a way to solve the problem that doesn't look very much like a global index. Also, being able to push down a partition number when scanning a global index seems like it would be very compelling in some scenarios. I'm a bit worried about the complexity that will need to be added to nbtree to make global indexes work, but it's probably possible to come up with something that isn't too bad. GIN already uses an implementation level attribute number column for multi-column GIN indexes, which is a little like what Ibrar has in mind. The really complicated new code required for global indexes will be in places like vacuumlazy.c. -- Peter Geoghegan
Re: Parallel leader process info in EXPLAIN
On Wed, Oct 23, 2019 at 12:30 AM Thomas Munro wrote:
>
> While working on some slides explaining EXPLAIN, I couldn't resist the
> urge to add the missing $SUBJECT. The attached 0001 patch gives the
> following:
>
>  Gather ... time=0.146..33.077 rows=1 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    Buffers: shared hit=4425
>    ->  Parallel Seq Scan on public.t ... time=19.421..30.092 rows=0 loops=3)
>          Filter: (t.i = 42)
>          Rows Removed by Filter: 33
>          Leader: actual time=0.013..32.025 rows=1 loops=1    <--- NEW
>            Buffers: shared hit=1546                          <--- NEW
>          Worker 0: actual time=29.069..29.069 rows=0 loops=1
>            Buffers: shared hit=1126
>          Worker 1: actual time=29.181..29.181 rows=0 loops=1
>            Buffers: shared hit=1753
>
> Without that, you have to deduce what work was done in the leader, but
> I'd rather just show it.
>
> The 0002 patch adjusts Sort for consistency with that scheme, so you get:
>
>  Sort ... time=84.303..122.238 rows=33 loops=3)
>    Output: t1.i
>    Sort Key: t1.i
>    Leader: Sort Method: external merge  Disk: 5864kB    <--- DIFFERENT
>    Worker 0: Sort Method: external merge  Disk: 3376kB
>    Worker 1: Sort Method: external merge  Disk: 4504kB
>    Leader: actual time=119.624..165.949 rows=426914 loops=1
>    Worker 0: actual time=61.239..90.984 rows=245612 loops=1
>    Worker 1: actual time=72.046..109.782 rows=327474 loops=1
>
> Without the "Leader" label, it's not really clear to the uninitiated
> whether you're looking at combined, average or single process numbers.

Cool! I dig it.

Checked out the patches a bit and noticed that the tuplesort instrumentation uses spaceUsed and I saw this comment in tuplesort_get_stats():

	/*
	 * Note: it might seem we should provide both memory and disk usage for a
	 * disk-based sort.  However, the current code doesn't track memory space
	 * accurately once we have begun to return tuples to the caller (since we
	 * don't account for pfree's the caller is expected to do), so we cannot
	 * rely on availMem in a disk sort.  This does not seem worth the overhead
	 * to fix.  Is it worth creating an API for the memory context code to
	 * tell us how much is actually used in sortcontext?
	 */

might it be worth trying out the memory accounting API 5dd7fc1519461548eebf26c33eac6878ea3e8788 here?

> Of course there are some more things that could be reported in a
> similar way eventually, such as filter counters and hash join details.

This made me think about other explain wishlist items. For parallel hashjoin, I would find it useful to know which batches each worker participated in (maybe just probing to start with, but loading would be great too). I'm not sure anyone else (especially users) would care about this, though.

-- 
Melanie Plageman
Re: Proposal: Global Index
Robert Haas writes: > On Wed, Oct 30, 2019 at 10:13 AM Tom Lane wrote: >> I believe that the current design of partitioning is explicitly intended >> to avoid the need for such a construct. It'd be absolutely disastrous >> to have such a thing from many standpoints, including the breadth of >> locking needed to work with the global index, the difficulty of vacuuming, >> and the impossibility of cheaply attaching or detaching partitions. >> In other words, this is a "feature" we do not want. > I don't think that's true. Certainly, a lot of EnterpriseDB customers > want this feature - it comes up regularly in discussions here. But > that is not to say that the technical challenges are not formidable, > and I don't think this proposal really grapples with any of them. > However, that doesn't mean that the feature isn't desirable. Well, the *effects* of the feature seem desirable, but that doesn't mean that we want an implementation that actually has a shared index. As soon as you do that, you've thrown away most of the benefits of having a partitioned data structure in the first place. No, I don't have an idea how we might support, eg, uniqueness of non-partition-key columns without that. But we need to spend our effort on figuring that out, not on building a complicated mechanism whose performance is never going to not suck. regards, tom lane
Re: tableam vs. TOAST
On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu wrote: > While testing the Toast patch(PG+v7 patch) I found below server crash. > System configuration: > VCPUs: 4, RAM: 8GB, Storage: 320GB > > This issue is not frequently reproducible, we need to repeat the same > testcase multiple times. > I wonder if this is an independent bug, because the backtrace doesn't look like it's related to the stuff this is changing. Your report doesn't specify whether you can also reproduce the problem without the patch, which is something that you should always check before reporting a bug in a particular patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: Global Index
On Wed, Oct 30, 2019 at 10:13 AM Tom Lane wrote: > I believe that the current design of partitioning is explicitly intended > to avoid the need for such a construct. It'd be absolutely disastrous > to have such a thing from many standpoints, including the breadth of > locking needed to work with the global index, the difficulty of vacuuming, > and the impossibility of cheaply attaching or detaching partitions. > > In other words, this is a "feature" we do not want. I don't think that's true. Certainly, a lot of EnterpriseDB customers want this feature - it comes up regularly in discussions here. But that is not to say that the technical challenges are not formidable, and I don't think this proposal really grapples with any of them. However, that doesn't mean that the feature isn't desirable. One of the biggest reasons why people want it is to enforce uniqueness for secondary keys - e.g. the employees table is partitioned by employee ID, but SSN should also be unique, at least among employees for whom it's not NULL. But people also want it for faster data retrieval: if you're looking for a commonly-occurring value, an index per partition is fine. But if you're looking for values that occur only once or a few times across the whole hierarchy, an index scan per partition is very costly. Consider, e.g.:

Nested Loop
  -> Seq Scan
  -> Append
       -> Index Scan on each_partition

You don't have to have very many partitions for that to suck, and it's a thing that people want to do. Runtime partition pruning helps with this case a lot, but, once again, only for the primary key. Secondary keys are a big problem for partitioning today, in many ways. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
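[Editor's note: the secondary-key limitation Robert describes can be sketched in SQL. This is a hypothetical illustration with invented table and column names, not from the thread; on current PostgreSQL a unique index on a partitioned table must include the partition key, which is exactly the gap a global index would fill.]

```sql
CREATE TABLE employees (
    employee_id int NOT NULL,
    ssn         text
) PARTITION BY HASH (employee_id);

-- Works: the index includes the partition key, so each partition
-- can enforce uniqueness locally.
CREATE UNIQUE INDEX ON employees (employee_id);

-- Fails today: the unique index omits the partition key, so no
-- single per-partition index can enforce it across the hierarchy.
CREATE UNIQUE INDEX ON employees (ssn);
```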
Re: v12.0: ERROR: could not find pathkey item to sort
Amit Langote writes: > Attached updated patches. [ looks at that... ] I seriously, seriously dislike what you did in build_join_rel, ie adding the new joinrel to the global data structures before it's fully filled in. That's inevitably going to bite us on the ass someday, and you couldn't even be bothered with a comment. Worse, the reason you did that seems to be so that generate_join_implied_equalities can find and modify the joinrel, which is an undocumented and not very well thought out side-effect. There are other call sites for that where the joinrel may or may not already exist, meaning that it might or might not add more members into the joinrel's eclass_indexes. At best that's wasted work, and at worst it costs efficiency, since we could in principle get different sets of common relids depending on which join input pair we're considering. They're equally valid sets, but unioning them is just going to add more relids than we need. Also, the existing logic around eclass_indexes is that it's only set for baserels and we know it is valid after we've finished EC merging. I don't much like modifying add_child_rel_equivalences to have some different opinions about that for joinrels. It'd probably be better to just eat the cost of doing get_common_eclass_indexes over again when it's time to do add_child_rel_equivalences for a joinrel, since we aren't (I hope) going to do that more than once per joinrel anyway. This would probably require refactoring things so that there are separate entry points to add child equivalences for base rels and join rels. But that seems cleaner anyway than what you've got here. David --- much of the complexity here comes from the addition of the eclass_indexes infrastructure, so do you have any thoughts? regards, tom lane
Re: PL/Python fails on new NetBSD/PPC 8.0 install
Thomas Munro writes: > On Wed, Oct 30, 2019 at 9:25 AM Tom Lane wrote: >> What I'm inclined to do is go file a bug report saying that this >> behavior contradicts both POSIX and NetBSD's own man page, and >> see what they say about that. So I went and filed that bug, http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=54661 and the answer seems to be that netbsd's libpthread is operating as designed. They don't support creating new threads if libpthread wasn't present at main program start, so redirecting all the entry points to the libc stub functions in that case is actually pretty sane, self-consistent behavior. This behavior is actually kinda useful from our standpoint: it means that a perlu/pythonu/tclu function *can't* cause a backend to become multithreaded, even if it tries. So I definitely don't want to "fix" this by linking libpthread to the core backend; that would open us up to problems we needn't have, on this platform anyway. > From a quick look at the relevant trees, isn't the problem here that > cpython thinks it can reserve pthread_t value -1 (or rather, that > number cast to unsigned long, which is the type it uses for its own > thread IDs): Yeah, this. I shall now go rant at the Python people about that. regards, tom lane
Re: WIP/PoC for parallel backup
On Mon, Oct 28, 2019 at 8:29 PM Robert Haas wrote: > On Mon, Oct 28, 2019 at 10:03 AM Asif Rehman > wrote: > > I have updated the patch to include the changes suggested by Jeevan. > This patch also implements the thread workers instead of > > processes and fetches a single file at a time. The tar format has been > disabled for first version of parallel backup. > > Looking at 0001-0003: > > It's not clear to me what the purpose of the start WAL location is > supposed to be. As far as I can see, SendBackupFiles() stores it in a > variable which is then used for exactly nothing, and nothing else uses > it. It seems like that would be part of a potential incremental > backup feature, but I don't see what it's got to do with parallel full > backup. > 'startptr' is used by sendFile() during checksum verification. Since SendBackupFiles() uses sendFile(), we have to set a valid WAL location. > The tablespace_path option appears entirely unused, and I don't know > why that should be necessary here, either. > This is to calculate the basepathlen. We need to exclude the tablespace location (or base path) from the filename before it is sent to the client with the sendFile() call. I added this option primarily to avoid performing string manipulation on the filename to extract the tablespace location and then calculate the basepathlen. Alternatively, we can do it by extracting the base path from the received filename. What do you suggest? > > STORE_BACKUPFILE() seems like maybe it should be a function rather > than a macro, and also probably be renamed, because it doesn't store > files and the argument's not necessarily a file. > Sure. > > SendBackupManifest() does not send a backup manifest in the sense > contemplated by the email thread on that subject. It sends a file > list. That seems like the right idea - IMHO, anyway - but you need to > do a thorough renaming.
> I'm considering the following command names:

START_BACKUP - Starts the backup process.
SEND_BACKUP_FILELIST (instead of SEND_BACKUP_MANIFEST) - Sends the list of all files to be backed up (along with file information such as filename, file type (directory/file/link), file size, and file mtime for each file).
SEND_BACKUP_FILES - Sends one or more files to the client.
STOP_BACKUP - Stops the backup process.

I'll update the function names accordingly after your confirmation. Of course, suggestions for better names are welcome. > > I think it would be fine to decide that this facility won't support > exclusive-mode backup. > Sure. Will drop this patch. > > I don't think much of having both sendDir() and sendDir_(). The latter > name is inconsistent with any naming convention we have, and there > seems to be no reason not to just add an argument to sendDir() and > change the callers. > I think we should rename - perhaps as a preparatory patch - the > sizeonly flag to dryrun, or something like that. > Sure, will take care of it. > The resource cleanup does not look right. You've included calls to > PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, 0) in both StartBackup() > and StopBackup(), but what happens if there is an error or even a > clean shutdown of the connection in between? I think that there needs to be some change here to ensure that a walsender will always call > base_backup_cleanup() when it exits; I think that'd probably remove > the need for any PG_ENSURE_ERROR_CLEANUP calls at all, including ones > we have already. This might also be something that could be done as a > separate, preparatory refactoring patch. > You're right. I didn't handle this case properly. I will remove the PG_ENSURE_ERROR_CLEANUP calls and replace them with a before_shmem_exit handler. This way, whenever the backend process exits, base_backup_cleanup will be called: - If it exits before calling do_pg_stop_backup, base_backup_cleanup will take care of the cleanup.
- Otherwise, in the case of a clean shutdown (after calling do_pg_stop_backup), base_backup_cleanup will simply return without doing anything. -- Asif Rehman Highgo Software (Canada/China/Pakistan) URL : www.highgo.ca
Re: Proposal: Global Index
Ibrar Ahmed writes: > A global index by very definition is a single index on the parent table > that maps to many > underlying table partitions. I believe that the current design of partitioning is explicitly intended to avoid the need for such a construct. It'd be absolutely disastrous to have such a thing from many standpoints, including the breadth of locking needed to work with the global index, the difficulty of vacuuming, and the impossibility of cheaply attaching or detaching partitions. In other words, this is a "feature" we do not want. regards, tom lane
Re: Add SQL function to show total block numbers in the relation
btkimurayuzk writes: > I propose new simple sql query, which shows total block numbers in the > relation. > ... > Of cource, we can know this value such as > select (pg_relation_size('t') / > current_setting('block_size')::bigint)::int; I don't really see why the existing solution isn't sufficient. regards, tom lane
Re: Binary support for pgoutput plugin
On Sun, 27 Oct 2019 at 11:00, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Mon, Jun 17, 2019 at 10:29:26AM -0400, Dave Cramer wrote: > > > Which is what I have done. Thanks > > > > > > I've attached both patches for comments. > > > I still have to add documentation. > > > > Additional patch for documentation. > > Thank you for the patch! Unfortunately 0002 has some conflicts, could > you please send a rebased version? In the meantime I have few questions: > > -LogicalRepRelId > +void > logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup) > { > charaction; > - LogicalRepRelId relid; > - > - /* read the relation id */ > - relid = pq_getmsgint(in, 4); > > action = pq_getmsgbyte(in); > if (action != 'N') > @@ -175,7 +173,6 @@ logicalrep_read_insert(StringInfo in, > LogicalRepTupleData *newtup) > > logicalrep_read_tuple(in, newtup); > > - return relid; > } > > > > - relid = logicalrep_read_insert(s, ); > + /* read the relation id */ > + relid = pq_getmsgint(s, 4); > rel = logicalrep_rel_open(relid, RowExclusiveLock); > + > + logicalrep_read_insert(s, ); > > Maybe I'm missing something, what is the reason of moving pq_getmsgint > out of logicalrep_read_*? Just from the patch it seems that the code is > equivalent. > > > There is one obvious hack where in binary mode I reset the input > > cursor to allow the binary input to be re-read From what I can tell > > the alternative is to convert the data in logicalrep_read_tuple but > > that would require moving a lot of the logic currently in worker.c to > > proto.c. This seems minimally invasive. > > Which logic has to be moved for example? Actually it sounds more natural > to me, if this functionality would be in e.g. logicalrep_read_tuple > rather than slot_store_data, since it has something to do with reading > data. And it seems that in pglogical it's also located in > pglogical_read_tuple. 
> Ok, I've rebased and reverted logicalrep_read_insert. As to the last comment, honestly it's been so long I can't remember why I put that comment in there. Thanks for reviewing, Dave 0001-First-pass-at-working-code-without-subscription-opti.patch Description: Binary data 0003-document-new-binary-option-for-CREATE-SUBSCRIPTION.patch Description: Binary data 0002-add-binary-column-to-pg_subscription.patch Description: Binary data 0004-get-relid-inside-of-logical_read_insert.patch Description: Binary data
Remove HAVE_LONG_LONG_INT
HAVE_LONG_LONG_INT is now implied by the requirement for C99, so the separate Autoconf check can be removed. The uses are almost all in ecpg code, and AFAICT the check was originally added specifically for ecpg. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 913823e1985f14908d7e1abe6215f63982b722a1 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 30 Oct 2019 14:39:15 +0100 Subject: [PATCH] Remove HAVE_LONG_LONG_INT The presence of long long int is now implied in the requirement for C99 and the configure check for the same. We keep the define hard-coded in ecpg_config.h for backward compatibility. --- configure | 117 -- configure.in | 1 - src/include/c.h | 2 - src/include/pg_config.h.in| 6 - src/include/pg_config.h.win32 | 3 - src/interfaces/ecpg/ecpglib/data.c| 14 --- src/interfaces/ecpg/ecpglib/descriptor.c | 4 - src/interfaces/ecpg/ecpglib/execute.c | 6 +- src/interfaces/ecpg/ecpglib/misc.c| 6 - src/interfaces/ecpg/include/ecpg_config.h.in | 2 +- src/interfaces/ecpg/preproc/ecpg.trailer | 36 +- src/interfaces/ecpg/test/expected/sql-sqlda.c | 2 - src/interfaces/ecpg/test/sql/sqlda.pgc| 2 - 13 files changed, 7 insertions(+), 194 deletions(-) diff --git a/configure b/configure index 6b1c779ee3..7312bd7a7a 100755 --- a/configure +++ b/configure @@ -14170,123 +14170,6 @@ fi - { $as_echo "$as_me:${as_lineno-$LINENO}: checking for unsigned long long int" >&5 -$as_echo_n "checking for unsigned long long int... " >&6; } -if ${ac_cv_type_unsigned_long_long_int+:} false; then : - $as_echo_n "(cached) " >&6 -else - ac_cv_type_unsigned_long_long_int=yes - if test "x${ac_cv_prog_cc_c99-no}" = xno; then - cat confdefs.h - <<_ACEOF >conftest.$ac_ext -/* end confdefs.h. */ - - /* For now, do not test the preprocessor; as of 2007 there are too many -implementations with broken preprocessors. Perhaps this can -be revisited in 2012. 
In the meantime, code should not expect -#if to work with literals wider than 32 bits. */ - /* Test literals. */ - long long int ll = 9223372036854775807ll; - long long int nll = -9223372036854775807LL; - unsigned long long int ull = 18446744073709551615ULL; - /* Test constant expressions. */ - typedef int a[((-9223372036854775807LL < 0 && 0 < 9223372036854775807ll) -? 1 : -1)]; - typedef int b[(18446744073709551615ULL <= (unsigned long long int) -1 -? 1 : -1)]; - int i = 63; -int -main () -{ -/* Test availability of runtime routines for shift and division. */ - long long int llmax = 9223372036854775807ll; - unsigned long long int ullmax = 18446744073709551615ull; - return ((ll << 63) | (ll >> 63) | (ll < i) | (ll > i) - | (llmax / ll) | (llmax % ll) - | (ull << 63) | (ull >> 63) | (ull << i) | (ull >> i) - | (ullmax / ull) | (ullmax % ull)); - ; - return 0; -} - -_ACEOF -if ac_fn_c_try_link "$LINENO"; then : - -else - ac_cv_type_unsigned_long_long_int=no -fi -rm -f core conftest.err conftest.$ac_objext \ -conftest$ac_exeext conftest.$ac_ext - fi -fi -{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_type_unsigned_long_long_int" >&5 -$as_echo "$ac_cv_type_unsigned_long_long_int" >&6; } - if test $ac_cv_type_unsigned_long_long_int = yes; then - -$as_echo "#define HAVE_UNSIGNED_LONG_LONG_INT 1" >>confdefs.h - - fi - - - - { $as_echo "$as_me:${as_lineno-$LINENO}: checking for long long int" >&5 -$as_echo_n "checking for long long int... " >&6; } -if ${ac_cv_type_long_long_int+:} false; then : - $as_echo_n "(cached) " >&6 -else - ac_cv_type_long_long_int=yes - if test "x${ac_cv_prog_cc_c99-no}" = xno; then - ac_cv_type_long_long_int=$ac_cv_type_unsigned_long_long_int - if test $ac_cv_type_long_long_int = yes; then - if test "$cross_compiling" = yes; then : - : -else - cat confdefs.h - <<_ACEOF >conftest.$ac_ext -/* end confdefs.h. 
*/ -#include -#ifndef LLONG_MAX -# define HALF \ - (1LL << (sizeof (long long int) * CHAR_BIT - 2)) -# define LLONG_MAX (HALF - 1 + HALF) -#endif -int -main () -{ -long long int n = 1; -int i; -for (i = 0; ; i++) - { -long long int m = n << i; -if (m >> i != n) - return 1; -if (LLONG_MAX / 2 < m) - break; - } -return 0; - ; - return 0; -} -_ACEOF -if ac_fn_c_try_run "$LINENO"; then : - -else -
MarkBufferDirtyHint() and LSN update
Please consider this scenario (race conditions):

1. FlushBuffer() has written the buffer but hasn't yet managed to clear the BM_DIRTY flag (however, BM_JUST_DIRTIED could be cleared by now).

2. Another backend modified a hint bit and called MarkBufferDirtyHint().

3. In MarkBufferDirtyHint(), if XLogHintBitIsNeeded() evaluates to true (e.g. due to checksums being enabled), a new LSN is computed; however, it's not assigned to the page because the buffer is still dirty:

if (!(buf_state & BM_DIRTY))
{
    ...
    if (!XLogRecPtrIsInvalid(lsn))
        PageSetLSN(page, lsn);
}

4. MarkBufferDirtyHint() completes.

5. In the first session, FlushBuffer()->TerminateBufferIO() will not clear BM_DIRTY because MarkBufferDirtyHint() has eventually set BM_JUST_DIRTIED.

Thus the hint bit change itself will be written by the next call of FlushBuffer(). However, the page LSN hasn't been updated, so the requirement that WAL must be flushed first is not met. I think that PageSetLSN() should be called regardless of BM_DIRTY. Do I miss any subtle detail? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Problem with synchronous replication
On Wed, Oct 30, 2019 at 05:21:17PM +0900, Fujii Masao wrote: > This change causes every ending backends to always take the exclusive lock > even when it's not in SyncRep queue. This may be problematic, for example, > when terminating multiple backends at the same time? If yes, > it might be better to check SHMQueueIsDetached() again after taking the lock. > That is, > > if (!SHMQueueIsDetached(&(MyProc->syncRepLinks))) > { > LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); > if (!SHMQueueIsDetached(&(MyProc->syncRepLinks))) > SHMQueueDelete(&(MyProc->syncRepLinks)); > LWLockRelease(SyncRepLock); > } Makes sense. Thanks for the suggestion. -- Michael signature.asc Description: PGP signature
Re: Unix-domain socket support on Windows
To move this topic a long, I'll submit some preparatory patches in a committable order. First is the patch to deal with getpeereid() that was already included in the previous patch series. This is just some refactoring that reduces the difference between Windows and other platforms and prepares the Unix-domain socket specific code to compile cleanly on Windows. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From adcad11992f4fde0a1eda4f96fef4abeb8570cdc Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 30 Oct 2019 12:58:32 +0100 Subject: [PATCH v4] Sort out getpeereid() and peer auth handling on Windows The getpeereid() uses have so far been protected by HAVE_UNIX_SOCKETS, so they didn't ever care about Windows support. But in anticipation of Unix-domain socket support on Windows, that needs to be handled differently. Windows doesn't support getpeereid() at this time, so we use the existing not-supported code path. We let configure do its usual thing of picking up the replacement from libpgport, instead of the custom overrides that it was doing before. But then Windows doesn't have struct passwd, so this patch sprinkles some additional #ifdef WIN32 around to make it work. This is similar to existing code that deals with this issue. 
--- configure | 36 +-- configure.in | 11 -- src/backend/libpq/auth.c | 17 +++ src/include/port.h| 2 +- src/interfaces/libpq/fe-connect.c | 10 ++--- src/tools/msvc/Mkvcbuild.pm | 2 +- 6 files changed, 36 insertions(+), 42 deletions(-) diff --git a/configure b/configure index 6b1c779ee3..5d4998d9e1 100755 --- a/configure +++ b/configure @@ -15743,6 +15743,19 @@ esac fi +ac_fn_c_check_func "$LINENO" "getpeereid" "ac_cv_func_getpeereid" +if test "x$ac_cv_func_getpeereid" = xyes; then : + $as_echo "#define HAVE_GETPEEREID 1" >>confdefs.h + +else + case " $LIBOBJS " in + *" getpeereid.$ac_objext "* ) ;; + *) LIBOBJS="$LIBOBJS getpeereid.$ac_objext" + ;; +esac + +fi + ac_fn_c_check_func "$LINENO" "getrusage" "ac_cv_func_getrusage" if test "x$ac_cv_func_getrusage" = xyes; then : $as_echo "#define HAVE_GETRUSAGE 1" >>confdefs.h @@ -15921,19 +15934,13 @@ $as_echo "$as_me: On $host_os we will use our strtof wrapper." >&6;} esac case $host_os in - # Windows uses a specialised env handler -# and doesn't need a replacement getpeereid because it doesn't use -# Unix sockets. 
mingw*) $as_echo "#define HAVE_UNSETENV 1" >>confdefs.h - -$as_echo "#define HAVE_GETPEEREID 1" >>confdefs.h - ac_cv_func_unsetenv=yes -ac_cv_func_getpeereid=yes;; +;; *) ac_fn_c_check_func "$LINENO" "unsetenv" "ac_cv_func_unsetenv" if test "x$ac_cv_func_unsetenv" = xyes; then : @@ -15948,21 +15955,8 @@ esac fi -ac_fn_c_check_func "$LINENO" "getpeereid" "ac_cv_func_getpeereid" -if test "x$ac_cv_func_getpeereid" = xyes; then : - $as_echo "#define HAVE_GETPEEREID 1" >>confdefs.h -else - case " $LIBOBJS " in - *" getpeereid.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS getpeereid.$ac_objext" - ;; -esac - -fi - - - ;; +;; esac # System's version of getaddrinfo(), if any, may be used only if we found diff --git a/configure.in b/configure.in index 2b9025cac3..ebbeea8b2b 100644 --- a/configure.in +++ b/configure.in @@ -1718,6 +1718,7 @@ AC_REPLACE_FUNCS(m4_normalize([ explicit_bzero fls getopt + getpeereid getrusage inet_aton mkdtemp @@ -1746,18 +1747,14 @@ case $host_os in esac case $host_os in - # Windows uses a specialised env handler -# and doesn't need a replacement getpeereid because it doesn't use -# Unix sockets. mingw*) AC_DEFINE(HAVE_UNSETENV, 1, [Define to 1 because replacement version used.]) -AC_DEFINE(HAVE_GETPEEREID, 1, [Define to 1 because function not required.]) ac_cv_func_unsetenv=yes -ac_cv_func_getpeereid=yes;; +;; *) -AC_REPLACE_FUNCS([unsetenv getpeereid]) - ;; +AC_REPLACE_FUNCS([unsetenv]) +;; esac # System's version of getaddrinfo(), if any, may be used only if we found diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c index d28271c1d8..0bafb83557 100644 --- a/src/backend/libpq/auth.c +++ b/src/backend/libpq/auth.c @@ -78,9 +78,7 @@ static intident_inet(hbaPort *port); * Peer authentication * */ -#ifdef HAVE_UNIX_SOCKETS static int auth_peer(hbaPort *port); -#endif /* @@ -559,11 +557,7
Re: Remove one use of IDENT_USERNAME_MAX
On 2019-10-29 15:34, Tom Lane wrote: Peter Eisentraut writes: On 2019-10-28 14:45, Tom Lane wrote: Kyotaro Horiguchi writes: In think one of the reasons for the coding is the fact that *pw is described to be placed in the static area, which can be overwritten by succeeding calls to getpw*() functions. Good point ... so maybe pstrdup instead of using a fixed-size buffer? Maybe. Or we just decide that check_usermap() is not allowed to call getpw*(). It's just a string-matching routine, so it doesn't have any such business anyway. I'm okay with that as long as you add a comment describing this assumption. Committed with a pstrdup(). That seemed more consistent with other code in that file. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgbench - extend initialization phase control
On Mon, Oct 28, 2019 at 10:36 PM Fabien COELHO wrote: > > > Hello Masao-san, > > >> Maybe. If you cannot check, you can only guess. Probably it should be > >> small, but the current version does not allow to check whether it is so. > > > > Could you elaborate what you actually want to measure the performance > > impact by adding explicit begin and commit? Currently pgbench -i issues > > the following queries. The data generation part is already executed within > > single transaction. You want to execute not only data generation but also > > drop/creation of tables within single transaction, and measure how much > > performance impact happens? I'm sure that would be negligible. > > Or you want to execute data generate in multiple transactions, i.e., > > execute each statement for data generation (e.g., one INSERT) in single > > transaction, and then want to measure the performance impact? > > But the patch doesn't enable us to do such data generation yet. > > Indeed, you cannot do this precise thing, but you can do others. > > > So I'm thinking that it's maybe better to commit the addtion of "G" option > > first separately. And then we can discuss how much "(" and ")" options > > are useful later. > > Attached patch v6 only provides G - server side data generation. Thanks for the patch! + snprintf(sql, sizeof(sql), + "insert into pgbench_branches(bid,bbalance) " + "select bid, 0 " + "from generate_series(1, %d) as bid", scale); "scale" should be "nbranches * scale". + snprintf(sql, sizeof(sql), + "insert into pgbench_accounts(aid,bid,abalance,filler) " + "select aid, (aid - 1) / %d + 1, 0, '' " + "from generate_series(1, %d) as aid", naccounts, scale * naccounts); Like client-side data generation, INT64_FORMAT should be used here instead of %d? If large scale factor is specified, the query for generating pgbench_accounts data can take a very long time. While that query is running, operators may be likely to do Ctrl-C to cancel the data generation. 
In this case, IMO pgbench should cancel the query, i.e., call PQcancel(). Otherwise, the query will keep running to the end. - for (step = initialize_steps; *step != '\0'; step++) + for (const char *step = initialize_steps; *step != '\0'; step++) Per PostgreSQL basic coding style, ISTM that "const char *step" should be declared separately from "for" loop, like the original. - fprintf(stderr, "unrecognized initialization step \"%c\"\n", + fprintf(stderr, + "unrecognized initialization step \"%c\"\n" + "Allowed step characters are: \"" ALL_INIT_STEPS "\".\n", *step); - fprintf(stderr, "allowed steps are: \"d\", \"t\", \"g\", \"v\", \"p\", \"f\"\n"); The original message seems better to me. So what about just appending "G" into the above latter message? That is, "allowed steps are: \"d\", \"t\", \"g\", \"G\", \"v\", \"p\", \"f\"\n" - g (Generate data) + g or G (Generate data, client or server side) Isn't it better to explain a bit more what "client-side / server-side data generation" is? For example, something like When "g" (client-side data generation) is specified, data is generated in pgbench client and sent to the server. When "G" (server-side data generation) is specified, only queries are sent from pgbench client and then data is generated in the server. If the network bandwidth is low between pgbench and the server, using "G" might make the data generation faster. Regards, -- Fujii Masao
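[Editor's note: for context, initialization steps are selected with pgbench's -I option. A run using the server-side generation step discussed above might look like this hypothetical invocation -- the G step exists only with the patch applied; the remaining letters are existing steps.]

```
# drop tables, create tables, server-side data generation,
# vacuum, primary keys
pgbench -i -I dtGvp -s 100 testdb
```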
Re: v12.0: ERROR: could not find pathkey item to sort
Thanks for taking a look and sorry about the delay in replying. On Fri, Oct 25, 2019 at 1:51 AM Tom Lane wrote: > Amit Langote writes: > > On Mon, Oct 14, 2019 at 11:54 PM Tom Lane wrote: > >> In view of the proposed patches being dependent on some other > >> 13-only changes, I wonder if we should fix v12 by reverting > >> d25ea0127. The potential planner performance loss for large > >> partition sets could be nasty, but failing to plan at all is worse. > > > Actually, the patch I proposed to fix equivalence code can be applied > > on its own. The example I gave on that thread needs us to fix > > partitionwise code to even work, which is perhaps a 13-only change, > > but we have an example here that is broken due to d25ea01275. > > Perhaps, considering applying my patch seems better than alternatives > > which are either reverting d25ea01275 or avoiding doing partitionwise > > join for such queries. > > I looked at this a bit, and I see that the core idea is to generate > the missing EC members by applying add_child_rel_equivalences when > building child join rels. Perhaps we can make that work, but this > patch fails to, because you've ignored the comment pointing out > that the nullable_relids fixup logic only works for baserels: > > * And likewise for nullable_relids. Note this code assumes > * parent and child relids are singletons. > > We could improve that to work for joinrels, I think, but it would become > very much more complicated (and slower). AFAICS the logic would have > to be "iterate through top_parent_relids, see which ones are in > em_nullable_relids, and for each one that is, find the corresponding > child_relid and substitute that in new_nullable_relids". This is a > bit of a problem because we don't have any really convenient way to > map individual top parent relids to child relids. Actually, there is adjust_child_relids_multilevel() which translates the top parent relids in em_nullable_relids to child relids. 
I have updated the patches that way. > I wonder if we > should extend AppendRelInfo to include the top parent relid as well as > the immediate parent. (Perhaps, while we're at it, we could make > adjust_appendrel_attrs_multilevel less of an inefficient and > underdocumented mess.) Hmm, I agree we should try to fix that situation somehow. Even better if we could do away with child expressions in ECs, because they cause EC related code to show up in profiles when executing complex queries with thousands of partitions. > Also, I'm pretty sure this addition is wrong/broken: > > + > +/* > + * There aren't going to be more expressions to translate in > + * the same EC. > + */ > +break; > > What makes you think that an EC can only contain one entry per rel? I was wrong about that. Fixed. > More generally, as long as this patch requires changing > add_child_rel_equivalences' API anyway, I wonder if we should > rethink that altogether. Looking at the code now, I realize that > d25ea01275 resulted in horribly bastardizing that function's API, > because the parent_rel and appinfo arguments are only consulted in > some cases, while in other cases we disregard them and rely on > child_rel->top_parent_relids to figure out how to translate stuff. > It would be possible to make the argument list be just (root, child_rel) > and always rely on child_rel->top_parent_relids. Actually, as of 3373c71553, add_child_rel_equivalences() looks at parent_rel->eclass_indexes to look up ECs, so maybe we can't take out parent_rel. > At the very least, > if we keep the extra arguments, we should document them as being just > optimizations. For common cases that doesn't involve multi-level partitioning, it really helps to have the appinfos be supplied by the caller because they're already available. I've added a comment at the top about that. Attached updated patches. Thanks, Amit d25ea01275-fixup-HEAD-v2.patch Description: Binary data d25ea01275-fixup-PG12-v2.patch Description: Binary data
Re: [HACKERS] Block level parallel vacuum
On Mon, Oct 28, 2019 at 3:50 PM Amit Kapila wrote:
>
> On Sun, Oct 27, 2019 at 12:52 PM Dilip Kumar wrote:
> >
> > On Fri, Oct 25, 2019 at 9:19 PM Masahiko Sawada wrote:
> > >
> > I haven't yet read the new set of the patch. But I have noticed one
> > thing: we are getting the size of the statistics using the AM
> > routine, but we are copying those statistics from local memory to
> > shared memory directly using memcpy. Wouldn't it be a good idea to
> > have an AM-specific routine to copy them from local memory to shared
> > memory? I am not sure whether it is worth it, but my thought behind
> > this point is that it would allow the AM to keep its local stats in
> > any form (e.g. it could store a pointer in them) and serialize them
> > while copying to the shared stats. Later, when the shared stats are
> > passed back to the AM, it can deserialize them into its local form
> > and use them.
> >
> You have a point, but after changing the gist index, we don't have any
> current usage for indexes that need something like that. So, on one
> side there is some value in having an API to copy the stats, but on
> the other side, without a clear use for such an API, it might not be
> good to expose a new one. I think we can expose such an API in the
> future if there is a need for it. Do you or anyone know of any
> external IndexAM that has such a need?
>
> Few minor comments while glancing through the latest patchset.
>
> 1. I think you can merge the 0001*, 0002*, 0003* patches into one patch as
> all three expose a new variable/function from IndexAmRoutine.

Fixed.

>
> 2.
> +prepare_index_statistics(LVShared *lvshared, Relation *Irel, int nindexes)
> +{
> +	char *p = (char *) GetSharedIndStats(lvshared);
> +	int vac_work_mem = IsAutoVacuumWorkerProcess() &&
> +		autovacuum_work_mem != -1 ?
> +		autovacuum_work_mem : maintenance_work_mem;
>
> I think this function won't be called from AutoVacuumWorkerProcess, at
> least not as of now, so isn't it a better idea to have an Assert for
> it?

Fixed.

>
> 3.
> +void
> +heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
>
> This function is for performing a parallel operation on the index, so
> why start the name with heap?

Because parallel vacuum supports only indexes that are created on heaps.

> It is better to name it as
> index_parallel_vacuum_main or simply parallel_vacuum_main.

I'm concerned that both names, index_parallel_vacuum_main and parallel_vacuum_main, seem too generic even though this code is heap-specific.

>
> 4.
> /* useindex = true means two-pass strategy; false means one-pass */
> @@ -128,17 +280,12 @@ typedef struct LVRelStats
> 	BlockNumber pages_removed;
> 	double tuples_deleted;
> 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
> -	/* List of TIDs of tuples we intend to delete */
> -	/* NB: this list is ordered by TID address */
> -	int num_dead_tuples; /* current # of entries */
> -	int max_dead_tuples; /* # slots allocated in array */
> -	ItemPointer dead_tuples; /* array of ItemPointerData */
> +	LVDeadTuples *dead_tuples;
> 	int num_index_scans;
> 	TransactionId latestRemovedXid;
> 	bool lock_waiter_detected;
> } LVRelStats;
>
> -
> /* A few variables that don't seem worth passing around as parameters */
> static int elevel = -1;
>
> It seems like a spurious line removal.

Fixed.

The above comments are incorporated in the latest patch set (v32) [1].

[1] https://www.postgresql.org/message-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD%2BXOh_Q%40mail.gmail.com

Regards,

--
Masahiko Sawada
Re: Problem with synchronous replication
Hello.

At Wed, 30 Oct 2019 17:21:17 +0900, Fujii Masao wrote in
> This change causes every exiting backend to always take the exclusive lock
> even when it's not in the SyncRep queue. This may be problematic, for example,
> when terminating multiple backends at the same time? If yes,
> it might be better to check SHMQueueIsDetached() again after taking the lock.
> That is,

I'm not sure how much harm the extra lock acquisition does, but double-checked locking (releasing) is simple enough and reduces the possible congestion here, I think. In short, +1 for that.

> if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
> {
>     LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
>     if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
>         SHMQueueDelete(&(MyProc->syncRepLinks));
>     LWLockRelease(SyncRepLock);
> }

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Proposal: Global Index
A global index, by definition, is a single index on the parent table that maps to many underlying table partitions. The parent table itself does not have any underlying storage, so it must retrieve the data satisfying the index constraints from the underlying tables. In very crude terms, it is an accumulation of data from the table partitions, so that data spanning multiple partitions is accessed in one go, as opposed to querying each partition individually. For the initial version of this work, we are only considering building B-tree global indexes.

- Partitioned Index (Index Partitioning)

When global indexes become too large, they are partitioned to keep the performance and maintenance overhead manageable. These are not within the scope of this work.

- Local Index

A local index is an index that is local to a specific table partition; i.e. it doesn't span multiple partitions. So, when we create an index on a parent table, it will create a separate index on each of its partitions. Unfortunately, PostgreSQL uses the terminology "partitioned index" when it refers to local indexes. This work will fix this terminology in PostgreSQL so that the nomenclature remains consistent with other DBMSs.

- Why Do We Need a Global Index?

A global index is expected to give two very important upgrades to the partitioning feature set in PostgreSQL. It is expected to give a significant improvement in read performance for queries targeting multiple local indexes of partitions. It also adds a unique constraint across partitions.

- Unique Constraint

Data uniqueness is a critical requirement for building an index. For global indexes that span multiple partitions, uniqueness will have to be enforced on the index column(s). This effectively translates into a unique constraint.

- Performance

Currently, the pseudo index created on the parent table of partitions does not contain any data. Rather, it dereferences to the local indexes when an index search is required.
This means that multiple indexes have to be evaluated and the data combined thereafter. However, with global indexes, the data resides in the global index declared on the parent table. This avoids the need for multi-level index lookups, so read performance is expected to be significantly higher in many cases. There will, however, be a negative performance impact during writes (insert/update) of data. This is discussed in more detail later on.

- Creating a GLOBAL Index

- Syntax

A global index may be created with the addition of a "GLOBAL" keyword to the index statement. Alternatively, one could specify the "LOCAL" keyword to create local indexes on partitions. We suggest calling this set of keywords "partition_index_type". By default, partition_index_type will be set to LOCAL. Here is a sample of the create index syntax:

CREATE INDEX idx ON parent (columns) [GLOBAL | LOCAL];

Note: adding these options introduces no shift/reduce conflicts in the grammar.

- Pointing the Index to a Tuple

Currently, a CTID carries page and offset information for a known heap (table name). However, in the context of global indexes, this information within an index is insufficient. Since the index is expected to carry tuples from multiple partitions (heaps), the CTID alone will not be able to link an index node to a tuple. This requires additional data identifying the heap to be stored with each index node. How this should be implemented is a point to be discussed. A few possibilities are listed below:

-- Expand the CTID to include a relfilenode id. At PostgreSQL Conf Asia, Bruce suggested having the OID instead of the relfilenode, as relfilenodes can be duplicated across tablespaces.

-- Using the OID introduces another complication: we would need to query the catalog for the OID-to-heap mapping.

-- The second option is to have a variable-length CTID. We can reserve some top-level bits to distinguish a global CTID from a standard CTID. Robert Haas suggested at PostgreSQL EU discussing this with Peter Geoghegan.
-- I discussed it with Peter, and he believes that it is a very invasive approach that would require a whole lot of effort to get committed.

-- Heikki suggested including the heap-specific information using the INCLUDE keyword, so that the heap information is stored with each index node as data.

-- We (Peter and I) also discussed that option, and it looks like an easier and less invasive option.

- Optimizer

The challenge for the optimizer is the choice between local and global indexes when both are present. How do we:

-- Evaluate the cost of scanning a global index?

-- Decide when the LOCAL index should be preferred over the GLOBAL index and vice versa?

-- Decide whether to hit a GLOBAL index when the query targets only a couple of partitions?

-- We need to consider the sizes of the partitions being hit and the sizes of the partitions not being hit.

-- Bruce suggested that we prioritize a GLOBAL index in the first version so that in every case, the GLOBAL
Re: Problem with synchronous replication
On Wed, Oct 30, 2019 at 4:16 PM lingce.ldm wrote:
>
> On Oct 29, 2019, at 18:50, Kyotaro Horiguchi wrote:
> >
> > Hello.
> >
> > At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" wrote in
> > > Hi,
> > >
> > > I recently discovered two possible bugs about synchronous replication.
> > >
> > > 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
> > > SyncRepCleanupAtProcExit first checks whether the queue is detached; if it
> > > is not detached, it acquires the SyncRepLock lock and deletes it. If this
> > > element has already been deleted by the walsender, it will be deleted
> > > again, and SHMQueueDelete will crash with a segmentation fault.
> > >
> > > IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then
> > > check whether the queue is detached or not.
> >
> > I think you're right here.

This change causes every exiting backend to always take the exclusive lock even when it's not in the SyncRep queue. This may be problematic, for example, when terminating multiple backends at the same time? If yes, it might be better to check SHMQueueIsDetached() again after taking the lock. That is,

    if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
    {
        LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
        if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
            SHMQueueDelete(&(MyProc->syncRepLinks));
        LWLockRelease(SyncRepLock);
    }

Regards,

--
Fujii Masao
Re: Join Correlation Name
Bonjour Vik,

> Is quoting the spec good enough?
>
> SQL:2016 Part 2 Foundation, Section 7.10 <joined table>:
>
> <named columns join> ::=
>     USING <left paren> <join column list> <right paren>
>     [ AS <correlation name> ]
>
> <join column list> ::=
>     <column name list>

Ah, this is the one piece of information I did not have when reviewing Peter's patch.

> I think possibly what the spec says (and that neither my patch nor
> Peter's implements) is assigning the alias just to the
> <join column list>.

I think you are right: the alias applies only to the common columns. That solves the issue I raised about inaccessible attributes, and explains why it is only available with USING and no other join variants.

> So my original example query should actually be:
>
> SELECT a.x, b.y, j.z FROM a INNER JOIN b USING (z) AS j;

Yep, only z should be in j; it is really just about the USING clause.

--
Fabien.
Re: tableam vs. TOAST
Hi All,

While testing the Toast patch (PG + v7 patch) I found the server crash below.

System configuration:
VCPUs: 4, RAM: 8GB, Storage: 320GB

This issue is not frequently reproducible; we need to repeat the same testcase multiple times.

CREATE OR REPLACE FUNCTION toast_chunks_cnt_func(p1 IN text)
RETURNS int AS $$
DECLARE
  chunks_cnt int;
  v_tbl text;
BEGIN
  SELECT reltoastrelid::regclass INTO v_tbl FROM pg_class WHERE RELNAME = p1;
  EXECUTE 'SELECT count(*) FROM ' || v_tbl::regclass INTO chunks_cnt;
  RETURN chunks_cnt;
END;
$$ LANGUAGE PLPGSQL;

-- Server crash after multiple runs of the testcase below --
CHECKPOINT;
CREATE TABLE toast_tab (c1 text);
\d+ toast_tab
-- ALTER table column c1 for storage as "EXTERNAL" to make sure that the column value is pushed to the TOAST table but not COMPRESSED.
ALTER TABLE toast_tab ALTER COLUMN c1 SET STORAGE EXTERNAL;
\d+ toast_tab
\timing
INSERT INTO toast_tab ( select repeat('a', 20) from generate_series(1,4) x);
\timing
SELECT reltoastrelid::regclass FROM pg_class WHERE RELNAME = 'toast_tab';
SELECT toast_chunks_cnt_func('toast_tab') "Number of chunks";
SELECT pg_column_size(t1.*) FROM toast_tab t1 limit 1;
SELECT DISTINCT SUBSTR(c1, 9,10) FROM toast_tab;
CHECKPOINT;
\timing
UPDATE toast_tab SET c1 = UPPER(c1);
\timing
SELECT toast_chunks_cnt_func('toast_tab') "Number of chunks";
SELECT pg_column_size(t1.*) FROM toast_tab t1 limit 1;
SELECT DISTINCT SUBSTR(c1, 9,10) FROM toast_tab;
DROP TABLE toast_tab;
-- --

Stacktrace as below:

[centos@host-192-168-1-249 bin]$ gdb -q -c data2/core.3151 postgres
Reading symbols from /home/centos/PG/PGsrc/postgresql/inst/bin/postgres...done.
[New LWP 3151]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: checkpointer '.
Program terminated with signal 6, Aborted.
#0 0x7f2267d33207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.5.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-13.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x7f2267d33207 in raise () from /lib64/libc.so.6
#1 0x7f2267d348f8 in abort () from /lib64/libc.so.6
#2 0x00eb3a80 in errfinish (dummy=0) at elog.c:552
#3 0x00c26530 in ProcessSyncRequests () at sync.c:393
#4 0x00bbbc57 in CheckPointBuffers (flags=256) at bufmgr.c:2589
#5 0x00604634 in CheckPointGuts (checkPointRedo=51448358328, flags=256) at xlog.c:8992
#6 0x00603b5e in CreateCheckPoint (flags=256) at xlog.c:8781
#7 0x00aed8fa in CheckpointerMain () at checkpointer.c:481
#8 0x006240de in AuxiliaryProcessMain (argc=2, argv=0x7ffe887c0880) at bootstrap.c:461
#9 0x00b0e834 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5414
#10 0x00b09283 in reaper (postgres_signal_arg=17) at postmaster.c:2995
#11
#12 0x7f2267df1f53 in __select_nocancel () from /lib64/libc.so.6
#13 0x00b05000 in ServerLoop () at postmaster.c:1682
#14 0x00b0457b in PostmasterMain (argc=5, argv=0x349bce0) at postmaster.c:1391
#15 0x00971c9f in main (argc=5, argv=0x349bce0) at main.c:210
(gdb)

On Sat, Oct 5, 2019 at 12:03 AM Robert Haas wrote:
> On Fri, Sep 6, 2019 at 10:59 AM Robert Haas wrote:
> > On Thu, Sep 5, 2019 at 4:07 PM Andres Freund wrote:
> > > Yea, makes sense to me.
> >
> > OK, done. Here's the remaining patches again, with a slight update to
> > the renaming patch (now 0002). In the last version, I renamed
> > toast_insert_or_update to heap_toast_insert_or_update but did not
> > rename toast_delete to heap_toast_delete. Actually, I'm not seeing
> > any particular reason not to go ahead and push the renaming patch at
> > this point also.
>
> And, hearing no objections, done.
> > Here's the last patch back, rebased over that renaming. Although I > think that Andres (and Tom) are probably right that there's room for > improvement here, I currently don't see a way around the issues I > wrote about in > http://postgr.es/m/ca+tgmoa0zfcacpojcsspollgpztvfsyvcvb-uss8yokzmo5...@mail.gmail.com > -- so not quite sure where to go next. Hopefully Andres or someone > else will give me a quick whack with the cluebat if I'm missing > something obvious. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- With Regards, Prabhat Kumar Sahu Skype ID: prabhat.sahu1984 EnterpriseDB Software India Pvt. Ltd. The Postgres Database Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Oct 30, 2019 at 9:38 AM vignesh C wrote:
> I have noticed one more problem in the logic of setting the logical
> decoding work mem from the create subscription command. Suppose the
> subscription command does not specify the work mem; then it sends a
> garbage value to the walsender, and the walsender overwrites its value
> with that garbage value.

After investigating a bit I have found the reason for this.

@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 	appendStringInfo(&cmd, "proto_version '%u'",
 					 options->proto.logical.proto_version);
+	appendStringInfo(&cmd, ", work_mem '%d'",
+					 options->proto.logical.work_mem);

I think the problem is that we unconditionally send work_mem as part of the CREATE REPLICATION SLOT command, without checking whether it's valid or not.

--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;

Another problem is that there is no handling for the case where subform->subworkmem is NULL.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Add SQL function to show total block numbers in the relation
Hello,

I propose a new simple SQL function that shows the total number of blocks in a relation.

I am now reviewing this patch (https://commitfest.postgresql.org/25/2211/), and I think it is useful to know how many blocks there are in a relation to determine whether to use VACUUM RESUME or not. Of course, we can get this value with something like

select (pg_relation_size('t') / current_setting('block_size')::bigint)::int;

but I think that is a little complex.

Comments and feedback are very welcome.

Regards,
Yu Kimura

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 28eb322f3f..99834e7286 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21290,6 +21290,9 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 pg_relation_size
+
+pg_relation_block_number
+
 pg_size_bytes
@@ -21363,6 +21366,15 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 Shorthand for pg_relation_size(..., 'main')
+
+
+pg_relation_block_number(relation regclass)
+
+ bigint
+
+Shorthand for pg_relation_block_number(..., 'main')
+
+
 pg_size_bytes(text)
@@ -21504,6 +21516,11 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
+
+pg_relation_block_number accepts the OID or name of a table
+and returns the number of blocks of that relation.
+
+ pg_size_pretty can be used to format the result of one of the other functions in a human-readable way, using bytes, kB, MB, GB or TB
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index a87e7214e9..2462d65570 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -23,6 +23,7 @@
 #include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
+#include "storage/bufmgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/numeric.h"
@@ -335,6 +336,25 @@ pg_relation_size(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(size);
 }
+Datum
+pg_relation_block_number(PG_FUNCTION_ARGS)
+{
+	Oid			relOid = PG_GETARG_OID(0);
+	Relation	rel;
+	int64		size;
+
+	rel = try_relation_open(relOid, AccessShareLock);
+
+	if (rel == NULL)
+		PG_RETURN_NULL();
+
+	size = RelationGetNumberOfBlocks(rel);
+
+	relation_close(rel, AccessShareLock);
+
+	PG_RETURN_INT64(size);
+}
+
 /*
  * Calculate total on-disk size of a TOAST relation, including its indexes.
  * Must not be applied to non-TOAST relations.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 58ea5b982b..b68a523d3b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6929,6 +6929,10 @@
 descr => 'disk space usage for the specified fork of a table or index',
 proname => 'pg_relation_size', provolatile => 'v', prorettype => 'int8',
 proargtypes => 'regclass text', prosrc => 'pg_relation_size' },
+{ oid => '2228',
+  descr => 'total block number for the main fork of the specified table or index',
+  proname => 'pg_relation_block_number', provolatile => 'v', prorettype => 'int8',
+  proargtypes => 'regclass', prosrc => 'pg_relation_block_number' },
 { oid => '2286',
 descr => 'total disk space usage for the specified table and associated indexes',
 proname => 'pg_total_relation_size', provolatile => 'v', prorettype => 'int8',
Re: RFC: split OBJS lines to one object per line
On Tue, Oct 29, 2019 at 11:32:09PM -0700, Andres Freund wrote:
> Cool. Any opinion on whether to go for
>
> OBJS = \
> 	dest.o \
> 	fastpath.o \
> 	...
>
> or
>
> OBJS = dest.o \
> 	fastpath.o \
> 	...
>
> I'm mildly inclined to go for the former.

FWIW, I am more used to the latter, but the former is also fine by me.

--
Michael

signature.asc
Description: PGP signature
Re: Problem with synchronous replication
On Oct 30, 2019, at 09:45, Michael Paquier wrote:
>
> On Tue, Oct 29, 2019 at 07:50:01PM +0900, Kyotaro Horiguchi wrote:
>> At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" wrote in
>>> I recently discovered two possible bugs about synchronous replication.
>>>
>>> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
>>> SyncRepCleanupAtProcExit first checks whether the queue is detached; if it
>>> is not detached, it acquires the SyncRepLock lock and deletes it. If this
>>> element has been deleted by the walsender, it will be deleted again, and
>>> SHMQueueDelete will crash with a segmentation fault.
>>>
>>> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then
>>> check whether the queue is detached or not.
>>
>> I think you're right here.
>
> Indeed. Looking at the surroundings, we expect some code paths to hold
> SyncRepLock in exclusive or shared mode, but we don't actually check
> that the lock is held. So let's add some assertions while we're at it.
>
>> This is not right. It is in transaction commit so it is in a
>> HOLD_INTERRUPTS section. ProcessInterrupt does not respond to
>> cancel/die interrupts, thus the ereport should return.
>
> Yeah. There is an easy way to check after that: InterruptHoldoffCount
> needs to be strictly positive.
>
> My suggestions are attached. Any thoughts?

Thanks, this patch looks good to me.

smime.p7s
Description: S/MIME cryptographic signature
Re: Problem with synchronous replication
On Oct 29, 2019, at 18:50, Kyotaro Horiguchi wrote:
>
> Hello.
>
> At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" wrote in
>>
>> Hi,
>>
>> I recently discovered two possible bugs about synchronous replication.
>>
>> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
>> SyncRepCleanupAtProcExit first checks whether the queue is detached; if it
>> is not detached, it acquires the SyncRepLock lock and deletes it. If this
>> element has been deleted by the walsender, it will be deleted again, and
>> SHMQueueDelete will crash with a segmentation fault.
>>
>> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then
>> check whether the queue is detached or not.
>
> I think you're right here.

Thanks.

>
>> 2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport detects an
>> interrupt.
>> In SyncRepWaitForLSN, if a query cancel interrupt arrives, we just
>> terminate the wait with a suitable warning, as follows:
>>
>> a. set QueryCancelPending to false
>> b. ereport outputs one warning
>> c. call SyncRepCancelWait to delete one element from the queue
>>
>> If another cancel interrupt arrives while we are outputting the warning at
>> step b, errfinish will call CHECK_FOR_INTERRUPTS, which will output an
>> ERROR such as "canceling autovacuum task", and the process will jump to
>> the sigsetjmp. Unfortunately, step c will then be skipped and the element
>> that should have been deleted by SyncRepCancelWait remains.
>>
>> The easiest way to fix this is to swap the order of steps b and c. On the
>> other hand, letting sigsetjmp clean up the queue may also be a good
>> choice. What do you think?
>>
>> Attached the patch, any feedback is greatly appreciated.
>
> This is not right. It is in transaction commit so it is in a
> HOLD_INTERRUPTS section. ProcessInterrupt does not respond to
> cancel/die interrupts, thus the ereport should return.
I read the relevant code; you are right. I had called SyncRepWaitForLSN somewhere else, but forgot to put it inside a HOLD_INTERRUPTS section, which triggered an exception.

regards.

--
Dongming Liu

smime.p7s
Description: S/MIME cryptographic signature
Re: RFC: split OBJS lines to one object per line
Andres Freund writes:
> On 2019-10-29 16:31:11 -0400, Tom Lane wrote:
>> We did something similar not too long ago in configure.in (bfa6c5a0c),
>> and it seems to have helped. +1

> Cool. Any opinion on whether to go for ...

Not here.

regards, tom lane
Re: RFC: split OBJS lines to one object per line
Hi,

On 2019-10-29 16:31:11 -0400, Tom Lane wrote:
> Andres Freund writes:
> > one of the most frequent conflicts I see is that two patches add files
> > to OBJS (or one of its other spellings), and there are conflicts because
> > another file has been added.
> > ...
> > Now, obviously these types of conflicts are easy enough to resolve, but
> > it's still annoying. It seems that this would be substantially less
> > often a problem if we just split such lines to one file per line.
>
> We did something similar not too long ago in configure.in (bfa6c5a0c),
> and it seems to have helped. +1

Cool. Any opinion on whether to go for

OBJS = \
	dest.o \
	fastpath.o \
	...

or

OBJS = dest.o \
	fastpath.o \
	...

I'm mildly inclined to go for the former.

Greetings,

Andres Freund
Re: [HACKERS] Block level parallel vacuum
On Tue, Oct 29, 2019 at 3:11 PM Dilip Kumar wrote: > > On Tue, Oct 29, 2019 at 1:59 PM Masahiko Sawada wrote: > > > > On Tue, Oct 29, 2019 at 4:06 PM Masahiko Sawada > > wrote: > > > > > > On Mon, Oct 28, 2019 at 2:13 PM Dilip Kumar wrote: > > > > > > > > On Thu, Oct 24, 2019 at 4:33 PM Dilip Kumar > > > > wrote: > > > > > > > > > > On Thu, Oct 24, 2019 at 4:21 PM Amit Kapila > > > > > wrote: > > > > > > > > > > > > On Thu, Oct 24, 2019 at 11:51 AM Dilip Kumar > > > > > > wrote: > > > > > > > > > > > > > > On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar > > > > > > > wrote: > > > > > > > > > > > > > > > > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > I am thinking if we can write the patch for both the > > > > > > > > > approaches (a. > > > > > > > > > compute shared costs and try to delay based on that, b. try > > > > > > > > > to divide > > > > > > > > > the I/O cost among workers as described in the email > > > > > > > > > above[1]) and do > > > > > > > > > some tests to see the behavior of throttling, that might help > > > > > > > > > us in > > > > > > > > > deciding what is the best strategy to solve this problem, if > > > > > > > > > any. > > > > > > > > > What do you think? > > > > > > > > > > > > > > > > I agree with this idea. I can come up with a POC patch for > > > > > > > > approach > > > > > > > > (b). Meanwhile, if someone is interested to quickly hack with > > > > > > > > the > > > > > > > > approach (a) then we can do some testing and compare. > > > > > > > > Sawada-san, > > > > > > > > by any chance will you be interested to write POC with approach > > > > > > > > (a)? > > > > > > > > Otherwise, I will try to write it after finishing the first one > > > > > > > > (approach b). > > > > > > > > > > > > > > > I have come up with the POC for approach (a). 
> > > > > > > > > > Can we compute the overall throttling (sleep time) in the operation > > > > > > separately for heap and index, then divide the index's sleep_time > > > > > > with > > > > > > a number of workers and add it to heap's sleep time? Then, it will > > > > > > be > > > > > > a bit easier to compare the data between parallel and non-parallel > > > > > > case. > > > > I have come up with a patch to compute the total delay during the > > > > vacuum. So the idea of computing the total cost delay is > > > > > > > > Total cost delay = Total dealy of heap scan + Total dealy of > > > > index/worker; Patch is attached for the same. > > > > > > > > I have prepared this patch on the latest patch of the parallel > > > > vacuum[1]. I have also rebased the patch for the approach [b] for > > > > dividing the vacuum cost limit and done some testing for computing the > > > > I/O throttling. Attached patches 0001-POC-compute-total-cost-delay > > > > and 0002-POC-divide-vacuum-cost-limit can be applied on top of > > > > v31-0005-Add-paralell-P-option-to-vacuumdb-command.patch. I haven't > > > > rebased on top of v31-0006, because v31-0006 is implementing the I/O > > > > throttling with one approach and 0002-POC-divide-vacuum-cost-limit is > > > > doing the same with another approach. But, > > > > 0001-POC-compute-total-cost-delay can be applied on top of v31-0006 as > > > > well (just 1-2 lines conflict). > > > > > > > > Testing: I have performed 2 tests, one with the same size indexes and > > > > second with the different size indexes and measured total I/O delay > > > > with the attached patch. 
> > > >
> > > > Setup:
> > > > VacuumCostDelay=10ms
> > > > VacuumCostLimit=2000
> > > >
> > > > Test1 (Same size index):
> > > > create table test(a int, b varchar, c varchar);
> > > > create index idx1 on test(a);
> > > > create index idx2 on test(b);
> > > > create index idx3 on test(c);
> > > > insert into test select i, repeat('a',30)||i, repeat('a',20)||i from
> > > > generate_series(1,50) as i;
> > > > delete from test where a < 20;
> > > >
> > > >                Vacuum (Head)   Parallel Vacuum   Vacuum Cost Divide Patch
> > > > Total Delay    1784 (ms)       1398 (ms)         1938 (ms)
> > > >
> > > > Test2 (Variable size dead tuple in index)
> > > > create table test(a int, b varchar, c varchar);
> > > > create index idx1 on test(a);
> > > > create index idx2 on test(b) where a > 10;
> > > > create index idx3 on test(c) where a > 15;
> > > >
> > > > insert into test select i, repeat('a',30)||i, repeat('a',20)||i from
> > > > generate_series(1,50) as i;
> > > > delete from test where a < 20;
> > > >
> > > >                Vacuum (Head)   Parallel Vacuum   Vacuum Cost Divide Patch
> > > > Total Delay    1438 (ms)       1029 (ms)         1529 (ms)
> > > >
> > > > Conclusion:
> > > > 1. The tests prove that the total I/O delay is significantly less with the
Re: pg_waldump erroneously outputs newline for FPWs, and another minor bug
Hi, On 2019-10-29 16:33:41 -0700, Andres Freund wrote: > Hi, > > When using -b, --bkp-details pg_waldump outputs an unnecessary newline > for blocks that contain an FPW. > > In --bkp-details block references are output on their own lines, like: > > rmgr: SPGist len (rec/tot): 4348/ 4348, tx:980, lsn: > 0/01985818, prev 0/01983850, desc: PICKSPLIT ndel 92; nins 93 > blkref #0: rel 1663/16384/16967 fork main blk 3 > blkref #1: rel 1663/16384/16967 fork main blk 6 > blkref #2: rel 1663/16384/16967 fork main blk 5 > blkref #3: rel 1663/16384/16967 fork main blk 1 > rmgr: Heaplen (rec/tot): 69/69, tx:980, lsn: > 0/01986930, prev 0/01985818, desc: INSERT off 2 flags 0x00 > blkref #0: rel 1663/16384/16961 fork main blk 1 > > but unfortunately, when there's actually an FPW present, it looks like: > > rmgr: XLOGlen (rec/tot): 75/ 11199, tx:977, lsn: > 0/019755E0, prev 0/0194EDD8, desc: FPI > blkref #0: rel 1663/16384/16960 fork main blk 32 (FPW); hole: offset: > 548, length: 4484 > > blkref #1: rel 1663/16384/16960 fork main blk 33 (FPW); hole: offset: > 548, length: 4484 > > blkref #2: rel 1663/16384/16960 fork main blk 34 (FPW); hole: offset: > 548, length: 4484 > > rmgr: Heaplen (rec/tot):188/ 188, tx:977, lsn: > 0/019781D0, prev 0/019755E0, desc: INPLACE off 23 > > which clearly seems unnecessary. Looking at the code it seems to me that > > static void > XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record) > { > ... > printf("\tblkref #%u: rel %u/%u/%u fork %s blk %u", >block_id, >rnode.spcNode, rnode.dbNode, rnode.relNode, >forkNames[forknum], >blk); > if (XLogRecHasBlockImage(record, block_id)) > { > if (record->blocks[block_id].bimg_info & > BKPIMAGE_IS_COMPRESSED) > { > printf(" (FPW%s); hole: offset: %u, length: %u, " >"compression saved: %u\n", >XLogRecBlockImageApply(record, block_id) ? 
>"" : " for WAL verification", >record->blocks[block_id].hole_offset, >record->blocks[block_id].hole_length, >BLCKSZ - >record->blocks[block_id].hole_length - >record->blocks[block_id].bimg_len); > } > else > { > printf(" (FPW%s); hole: offset: %u, length: %u\n", >XLogRecBlockImageApply(record, block_id) ? >"" : " for WAL verification", >record->blocks[block_id].hole_offset, >record->blocks[block_id].hole_length); > } > } > putchar('\n'); > > was intended to not actually print a newline in the printfs in the if > preceding the putchar. > > This is a fairly longstanding bug, introduced in: > > commit 2c03216d831160bedd72d45f712601b6f7d03f1c > Author: Heikki Linnakangas > Date: 2014-11-20 17:56:26 +0200 > > Revamp the WAL record format. > > > Does anybody have an opinion about fixing it just in master or also > backpatching it? I guess there could be people having written parsers > for the waldump output? I'm inclined to backpatch. > > > I also find a second minor bug: > > static void > XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record) > { > ... > const char *id; > ... > id = desc->rm_identify(info); > if (id == NULL) > id = psprintf("UNKNOWN (%x)", info & ~XLR_INFO_MASK); > ... > printf("desc: %s ", id); > > after that "id" is not referenced anymore. Which means we would leak > memory if there were a lot of UNKNOWN records. This is from > commit 604f7956b946019dd37bd3baea24cb669a47 > Author: Andres Freund > Date: 2014-09-22 16:48:14 +0200 > > Improve code around the recently added rm_identify rmgr callback. > > While not a lot of memory, it's not absurd to run pg_waldump against a > large amount of WAL, so backpatching seems mildly advised. > > I'm inlined to think that the best fix is to just move the relevant code > to the callsite, and not psprintf'ing into a temporary buffer. We'd need > additional state to free the memory, as rm_identify returns a static > buffer. 
>
> So I'll make it
>
> id = desc->rm_identify(info);
> if (id == NULL)
>     printf("desc: UNKNOWN (%x) ", info & ~XLR_INFO_MASK);
> else
>     printf("desc: %s ", id);

Pushed fixes for these.

Greetings,

Andres Freund