date:20200610

On Thu, 11 Jun 2020 at 16:03, Amit Kapila  wrote:
> I think something on these lines would be a good idea especially
> keeping step-size proportional to relation size.  However, I am not
> completely sure if doubling the step-size with equal increase in
> relation size (ex. what is happening between 16MB~8192MB) is the best
> idea.  Why not double the step-size when relation size increases by
> four times?  Will some more tests help us to identify this?  I also
> don't know what is the right answer here so just trying to brainstorm.

Brainstorming sounds good. I'm by no means under any illusion that the
formula is correct.

But, why four times?  The way I did it tries to keep the number of
chunks roughly the same each time. I think the key is the number of
chunks more than the size of the chunks. Having fewer chunks increases
the chances of an imbalance of work between workers, and with what you
mention, the number of chunks will vary more than what I have proposed

The code I showed above will produce something between 512-1024 chunks
for all cases until we 2^20 pages, then we start capping the chunk
size to 1024.  I could probably get onboard with making it depend on
the number of parallel workers, but perhaps it would be better just to
divide by, say, 16384 rather than 1024, as I proposed above. That way
we'll be more fine-grained, but we'll still read in larger than 1024
chunk sizes when the relation gets beyond 128GB.

David

Re: Parallel Seq Scan vs kernel read ahead

On Thu, Jun 11, 2020 at 8:35 AM David Rowley  wrote:
>
> On Thu, 11 Jun 2020 at 14:09, Amit Kapila  wrote:
> >
> > On Thu, Jun 11, 2020 at 7:18 AM David Rowley  wrote:
> > >
> > > On Thu, 11 Jun 2020 at 01:24, Amit Kapila  wrote:
> > > > Can we try the same test with 4, 8, 16 workers as well?  I don't
> > > > foresee any problem with a higher number of workers but it might be
> > > > better to once check that if it is not too much additional work.
> > >
> > > I ran the tests again with up to 7 workers. The CPU here only has 8
> > > cores (a laptop), so I'm not sure if there's much sense in going
> > > higher than that?
> > >
> >
> > I think it proves your point that there is a value in going for step
> > size greater than 64.  However, I think the difference at higher sizes
> > is not significant.  For example, the difference between 8192 and
> > 16384 doesn't seem much if we leave higher worker count where the data
> > could be a bit misleading due to variation.  I could see that there is
> > a clear and significant difference till 1024 but after that difference
> > is not much.
>
> I guess the danger with going too big is that we have some Seqscan
> filter that causes some workers to do very little to nothing with the
> rows, despite discarding them and other workers are left with rows
> that are not filtered and require some expensive processing.  Keeping
> the number of blocks on the smaller side would reduce the chances of
> someone being hit by that.
>

Right and good point.

>   The algorithm I proposed above still can
> be capped by doing something like nblocks = Min(1024,
> pg_nextpower2_32(pbscan->phs_nblocks / 1024));  That way we'll end up
> with:
>

I think something on these lines would be a good idea especially
keeping step-size proportional to relation size.  However, I am not
completely sure if doubling the step-size with equal increase in
relation size (ex. what is happening between 16MB~8192MB) is the best
idea.  Why not double the step-size when relation size increases by
four times?  Will some more tests help us to identify this?  I also
don't know what is the right answer here so just trying to brainstorm.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speedup usages of pg_*toa() functions

On Wed, 10 Jun 2020 at 11:57, David Rowley  wrote:
>
> On Tue, 9 Jun 2020 at 22:08, Andrew Gierth  
> wrote:
> >
> > > "David" == David Rowley  writes:
> >
> >  David> This allows us to speed up a few cases. int2vectorout() should
> >  David> be faster and int8out() becomes a bit faster if we get rid of
> >  David> the strdup() call and replace it with a palloc()/memcpy() call.
> >
> > What about removing the memcpy entirely? I don't think we save anything
> > much useful here by pallocing the exact length, rather than doing what
> > int4out does and palloc a fixed size and convert the int directly into
> > it.
>
> The attached 0001 patch does this.

Pending any objections, I'd like to push both of these patches in the
next few days to master.

Anyone object to changing the signature of these functions in 0002, or
have concerns about allocating the maximum memory that we might
require in int8out()?

David

Re: Resetting spilled txn statistics in pg_stat_replication

On Thu, Jun 11, 2020 at 12:35 AM Magnus Hagander  wrote:
>
> On Wed, Jun 10, 2020 at 9:01 AM Masahiko Sawada 
>  wrote:
>>
>
>> If we move these values to replication slots, I think we can change
>> the code so that these statistics are managed by replication slots
>> (e.g. ReplicationSlot struct). Once ReplicationSlot has these values,
>> we can keep them beyond reconnections or multiple calls of SQL
>> interface functions. Of course, these values don’t need to be
>> persisted.
>
>
> Eh, why should they not be persisted?
>

Because these stats are corresponding to a ReoderBuffer (logical
decoding) which will be allocated fresh after restart and have no
relation with what has happened before restart.

Now, thinking about this again, I am not sure if these stats are
directly related to slots. These are stats for logical decoding which
can be performed either via WALSender or decoding plugin (via APIs).
So, why not have them displayed in a new view like pg_stat_logical (or
pg_stat_logical_decoding/pg_stat_logical_replication)?   In future, we
will need to add similar stats for streaming of in-progress
transactions as well (see patch 0007-Track-statistics-for-streaming at
[1]), so having a separate view for these doesn't sound illogical.

> The comparison would be temp_files and temp_bytes in pg_stat_database, and 
> those *are* persisted.

I am not able to see a one-on-one mapping of those stats with what is
being discussed here.

[1] - 
https://www.postgresql.org/message-id/CAFiTN-vXQx_161WC-a9HvNaF25nwO%3DJJRpRdTtyfGQHbM3Bd1Q%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Recording test runtimes with the buildfarm

On Thu, 11 Jun 2020 at 10:02, Tom Lane  wrote:
>
> Thomas Munro  writes:
> > I've been doing that in a little database that pulls down the results
> > and analyses them with primitive regexes.  First I wanted to know the
> > pass/fail history for each individual regression, isolation and TAP
> > script, then I wanted to build something that could identify tests
> > that are 'flapping', and work out when the started and stopped
> > flapping etc.  I soon realised it was all too noisy, but then I
> > figured that I could fix that by detecting crashes.  So I classify
> > every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
> > top level run was CRASH, than I can disregard the individual per
> > script results, because they're all BS.
>
> If you can pin the crash on a particular test script, it'd be useful
> to track that as a kind of failure.  In general, though, both crashes
> and non-crash failures tend to cause collateral damage to later test
> scripts --- if you can't filter that out then the later scripts will
> have high false-positive rates.

I guess the fact that you've both needed to do analysis on individual
tests shows that there might be a call for this beyond just recording
the test's runtime.

If we had a table that stored the individual test details, pass/fail
and just stored the timing information along with that, then, even if
the timing was unstable, it could still be useful for some analysis.
I'd be happy enough even if that was only available as a csv file
download.  I imagine the buildfarm does not need to provide us with
any tools for doing analysis on this. Ideally, there would be some
run_id that we could link it back to the test run which would give us
the commit SHA, and the animal that it ran on. Joining to details
about the animal could be useful too, e.g perhaps a certain test
always fails on 32-bit machines.

I suppose that maybe we could modify pg_regress to add a command line
option to have it write out a machine-readable file, e.g:
testname,result,runtime\n, then just have the buildfarm client ship
that off to the buildfarm server to record in the database.

David

Re: Parallel Seq Scan vs kernel read ahead

On Thu, 11 Jun 2020 at 14:09, Amit Kapila  wrote:
>
> On Thu, Jun 11, 2020 at 7:18 AM David Rowley  wrote:
> >
> > On Thu, 11 Jun 2020 at 01:24, Amit Kapila  wrote:
> > > Can we try the same test with 4, 8, 16 workers as well?  I don't
> > > foresee any problem with a higher number of workers but it might be
> > > better to once check that if it is not too much additional work.
> >
> > I ran the tests again with up to 7 workers. The CPU here only has 8
> > cores (a laptop), so I'm not sure if there's much sense in going
> > higher than that?
> >
>
> I think it proves your point that there is a value in going for step
> size greater than 64.  However, I think the difference at higher sizes
> is not significant.  For example, the difference between 8192 and
> 16384 doesn't seem much if we leave higher worker count where the data
> could be a bit misleading due to variation.  I could see that there is
> a clear and significant difference till 1024 but after that difference
> is not much.

I guess the danger with going too big is that we have some Seqscan
filter that causes some workers to do very little to nothing with the
rows, despite discarding them and other workers are left with rows
that are not filtered and require some expensive processing.  Keeping
the number of blocks on the smaller side would reduce the chances of
someone being hit by that.   The algorithm I proposed above still can
be capped by doing something like nblocks = Min(1024,
pg_nextpower2_32(pbscan->phs_nblocks / 1024));  That way we'll end up
with:


 rel_size | stepsize
--+--
 16 kB|1
 32 kB|1
 64 kB|1
 128 kB   |1
 256 kB   |1
 512 kB   |1
 1024 kB  |1
 2048 kB  |1
 4096 kB  |1
 8192 kB  |1
 16 MB|2
 32 MB|4
 64 MB|8
 128 MB   |   16
 256 MB   |   32
 512 MB   |   64
 1024 MB  |  128
 2048 MB  |  256
 4096 MB  |  512
 8192 MB  | 1024
 16 GB| 1024
 32 GB| 1024
 64 GB| 1024
 128 GB   | 1024
 256 GB   | 1024
 512 GB   | 1024
 1024 GB  | 1024
 2048 GB  | 1024
 4096 GB  | 1024
 8192 GB  | 1024
 16 TB| 1024
 32 TB| 1024
(32 rows)

David

Re: [Proposal] Global temporary tables

2020-06-10 Thread 曾文旌


> 2020年6月9日 下午8:15，Prabhat Sahu  写道：
> 
> 
> 
> On Wed, Apr 29, 2020 at 8:52 AM 曾文旌  > wrote:
>> 2020年4月27日 下午9:48，Prabhat Sahu > > 写道：
>> 
>> Thanks Wenjing, for the fix patch for previous issues.
>> I have verified the issues, now those fix look good to me.
>> But the below error message is confusing(for gtt2).
>> 
>> postgres=# drop table gtt1;
>> ERROR:  cannot drop global temp table gtt1 when other backend attached it.
>> 
>> postgres=# drop table gtt2;
>> ERROR:  cannot drop index idx2 on global temp table gtt2 when other backend 
>> attached it.
>> 
>> I feel the above error message shown for "DROP TABLE gtt2;" is a bit 
>> confusing(looks similar to DROP INDEX gtt2;).
>> If possible, can we keep the error message simple as "ERROR:  cannot drop 
>> global temp table gtt2 when other backend attached it."?
>> I mean, without giving extra information for the index attached to that GTT.
> Fixed the error message to make the expression more accurate. In v33.
>  
> Thanks Wenjing. We verified your latest patch(gtt_v33) focusing on all 
> reported issues and they work fine. 
> Thanks.
> -- 

I'm very glad to hear such good news.
I am especially grateful for your professional work on GTT.
Please feel free to let me know if there is anything you think could be 
improved.


Thanks.


Wenjing

> With Regards,
> Prabhat Kumar Sahu
> EnterpriseDB: http://www.enterprisedb.com 



smime.p7s
Description: S/MIME cryptographic signature

Re: Parallel Seq Scan vs kernel read ahead

On Thu, Jun 11, 2020 at 7:18 AM David Rowley  wrote:
>
> On Thu, 11 Jun 2020 at 01:24, Amit Kapila  wrote:
> > Can we try the same test with 4, 8, 16 workers as well?  I don't
> > foresee any problem with a higher number of workers but it might be
> > better to once check that if it is not too much additional work.
>
> I ran the tests again with up to 7 workers. The CPU here only has 8
> cores (a laptop), so I'm not sure if there's much sense in going
> higher than that?
>

I think it proves your point that there is a value in going for step
size greater than 64.  However, I think the difference at higher sizes
is not significant.  For example, the difference between 8192 and
16384 doesn't seem much if we leave higher worker count where the data
could be a bit misleading due to variation.  I could see that there is
a clear and significant difference till 1024 but after that difference
is not much.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel Seq Scan vs kernel read ahead

On Thu, 11 Jun 2020 at 01:24, Amit Kapila  wrote:
> Can we try the same test with 4, 8, 16 workers as well?  I don't
> foresee any problem with a higher number of workers but it might be
> better to once check that if it is not too much additional work.

I ran the tests again with up to 7 workers. The CPU here only has 8
cores (a laptop), so I'm not sure if there's much sense in going
higher than that?

CPU = Intel i7-8565U. 16GB RAM.

Note that I did the power2 ramp-up with each of the patched tests this
time. Thomas' version ramps up 1 pages at a time, which is ok when
only ramping up to 64 pages, but not for these higher numbers I'm
testing with. (Patch attached)

Results attached in a graph format, or in text below:

Master:

workers=0: Time: 141175.935 ms (02:21.176) (742.7MB/sec)
workers=1: Time: 316854.538 ms (05:16.855) (330.9MB/sec)
workers=2: Time: 323471.791 ms (05:23.472) (324.2MB/sec)
workers=3: Time: 321637.945 ms (05:21.638) (326MB/sec)
workers=4: Time: 308689.599 ms (05:08.690) (339.7MB/sec)
workers=5: Time: 289014.709 ms (04:49.015) (362.8MB/sec)
workers=6: Time: 267785.27 ms (04:27.785) (391.6MB/sec)
workers=7: Time: 248735.817 ms (04:08.736) (421.6MB/sec)

Patched 64: (power 2 ramp-up)

workers=0: Time: 152752.558 ms (02:32.753) (686.5MB/sec)
workers=1: Time: 149940.841 ms (02:29.941) (699.3MB/sec)
workers=2: Time: 136534.043 ms (02:16.534) (768MB/sec)
workers=3: Time: 119387.248 ms (01:59.387) (878.3MB/sec)
workers=4: Time: 114080.131 ms (01:54.080) (919.2MB/sec)
workers=5: Time: 111472.144 ms (01:51.472) (940.7MB/sec)
workers=6: Time: 108290.608 ms (01:48.291) (968.3MB/sec)
workers=7: Time: 104349.947 ms (01:44.350) (1004.9MB/sec)

Patched 1024: (power 2 ramp-up)

workers=0: Time: 146106.086 ms (02:26.106) (717.7MB/sec)
workers=1: Time: 109832.773 ms (01:49.833) (954.7MB/sec)
workers=2: Time: 98921.515 ms (01:38.922) (1060MB/sec)
workers=3: Time: 94259.243 ms (01:34.259) (1112.4MB/sec)
workers=4: Time: 93275.637 ms (01:33.276) (1124.2MB/sec)
workers=5: Time: 93921.452 ms (01:33.921) (1116.4MB/sec)
workers=6: Time: 93988.386 ms (01:33.988) (1115.6MB/sec)
workers=7: Time: 92096.414 ms (01:32.096) (1138.6MB/sec)

Patched 8192: (power 2 ramp-up)

workers=0: Time: 143367.057 ms (02:23.367) (731.4MB/sec)
workers=1: Time: 103138.918 ms (01:43.139) (1016.7MB/sec)
workers=2: Time: 93368.573 ms (01:33.369) (1123.1MB/sec)
workers=3: Time: 89464.529 ms (01:29.465) (1172.1MB/sec)
workers=4: Time: 89921.393 ms (01:29.921) (1166.1MB/sec)
workers=5: Time: 93575.401 ms (01:33.575) (1120.6MB/sec)
workers=6: Time: 93636.584 ms (01:33.637) (1119.8MB/sec)
workers=7: Time: 93682.21 ms (01:33.682) (1119.3MB/sec)

Patched 16384 (power 2 ramp-up)

workers=0: Time: 144598.502 ms (02:24.599) (725.2MB/sec)
workers=1: Time: 97344.16 ms (01:37.344) (1077.2MB/sec)
workers=2: Time: 88025.42 ms (01:28.025) (1191.2MB/sec)
workers=3: Time: 97711.521 ms (01:37.712) (1073.1MB/sec)
workers=4: Time: 88877.913 ms (01:28.878) (1179.8MB/sec)
workers=5: Time: 96985.978 ms (01:36.986) (1081.2MB/sec)
workers=6: Time: 92368.543 ms (01:32.369) (1135.2MB/sec)
workers=7: Time: 87498.156 ms (01:27.498) (1198.4MB/sec)

David


parallel_step_size.patch
Description: Binary data

Re: hashagg slowdown due to spill changes

On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> Hi,
> 
> While preparing my pgcon talk I noticed that our hash-agg performance
> degraded noticeably. Looks to me like it's due to the spilling-
> hashagg
> changes.

Attached a proposed fix. (Might require some minor cleanup).

The only awkward part is that LookupTupleHashEntry() needs a new out
parameter to pass the hash value back to the caller. Ordinarily, the
caller can get that from the returned entry, but if isnew==NULL, then
the function might return NULL (and the caller wouldn't have an entry
from which to read the hash value).

Regards,
Jeff Davis

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 8be36ca7634..27dbf264766 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -24,9 +24,10 @@
 static int	TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
 static uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
 		  const MinimalTuple tuple);
-static TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
-	TupleTableSlot *slot,
-	bool *isnew, uint32 hash);
+static inline TupleHashEntry LookupTupleHashEntry_internal(
+	TupleHashTable hashtable,
+	TupleTableSlot *slot,
+	bool *isnew, uint32 hash);
 
 /*
  * Define parameters for tuple hash table code generation. The interface is
@@ -291,6 +292,9 @@ ResetTupleHashTable(TupleHashTable hashtable)
  * If isnew is NULL, we do not create new entries; we return NULL if no
  * match is found.
  *
+ * If hash is not NULL, we set it to the calculated hash value. This allows
+ * callers access to the hash value even if no entry is returned.
+ *
  * If isnew isn't NULL, then a new entry is created if no existing entry
  * matches.  On return, *isnew is true if the entry is newly created,
  * false if it existed already.  ->additional_data in the new entry has
@@ -298,11 +302,11 @@ ResetTupleHashTable(TupleHashTable hashtable)
  */
 TupleHashEntry
 LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
-	 bool *isnew)
+	 bool *isnew, uint32 *hash)
 {
 	TupleHashEntry entry;
 	MemoryContext oldContext;
-	uint32		hash;
+	uint32		local_hash;
 
 	/* Need to run the hash functions in short-lived context */
 	oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
@@ -312,8 +316,12 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
 	hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
 	hashtable->cur_eq_func = hashtable->tab_eq_func;
 
-	hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
-	entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+	local_hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
+	if (hash != NULL)
+		*hash = local_hash;
+
+	entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, local_hash);
+	Assert(entry == NULL || entry->hash == local_hash);
 
 	MemoryContextSwitchTo(oldContext);
 
@@ -362,6 +370,7 @@ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
 	hashtable->cur_eq_func = hashtable->tab_eq_func;
 
 	entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, hash);
+	Assert(entry == NULL || entry->hash == hash);
 
 	MemoryContextSwitchTo(oldContext);
 
@@ -480,7 +489,7 @@ TupleHashTableHash_internal(struct tuplehash_hash *tb,
  * NB: This function may or may not change the memory context. Caller is
  * expected to change it back.
  */
-static TupleHashEntry
+static inline TupleHashEntry
 LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
 			  bool *isnew, uint32 hash)
 {
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 331acee2814..0447d473c82 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -403,8 +403,9 @@ static int	hash_choose_num_partitions(uint64 input_groups,
 	   double hashentrysize,
 	   int used_bits,
 	   int *log2_npartittions);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash,
-		  bool *in_hash_table);
+static void initialize_hash_entry(AggState *aggstate,
+  TupleHashTable hashtable,
+  TupleHashEntry entry);
 static void lookup_hash_entries(AggState *aggstate);
 static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
 static void agg_fill_hash_table(AggState *aggstate);
@@ -1979,75 +1980,39 @@ hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
 }
 
 /*
- * Find or create a hashtable entry for the tuple group containing the current
- * tuple (already set in tmpcontext's outertuple slot), in the current grouping
- * set (which the caller must have selected - note that initialize_aggregate
- * depends on this).
- *
- * When called, CurrentMemoryContext should be the per-query context. The
- * already-calculated hash value for the tuple must be specified.
- *
-

Re: Default setting for enable_hashagg_disk

2020-06-10 Thread Melanie Plageman

On Wed, Jun 10, 2020 at 10:39 AM Jeff Davis  wrote:

> On Tue, 2020-06-09 at 18:20 -0700, Melanie Plageman wrote:
> > So, I was catching up on email and noticed the last email in this
> > thread.
> >
> > I think I am not fully understanding what
> > enable_groupingsets_hash_disk
> > does. Is it only for testing?
>
> It's mostly for testing. I could imagine cases where it would be useful
> to force groupingsets to use the disk, but I mainly wanted the setting
> there for testing the grouping sets hash disk code path.
>
> > Using the tests you added to src/test/regress/sql/groupingsets.sql, I
> > did get a plan that looks like hashagg is spilling to disk (goes
> > through
>
> I had something that worked as a test for a while, but then when I
> tweaked the costing, it started using the Sort path (therefore not
> testing my grouping sets hash disk code at all) and a bug crept in. So
> I thought it would be best to have a more forceful knob.
>
> Perhaps I should just get rid of that GUC and use the stats trick?
>
>
I like the idea of doing the stats trick. For extra security, you could
throw in that other trick that is used in groupingsets.sql and make some
of the grouping columns unhashable and some unsortable so you know that
you will not pick only the Sort Path and do just a GroupAgg.

This slight modification of my previous example will probably yield
consistent results:

set enable_hashagg_disk = true;
SET enable_groupingsets_hash_disk = false;
SET work_mem='64kB';
SET enable_hashagg = true;
drop table if exists gs_hash_1;
create table gs_hash_1 as
  select g%1000 as g1000, g%100 as g100, g%10 as g10, g,
  g::text::xid as g_unsortable, g::bit(4) as g_unhashable
   from generate_series(0,19) g;
analyze gs_hash_1;

alter table gs_hash_1 set (autovacuum_enabled = 'false');
update pg_class set reltuples = 10 where relname = 'gs_hash_1';

explain (analyze, costs off, timing off)
select g1000, g100, g10
  from gs_hash_1
  group by grouping sets ((g1000,g100), (g10, g_unhashable), (g100,
g_unsortable));

   QUERY PLAN

 MixedAggregate (actual rows=201080 loops=1)
   Hash Key: g100, g_unsortable
   Group Key: g1000, g100
   Sort Key: g10, g_unhashable
 Group Key: g10, g_unhashable
   Peak Memory Usage: 109 kB
   Disk Usage: 13504 kB
   HashAgg Batches: 10111
   ->  Sort (actual rows=20 loops=1)
 Sort Key: g1000, g100
 Sort Method: external merge  Disk: 9856kB
 ->  Seq Scan on gs_hash_1 (actual rows=20 loops=1)

While we are on the topic of the tests, I was wondering if you had
considered making a user defined type that had a lot of padding so that
the tests could use fewer rows. I did this for adaptive hashjoin and it
helped me with iteration time.
I don't know if that would still be the kind of test you are looking for
since a user probably wouldn't have a couple hundred really fat
untoasted tuples, but, I just thought I would check if that would be
useful.

-- 
Melanie Plageman

Re: Recording test runtimes with the buildfarm

Thomas Munro  writes:
> I've been doing that in a little database that pulls down the results
> and analyses them with primitive regexes.  First I wanted to know the
> pass/fail history for each individual regression, isolation and TAP
> script, then I wanted to build something that could identify tests
> that are 'flapping', and work out when the started and stopped
> flapping etc.  I soon realised it was all too noisy, but then I
> figured that I could fix that by detecting crashes.  So I classify
> every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
> top level run was CRASH, than I can disregard the individual per
> script results, because they're all BS.

If you can pin the crash on a particular test script, it'd be useful
to track that as a kind of failure.  In general, though, both crashes
and non-crash failures tend to cause collateral damage to later test
scripts --- if you can't filter that out then the later scripts will
have high false-positive rates.

regards, tom lane

Re: Recording test runtimes with the buildfarm

2020-06-10 Thread Andres Freund

Hi, 

On June 10, 2020 2:13:51 PM PDT, David Rowley  wrote:
>On Thu, 11 Jun 2020 at 02:13, Tom Lane  wrote:
>> I have in the past scraped the latter results and tried to make sense
>of
>> them.  They are *mighty* noisy, even when considering just one animal
>> that I know to be running on a machine with little else to do.
>
>Do you recall if you looked at the parallel results or the serially
>executed ones?
>
>I imagine that the parallel ones will have much more noise since we
>run the tests up to 20 at a time. I imagine probably none, or at most
>not many of the animals have enough CPU cores not to be context
>switching a lot during those the parallel runs.  I thought the serial
>ones would be better but didn't have an idea of they'd be good enough
>to be useful.

I'd assume that a rolling average (maybe 10 runs or so) would hide noise enough 
to see at least some trends even for parallel runs.

We should be able to prototype this with a few queries over the bf database, 
right?

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Recording test runtimes with the buildfarm

2020-06-10 Thread Thomas Munro

On Thu, Jun 11, 2020 at 9:43 AM Thomas Munro  wrote:
> On Thu, Jun 11, 2020 at 2:13 AM Tom Lane  wrote:
> > I have in the past scraped the latter results and tried to make sense of
> > them.  They are *mighty* noisy, even when considering just one animal
> > that I know to be running on a machine with little else to do.  Maybe
> > averaging across the whole buildfarm could reduce the noise level, but
> > I'm not very hopeful.  Per-test-script times would likely be even
> > noisier (ISTM anyway, maybe I'm wrong).
>
> I've been doing that in a little database that pulls down the results
> and analyses them with primitive regexes.  First I wanted to know the
> pass/fail history for each individual regression, isolation and TAP
> script, then I wanted to build something that could identify tests
> that are 'flapping', and work out when the started and stopped
> flapping etc.  I soon realised it was all too noisy, but then I
> figured that I could fix that by detecting crashes.  So I classify
> every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
> top level run was CRASH, than I can disregard the individual per
> script results, because they're all BS.

With more coffee I realise that you were talking about noise times,
not noisy pass/fail results.  But I still want to throw that idea out
there, if we're considering analysing the logs.

Re: Recording test runtimes with the buildfarm

David Rowley  writes:
> On Thu, 11 Jun 2020 at 02:13, Tom Lane  wrote:
>> I have in the past scraped the latter results and tried to make sense of
>> them.  They are *mighty* noisy, even when considering just one animal
>> that I know to be running on a machine with little else to do.

> Do you recall if you looked at the parallel results or the serially
> executed ones?

> I imagine that the parallel ones will have much more noise since we
> run the tests up to 20 at a time. I imagine probably none, or at most
> not many of the animals have enough CPU cores not to be context
> switching a lot during those the parallel runs.  I thought the serial
> ones would be better but didn't have an idea of they'd be good enough
> to be useful.

I can't claim to recall specifically, but I agree with your theory
about that, so I probably looked at the serial-schedule case.

Note that this is moot for animals using use_installcheck_parallel
... but it looks like that's still just a minority of them.

regards, tom lane

Re: Recording test runtimes with the buildfarm

2020-06-10 Thread Thomas Munro

On Thu, Jun 11, 2020 at 2:13 AM Tom Lane  wrote:
> I have in the past scraped the latter results and tried to make sense of
> them.  They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do.  Maybe
> averaging across the whole buildfarm could reduce the noise level, but
> I'm not very hopeful.  Per-test-script times would likely be even
> noisier (ISTM anyway, maybe I'm wrong).

I've been doing that in a little database that pulls down the results
and analyses them with primitive regexes.  First I wanted to know the
pass/fail history for each individual regression, isolation and TAP
script, then I wanted to build something that could identify tests
that are 'flapping', and work out when the started and stopped
flapping etc.  I soon realised it was all too noisy, but then I
figured that I could fix that by detecting crashes.  So I classify
every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
top level run was CRASH, than I can disregard the individual per
script results, because they're all BS.

Re: Recording test runtimes with the buildfarm

On Thu, 11 Jun 2020 at 02:13, Tom Lane  wrote:
> I have in the past scraped the latter results and tried to make sense of
> them.  They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do.

Do you recall if you looked at the parallel results or the serially
executed ones?

I imagine that the parallel ones will have much more noise since we
run the tests up to 20 at a time. I imagine probably none, or at most
not many of the animals have enough CPU cores not to be context
switching a lot during those the parallel runs.  I thought the serial
ones would be better but didn't have an idea of they'd be good enough
to be useful.

David

Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-10 Thread Magnus Hagander

On Wed, Jun 10, 2020 at 9:01 AM Masahiko Sawada <
masahiko.saw...@2ndquadrant.com> wrote:

> On Tue, 9 Jun 2020 at 17:24, Magnus Hagander  wrote:
> >
> >
> >
> > On Tue, Jun 9, 2020 at 9:12 AM Masahiko Sawada <
> masahiko.saw...@2ndquadrant.com> wrote:
> >>
> >> On Tue, 2 Jun 2020 at 18:34, Amit Kapila 
> wrote:
> >> >
> >> > On Tue, Jun 2, 2020 at 11:48 AM Masahiko Sawada
> >> >  wrote:
> >> > >
> >> > > Hi all,
> >> > >
> >> > > Tracking of spilled transactions has been introduced to PG13. These
> >> > > new statistics values, spill_txns, spill_count, and spill_bytes, are
> >> > > cumulative total values unlike other statistics values in
> >> > > pg_stat_replication. How can we reset these values? We can reset
> >> > > statistics values in other statistics views using by
> >> > > pg_stat_reset_shared(), pg_stat_reset() and so on. It seems to me
> that
> >> > > the only option to reset spilled transactions is to restart logical
> >> > > replication but it's surely high cost.
> >
> >
> > You just have to "bounce" the worker though, right? You don't have to
> actually restart logical replication, just disconnect and reconnect?
>
> Right.
>
> >
> >
> >> > I see your point but I don't see a pressing need for such a function
> >> > for PG13.  Basically, these counters will be populated when we have
> >> > large transactions in the system so not sure how much is the use case
> >> > for such a function. Note that we need to add additional column
> >> > stats_reset in pg_stat_replication view as well similar to what we
> >> > have in pg_stat_archiver and pg_stat_bgwriter.  OTOH, I don't see any
> >> > big reason for not having such a function for PG14.
> >>
> >> Ok. I think the reset function is mostly for evaluations or rare
> >> cases. In either case, since it's not an urgent case we can postpone
> >> it to PG14 if necessary.
> >
> >
> > Reading through this thread, I agree that it's kind of weird to keep
> cumulative stats mixed with non-cumulative stats. (it always irks me, for
> example, that we have numbackends in pg_stat_database which behaves
> different from every other column in it)
> >
> > However, I don't see how they *are* cumulative. They are only cumulative
> while the client is connected -- as soon as it disconnects they go away. In
> that regard, they're more like the pg_stat_progress_xyz views for example.
> >
> > Which makes it mostly useless for long-term tracking anyway. Because no
> matter which way you snapshot it, you will lose data.
> >
> > ISTM the logical places to keep cumulative stats would be
> pg_replication_slots? (Or go create a pg_stat_replication_slots?) That is,
> that the logical grouping of these statistics for long-term is the
> replication slot, not the walsender?
>
> I personally prefer to display these values in pg_replication_slots.
> If we create a new stats view, it's only for logical replication
> slots? Or displaying both types of slots as physical replication slots
> might have statistics in the future?
>

Yeah, I think it's kind of a weird situation. There's already some things
in pg_replication_slots that should probably be in a stat_ view,  so if we
were to create one we would have to move those, and probably needlessly
break things for people.

i guess we could have separate views for logical and pysical slots since
there are things that only one of them will have. But there is that already
-- the database for example, and xmins. Splitting that apart now should be
a bigger thing, but I don't think it's worth it.


If we move these values to replication slots, I think we can change
> the code so that these statistics are managed by replication slots
> (e.g. ReplicationSlot struct). Once ReplicationSlot has these values,
> we can keep them beyond reconnections or multiple calls of SQL
> interface functions. Of course, these values don’t need to be
> persisted.
>

Eh, why should they not be persisted? The comparison would be temp_files
and temp_bytes in pg_stat_database, and those *are* persisted.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: Default setting for enable_hashagg_disk

On Tue, 2020-04-07 at 11:20 -0700, Jeff Davis wrote:
> Now that we have Disk-based Hash Aggregation, there are a lot more
> situations where the planner can choose HashAgg. The
> enable_hashagg_disk GUC, if set to true, chooses HashAgg based on
> costing. If false, it only generates a HashAgg path if it thinks it
> will fit in work_mem, similar to the old behavior (though it wlil now
> spill to disk if the planner was wrong about it fitting in work_mem).
> The current default is true.
> 
> I expect this to be a win in a lot of cases, obviously. But as with
> any
> planner change, it will be wrong sometimes. We may want to be
> conservative and set the default to false depending on the experience
> during beta. I'm inclined to leave it as true for now though, because
> that will give us better information upon which to base any decision.

A compromise may be to multiply the disk costs for HashAgg by, e.g. a
1.5 - 2X penalty. That would make the plan changes less abrupt, and may
mitigate some of the concerns about I/O patterns that Tomas raised
here:


https://www.postgresql.org/message-id/20200519151202.u2p2gpiawoaznsv2@development

The issues were improved a lot, but it will take us a while to really
tune the IO behavior as well as Sort.

Regards,
Jeff Davis

Re: Default setting for enable_hashagg_disk

On Tue, 2020-06-09 at 21:15 -0500, Justin Pryzby wrote:
> The behavior of the GUC is inconsistent with the other GUCs, which is
> confusing.  See also Robert's comments in this thread.
> 
https://www.postgresql.org/message-id/20200407223900.GT2228%40telsasoft.com

enable_* GUCs are planner GUCs, so it would be confusing to me if they
affected execution-time behavior.

I think the point of confusion is that it's not enabling/disabling an
entire execution node; it only "disables" HashAgg if it thinks it will
spill. I agree that is a difference with the other GUCs, and could
cause confusion.

Stepping back, I was trying to solve two problems with these GUCs:

1. Testing the spilling of hashed grouping sets: I'm inclined to just
get rid of enable_groupingsets_hash_disk and use Melanie's stats-
hacking approach instead.

2. Trying to provide an escape hatch for someone who experiences a
performance regression and wants something like the old behavior back.
There are two aspects of the old behavior that a user could potentially
want back:
  a. Don't choose HashAgg if it's expected to have more groups than fit
into a work_mem-sized hashtable.
  b. If executing HashAgg, and the hash table exceeds work_mem, just
keep going.

The behavior in v13 master is, by default, analagous to Sort or
anything else that adapts at runtime to spill. If we had spillable
HashAgg the whole time, we wouldn't be worried about #2 at all. But,
out of conservatism, I am trying to accommodate users who want an
escape hatch, at least for a release or two until users feel more
comfortable with disk-based HashAgg.

Setting enable_hash_disk=false implements 2(a). This name apparently
causes confusion, but it's hard to come up with a better one because
the v12 behavior has nuance that's hard to express succinctly. I don't
think the names you suggested quite fit, but the idea to use a more
interesting GUC value might help express the behavior. Perhaps making
enable_hashagg a ternary "enable_hashagg=on|off|avoid_disk"? The word
"reject" is too definite for the planner, which is working with
imperfect information.

In master, there is no explicit way to get 2(b), but you can just set
work_mem higher in a lot of cases. If enough people want 2(b), I can
add it easily. Perhaps hashagg_overflow=on|off, which would control
execution time behavior?

Regards,
Jeff Davis

Re: Fast DSM segments

On Tue, Jun 9, 2020 at 6:03 PM Thomas Munro  wrote:
> That all makes sense.  Now I'm wondering if I should use exactly that
> word in the GUC... dynamic_shared_memory_preallocate?

I tend to prefer verb-object rather than object-verb word ordering,
because that's how English normally works, but I realize this is not a
unanimous view.

It's a little strange because the fact of preallocating it makes it
not dynamic any more. I don't know what to do about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Command statistics system (cmdstats)

On Wed, Mar 4, 2020 at 10:43 PM Mark Dilger
 wrote:
> The two main differences are that
>
> (1) This implementation is based on commandtags as enums, not strings and
> (2) This implementation uses techniques to reduce lock contention
>
> I think (2) is the more important part.

My spidey sense is tingling here, telling me that we need some actual
benchmarking. Like, suppose we test the two patches under normal cases
and under cases that are constructed to be as bad as possible for each
of them. Or suppose we test this patch with the lock mitigation
strategies and then remove the mitigations for some inexpensive
command (e.g. SHOW) and then use pgbench to spam that command. Like
you, I suspect that the locking mitigations are important in some
workloads, but it would be good to have some figures to back that out,
as well as to find out whether there's still too much overhead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Default setting for enable_hashagg_disk

On Tue, 2020-06-09 at 18:20 -0700, Melanie Plageman wrote:
> So, I was catching up on email and noticed the last email in this
> thread.
> 
> I think I am not fully understanding what
> enable_groupingsets_hash_disk
> does. Is it only for testing?

It's mostly for testing. I could imagine cases where it would be useful
to force groupingsets to use the disk, but I mainly wanted the setting
there for testing the grouping sets hash disk code path.

> Using the tests you added to src/test/regress/sql/groupingsets.sql, I
> did get a plan that looks like hashagg is spilling to disk (goes
> through

I had something that worked as a test for a while, but then when I
tweaked the costing, it started using the Sort path (therefore not
testing my grouping sets hash disk code at all) and a bug crept in. So
I thought it would be best to have a more forceful knob.

Perhaps I should just get rid of that GUC and use the stats trick?

Regards,
Jeff Davis

Re: Internal key management system

2020-06-10 Thread Bruce Momjian

On Fri, Jun  5, 2020 at 03:34:54PM +0200, Fabien COELHO wrote:
> Obviously it requires some more thinking and design, but my point is that
> postgres should not hold a KEK, ever, nor presume how DEK are to be managed
> by a DMS, and that is not very difficult to achieve by putting it outside of
> pg and defining how interactions take place. Providing a reference/example
> implementation would be nice as well, and Masahiko-san code can be rewrapped
> quite easily.

Well, the decrypted keys are already stored in backend memory, so what
risk does haveing the KEK in memory for a brief period avoid?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

On Tue, Jun 9, 2020 at 6:54 PM Andres Freund  wrote:
> What do you think about my idea of having a BEGIN/END_SIGNAL_HANDLER?
> That'd make it much easier to write assertions forbidding palloc, 64bit
> atomics, ...

I must have missed the previous place where you suggested this, but I
think it's a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Physical replication slot advance is not persistent

2020-06-10 Thread Alexey Kondratov

On 2020-06-10 11:38, Kyotaro Horiguchi wrote:

At Wed, 10 Jun 2020 15:53:53 +0900, Michael Paquier
 wrote in

> > I find it really depressing how much obviously untested stuff gets
> > added in this area.
>
> Prior to this patch pg_replication_slot_advance was not being tested
> at all.
> Unfortunately, added tests appeared to be not enough to cover all
> cases. It
> seems that the whole machinery of WAL holding and trimming is worth
> to be
> tested more thoroughly.

I think that it would be interesting if we had a SQL representation of
the contents of XLogCtlData (wanted that a couple of times).  Now we
are actually limited to use a checkpoint and check that past segments
are getting recycled by looking at the contents of pg_wal.  Doing that
here does not cause the existing tests to be much more expensive as we
only need one extra call to pg_switch_wal(), mostly.  Please see the
attached.

The test in the patch looks fine to me and worked well for me.

Using smaller wal_segment_size (1(MB) worked for me) reduces the cost
of the check, but I'm not sure it's worth doing.

New test reproduces this issue well. Left it running for a couple of 
hours in repeat and it seems to be stable.

Just noted that we do not need to keep $phys_restart_lsn_pre:

my $phys_restart_lsn_pre = $node_master->safe_psql('postgres',
	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 
'$phys_slot';"

);
chomp($phys_restart_lsn_pre);

we can safely use $current_lsn used for pg_replication_slot_advance(), 
since reatart_lsn is set as is there. It may make the test a bit simpler 
as well.

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

COPY, bytea streaming and memory footprint

2020-06-10 Thread Jerome Wagner

Hello,

I am trying to understand/optimize how a COPY operation behaves when
transfering a bytea from the database to a client.

For simplicity, I'll consider that I have only one bytea _image_ in the
_images_ table.

Starting with
COPY (SELECT image FROM images) TO STDOUT BINARY

I understand that :
 - the server will create a linear buffer on the server side holding the
whole image and then start sending it over the network in one big copyData
message chunked in 64KB network chunks
 - the client can manage to extract this copyData payload by re-assembling
those chunks in memory or by streaming the relevant data parts of the
chunks elsewhere.

so the problem I see in a streaming situation is that the server actually
needs to buffer the whole image in memory.

Now the image is already compressed so if I
ALTER TABLE images ALTER image SET STORAGE EXTERNAL
I can use the fact that substring on non compressed toasted values will
fetch only the needed parts and do

COPY (
  SELECT (
SELECT substring(image from n for 65536) from images)
FROM generate_series(1, (select length(image) from images), 65536) n
  ) TO STDOUT BINARY

As I understand it, this would be less memory intensive on the server side
if the server starts sending rows before all rows of the subselect are
built because it would only need to prepare a sequence of  65536 bytes long
buffers for the rows it would decide to have in memory.

but is there a way to know if such a COPY/SELECT statement will indeed
start sending rows before they are all prepared on the server ? Does it
depend on the request and  is there a difference if I add an order by on
the select versus the natural order of the table ?
How many rows will be needed in memory before the sending begins ?

I hope my explanation was clear. I am looking for help in better
understanding how the server decides to stream the COPY data out of the
server vs the internal retrieval of the COPY'd subselect.

Thank you
Jérôme

Re: Default setting for enable_hashagg_disk

2020-06-10 Thread Melanie Plageman

On Tue, Jun 9, 2020 at 7:15 PM Justin Pryzby  wrote:

> On Tue, Jun 09, 2020 at 06:20:13PM -0700, Melanie Plageman wrote:
> > On Thu, Apr 9, 2020 at 1:02 PM Jeff Davis  wrote:
> >
> > > 2. enable_groupingsets_hash_disk (default false):
> > >
> > > This is about how we choose which grouping sets to hash and which to
> > > sort when generating mixed mode paths.
> > >
> > > Even before this patch, there are quite a few paths that could be
> > > generated. It tries to estimate the size of each grouping set's hash
> > > table, and then see how many it can fit in work_mem (knapsack), while
> > > also taking advantage of any path keys, etc.
> > >
> > > With Disk-based Hash Aggregation, in principle we can generate paths
> > > representing any combination of hashing and sorting for the grouping
> > > sets. But that would be overkill (and grow to a huge number of paths if
> > > we have more than a handful of grouping sets). So I think the existing
> > > planner logic for grouping sets is fine for now. We might come up with
> > > a better approach later.
> > >
> > > But that created a testing problem, because if the planner estimates
> > > correctly, no hashed grouping sets will spill, and the spilling code
> > > won't be exercised. This GUC makes the planner disregard which grouping
> > > sets' hash tables will fit, making it much easier to exercise the
> > > spilling code. Is there a better way I should be testing this code
> > > path?
> >
> > So, I was catching up on email and noticed the last email in this
> > thread.
> >
> > I think I am not fully understanding what enable_groupingsets_hash_disk
> > does. Is it only for testing?
>
> If so, it should be in category: "Developer Options".
>
> > Using the tests you added to src/test/regress/sql/groupingsets.sql, I
> > did get a plan that looks like hashagg is spilling to disk (goes through
> > hashagg_spill_tuple() code path and has number of batches reported in
> > Explain) in a MixedAgg plan for a grouping sets query even with
> > enable_groupingsets_hash_disk set to false.
>
> > I'm not sure if this is more what you were looking for--or maybe I am
> > misunderstanding the guc.
>
> The behavior of the GUC is inconsistent with the other GUCs, which is
> confusing.  See also Robert's comments in this thread.
> https://www.postgresql.org/message-id/20200407223900.GT2228%40telsasoft.com
>
> The old (pre-13) behavior was:
>  - work_mem is the amount of RAM to which each query node tries to
> constrain
>itself, and the planner will reject a plan if it's expected to exceed
> that.
>...But a chosen plan might exceed work_mem anyway.
>
> The new behavior in v13 seems to be:
>  - HashAgg now respects work_mem, but instead enable*hash_disk are
>opportunisitic.  A node which is *expected* to spill to disk will be
>rejected.
>...But at execution time, a node which exceeds work_mem will be spilled.
>
> If someone sees a plan which spills to disk and wants to improve
> performance by
> avoid spilling, they might SET enable_hashagg_disk=off, which might do what
> they want (if the plan is rejected at plan time), or it might not, which I
> think will be a surprise every time.
>
>
But I thought that the enable_groupingsets_hash_disk GUC allows us to
test the following scenario:

The following is true:
- planner thinks grouping sets' hashtables table would fit in memory
  (spilling is *not* expected)
- user is okay with spilling
- some grouping keys happen to be sortable and some hashable

The following happens:
- Planner generates some HashAgg grouping sets paths
- A MixedAgg plan is created
- During execution of the MixedAgg plan, one or more grouping sets'
  hashtables would exceed work_mem, so the executor spills those tuples
  to disk instead of exceeding work_mem

Especially given the code and comment:
/*
* If we have sortable columns to work with (gd->rollups is non-empty)
* and enable_groupingsets_hash_disk is disabled, don't generate
* hash-based paths that will exceed work_mem.
*/
if (!enable_groupingsets_hash_disk &&
hashsize > work_mem * 1024L && gd->rollups)
return; /* nope, won't fit */

If this is the scenario that the GUC is designed to test, it seems like
you could exercise it without the enable_groupingsets_hash_disk GUC by
lying about the stats, no?

-- 
Melanie Plageman

Re: Recording test runtimes with the buildfarm

2020-06-10 Thread Stephen Frost

Greetings,

* Andrew Dunstan (andrew.duns...@2ndquadrant.com) wrote:
> On 6/10/20 10:13 AM, Tom Lane wrote:
> > Andrew Dunstan  writes:
> >> Alternatively, people with access to the database could extract the logs
> >> and post-process them using perl or python. That would involve no work
> >> on my part :-) But it would not be automated.
> > Yeah, we could easily extract per-test-script runtimes, since pg_regress
> > started to print those.  But ...
> >
> >> What we do record (in build_status_log) is the time each step took. So
> >> any regression test that suddenly blew out should likewise cause a
> >> blowout in the time the whole "make check" took.
> > I have in the past scraped the latter results and tried to make sense of
> > them.  They are *mighty* noisy, even when considering just one animal
> > that I know to be running on a machine with little else to do.  Maybe
> > averaging across the whole buildfarm could reduce the noise level, but
> > I'm not very hopeful.  Per-test-script times would likely be even
> > noisier (ISTM anyway, maybe I'm wrong).
> >
> > The entire reason we've been discussing a separate performance farm
> > is the expectation that buildfarm timings will be too noisy to be
> > useful to detect any but the most obvious performance effects.
> 
> Yes, but will the performance farm be testing regression timings?

We are not currently envisioning that, no.

> Maybe we're going to need several test suites, one of which could be
> regression tests. But the regression tests themselves are not really
> intended for performance testing.

Agree with this- better would be tests which are specifically written to
test performance instead.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: Recording test runtimes with the buildfarm

2020-06-10 Thread Andrew Dunstan



On 6/10/20 10:13 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> Alternatively, people with access to the database could extract the logs
>> and post-process them using perl or python. That would involve no work
>> on my part :-) But it would not be automated.
> Yeah, we could easily extract per-test-script runtimes, since pg_regress
> started to print those.  But ...
>
>> What we do record (in build_status_log) is the time each step took. So
>> any regression test that suddenly blew out should likewise cause a
>> blowout in the time the whole "make check" took.
> I have in the past scraped the latter results and tried to make sense of
> them.  They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do.  Maybe
> averaging across the whole buildfarm could reduce the noise level, but
> I'm not very hopeful.  Per-test-script times would likely be even
> noisier (ISTM anyway, maybe I'm wrong).
>
> The entire reason we've been discussing a separate performance farm
> is the expectation that buildfarm timings will be too noisy to be
> useful to detect any but the most obvious performance effects.
>
>   



Yes, but will the performance farm be testing regression timings?


Maybe we're going to need several test suites, one of which could be
regression tests. But the regression tests themselves are not really
intended for performance testing.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: password_encryption default

2020-06-10 Thread Jonathan S. Katz

On 6/10/20 10:47 AM, Peter Eisentraut wrote:
> On 2020-05-28 15:28, Jonathan S. Katz wrote:
>> On 5/28/20 8:10 AM, Peter Eisentraut wrote:
>>> On 2020-05-27 15:25, Jonathan S. Katz wrote:
 $ initdb -D data --auth-local=scram-sha-256 --auth-host=md5

 Got an error message:

 "initdb: error: must specify a password for the superuser to enable md5
 authentication"

 For the last two, that behavior is to be expected (after all, you've
 set
 the two login vectors to require passwords), but the error message
 seems
 odd now. Perhaps we tweak it to be:


 "initdb: error: must specify a password for the superuser when
 requiring
 passwords for both local and host authentication."
>>>
>>> That message is a bit long.  Maybe just
>>>
>>> must specify a password for the superuser to enable password
>>> authentication
>>>
>>> without reference to the specific method.  I think the idea is clear
>>> there.
>>
>> +1
> 
> committed

Yay!!! Thank you!

Jonathan



signature.asc
Description: OpenPGP digital signature

Re: password_encryption default

2020-06-10 Thread Peter Eisentraut


On 2020-05-28 15:28, Jonathan S. Katz wrote:

On 5/28/20 8:10 AM, Peter Eisentraut wrote:

On 2020-05-27 15:25, Jonathan S. Katz wrote:

$ initdb -D data --auth-local=scram-sha-256 --auth-host=md5

Got an error message:

"initdb: error: must specify a password for the superuser to enable md5
authentication"

For the last two, that behavior is to be expected (after all, you've set
the two login vectors to require passwords), but the error message seems
odd now. Perhaps we tweak it to be:


"initdb: error: must specify a password for the superuser when requiring
passwords for both local and host authentication."


That message is a bit long.  Maybe just

must specify a password for the superuser to enable password authentication

without reference to the specific method.  I think the idea is clear there.


+1


committed

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: elog(DEBUG2 in SpinLocked section.

On Tue, Jun 9, 2020 at 8:12 PM Andres Freund  wrote:
> I don't think the size is worth of concern in this case, and I'm not
> sure there's any current case where it's really worth spending effort
> reducing size. But if there is: It seems possible to reduce the size.

Yeah, I don't think it's very important.

> First, we could remove the tranche from the lwlock, and instead perform
> more work when we need to know it. Which is only when we're going to
> sleep, so it'd be ok if it's not that much work. Perhaps we could even
> defer determining the tranche to the the *read* side of the wait event
> (presumably that'd require making the pgstat side a bit more
> complicated).
>
> Second, it seems like it should be doable to reduce the size of the
> waiters list. We e.g. could have a separate 'array of wait lists' array
> in shared memory, which gets assigned to an lwlock whenever a backend
> wants to wait for an lwlock. The number of processes waiting for lwlocks
> is clearly limited by MAX_BACKENDS / 2^18-1 backends waiting, so one 4
> byte integer pointing to a wait list obviously would suffice.
>
> But again, I'm not sure the current size a real problem anywhere.

Honestly, both of these sound more painful than it's worth. We're not
likely to have enough LWLocks that using 16 bytes for each one rather
than 8 is a major problem. With regard to the first of these ideas,
bear in mind that the LWLock might be in a DSM segment that the reader
doesn't have mapped.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Terminate the idle sessions

2020-06-10 Thread Adam Brusselback

>
> > Why not implement it in the core of Postgres? Are there any
disadvantages of
> implementing it in the core of Postgres?
I was surprised this wasn't a feature when I looked into it a couple years
ago. I'd use it if it were built in, but I am not installing something
extra just for this.

> I’m curious as to the use case because I cannot imagine using this.

My use case is, I have a primary application that connects to the DB, most
users work through that (setting is useless for this scenario, app manages
it's connections well enough). I also have a number of internal users who
deal with data ingestion and connect to the DB directly to work, and those
users sometimes leave query windows open for days accidentally. Generally
not an issue, but would be nice to be able to time those connections out.

Just my $0.02, but I am +1.
-Adam

Re: Recording test runtimes with the buildfarm

Andrew Dunstan  writes:
> Alternatively, people with access to the database could extract the logs
> and post-process them using perl or python. That would involve no work
> on my part :-) But it would not be automated.

Yeah, we could easily extract per-test-script runtimes, since pg_regress
started to print those.  But ...

> What we do record (in build_status_log) is the time each step took. So
> any regression test that suddenly blew out should likewise cause a
> blowout in the time the whole "make check" took.

I have in the past scraped the latter results and tried to make sense of
them.  They are *mighty* noisy, even when considering just one animal
that I know to be running on a machine with little else to do.  Maybe
averaging across the whole buildfarm could reduce the noise level, but
I'm not very hopeful.  Per-test-script times would likely be even
noisier (ISTM anyway, maybe I'm wrong).

The entire reason we've been discussing a separate performance farm
is the expectation that buildfarm timings will be too noisy to be
useful to detect any but the most obvious performance effects.

regards, tom lane

pg_upgrade fails if vacuum_defer_cleanup_age > 0

2020-06-10 Thread Laurenz Albe

A customer's upgrade failed, and it took me a while to
figure out that the problem was that they had set
"vacuum_defer_cleanup_age=1" on the new cluster.

The consequence was that the "vacuumdb --freeze" that
takes place before copying commit log files failed to
freeze "pg_database".
That caused later updates to the table to fail with
"Could not open file "pg_xact/": No such file or directory."

I think it would increase the robustness of pg_upgrade to
force "vacuum_defer_cleanup_age" to 0 on the new cluster.

Suggested patch attached.

Yours,
Laurenz Albe
From e8f84eaadb132b8647406bc28028862d6d4d8717 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Wed, 10 Jun 2020 15:57:36 +0200
Subject: [PATCH] Force vacuum_defer_cleanup_age=0 during pg_upgrade

This is important on the new cluster, because tuples must
be frozen before the commit logs can be safely copied from
the old server.

Author: Laurenz Albe
Backpatch-through: 9.5
---
 src/bin/pg_upgrade/server.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c
index 7d17280ecb..79ec3f04c0 100644
--- a/src/bin/pg_upgrade/server.c
+++ b/src/bin/pg_upgrade/server.c
@@ -239,6 +239,9 @@ start_postmaster(ClusterInfo *cluster, bool report_and_exit_on_error)
 	 * we only modify the new cluster, so only use it there.  If there is a
 	 * crash, the new cluster has to be recreated anyway.  fsync=off is a big
 	 * win on ext4.
+	 *
+	 * Force vacuum_defer_cleanup_age to 0 on the new cluster, so that
+	 * vacuumdb --freeze actually freezes the tuples.
 	 */
 	snprintf(cmd, sizeof(cmd),
 			 "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s %s%s\" start",
@@ -247,7 +250,7 @@ start_postmaster(ClusterInfo *cluster, bool report_and_exit_on_error)
 			  BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
 			 " -c autovacuum=off -c autovacuum_freeze_max_age=20",
 			 (cluster == &new_cluster) ?
-			 " -c synchronous_commit=off -c fsync=off -c full_page_writes=off" : "",
+			 " -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c vacuum_defer_cleanup_age=0" : "",
 			 cluster->pgopts ? cluster->pgopts : "", socket_string);
 
 	/*
-- 
2.21.3

Re: Recording test runtimes with the buildfarm

2020-06-10 Thread Andrew Dunstan

On 6/10/20 8:58 AM, David Rowley wrote:
> Hi,
>
> I've just had some thoughts about the possible usefulness of having
> the buildfarm record the run-time of each regression test to allow us
> to have some sort of ability to track the run-time history of each
> test.
>
> I thought the usefulness might be two-fold:
>
> 1. We could quickly identify when someone adds some overly complex
> test and slows down the regression tests too much.
> 2. We might get some faster insight into performance regressions.
>
> I can think of about 3 reasons that a test might slow down.
>
> a) Someone adds some new tests within the test file.
> b) Actual performance regression in Postgres
> c) Animal busy with other work.
>
> We likely could do a semi-decent job of telling a) and b) apart by
> just recording the latest commit that changed the .sql file for the
> test. We could also likely see when c) is at play by the results
> returning back to normal again a few runs after some spike.  We'd only
> want to pay attention to consistent slowdowns. Perhaps there would be
> too much variability with the parallel tests, but maybe we could just
> record it for the serial tests in make check-world.
>
> I only thought of this after reading [1]. If we went ahead with that,
> as of now, it feels like someone could quite easily break that
> optimisation and nobody would notice for a long time.
>
> I admit to not having looked at the buildfarm code to determine how
> practical such a change would be. I've assumed there is a central
> database that stores all the results.
>

David,

Yes we have a central database. But we don't have anything that digs
into each step. In fact, on the server side the code isn't even really
aware that it's Postgres that's being tested.

The basic structures are:

buildsystems - one row per animal

build_status - one row per run

build_status_log - one row per step within each run

The last table has a blob containing the output of the step (e.g. "make
check") plus any relevant files (like the postmaster.log).

But we don't have any structure corresponding to an individual
regression test.

Architecturally, we could add a table for named times, and have the
client extract and report them.

Alternatively, people with access to the database could extract the logs
and post-process them using perl or python. That would involve no work
on my part :-) But it would not be automated.

What we do record (in build_status_log) is the time each step took. So
any regression test that suddenly blew out should likewise cause a
blowout in the time the whole "make check" took. It might be possible to
create a page that puts stats like that up. I think that's probably a
better way to go.

Tying results to an individual commit is harder. There could be dozens
of commits that happened between the current run and the previous run on
a given animal. But usually what has caused any event is fairly clear
when you look at it.

cheers

andrew

-- 

Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Terminate the idle sessions

2020-06-10 Thread Li Japin



On Jun 10, 2020, at 4:25 PM, Michael Paquier 
mailto:mich...@paquier.xyz>> wrote:

Idle sessions staying around can be a problem in the long run as they
impact snapshot building.  You could for example use a background
worker to do this work, like that:
https://github.com/michaelpq/pg_plugins/tree/master/kill_idle

Why not implement it in the core of Postgres? Are there any disadvantages of
implementing it in the core of Postgres?

Japin Li

Re: Atomic operations within spinlocks

Robert Haas  writes:
> On Fri, Jun 5, 2020 at 8:19 PM Andres Freund  wrote:
>> Randomly noticed while looking at the code:
>>  uint64  flagbit = UINT64CONST(1) << (uint64) type;
>> 
>> that shouldn't be 64bit, right?

> I'm going to admit ignorance here. What's the proper coding rule?

The shift distance can't exceed 64, so there's no need for it to be
wider than int.  "type" is an enum, so explicitly casting it to an
integral type seems like good practice, but int is sufficient.

ISTR older compilers insisting that the shift distance not be
wider than int.  But C99 doesn't seem to require that -- it only
restricts the value of the right operand.

regards, tom lane

Re: Parallel Seq Scan vs kernel read ahead

On Wed, Jun 10, 2020 at 6:04 PM David Rowley  wrote:
>
> On Wed, 10 Jun 2020 at 17:39, David Rowley  wrote:
> >
> > On Wed, 10 Jun 2020 at 17:21, Thomas Munro  wrote:
> > > I also heard from Andres that he likes this patch with his AIO
> > > prototype, because of the way request merging works.  So it seems like
> > > there are several reasons to want it.
> > >
> > > But ... where should we get the maximum step size from?  A GUC?
> >
> > I guess we'd need to determine if other step sizes were better under
> > any conditions.  I guess one condition would be if there was a LIMIT
> > clause. I could check if setting it to 1024 makes any difference, but
> > I'm thinking it won't since I got fairly consistent results on all
> > worker settings with the patched version.
>
> I did another round of testing on the same machine trying some step
> sizes larger than 64 blocks. I can confirm that it does improve the
> situation further going bigger than 64.
>

Can we try the same test with 4, 8, 16 workers as well?  I don't
foresee any problem with a higher number of workers but it might be
better to once check that if it is not too much additional work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Recording test runtimes with the buildfarm

2020-06-10 Thread Stephen Frost

Greetings,

* David Rowley (dgrowle...@gmail.com) wrote:
> 1. We could quickly identify when someone adds some overly complex
> test and slows down the regression tests too much.

Sure, makes sense to me.  We do track the individual 'stage_duration'
but we don't track things down to a per-regression-test basis.  To do
that, I think we'd need the regression system to spit that kind of
detailed information out somewhere (in a structured format) that the
buildfarm client would then be able to pick it up and send to the server
to write into an appropriate table.

> 2. We might get some faster insight into performance regressions.

There's some effort going into continuing to build out a "performance"
farm, whose specific goal is to try and help exactly this issue.  Trying
to do that with the buildfarm has the challenge that many of those
systems aren't dedicated and therefore the timing could vary wildly
between runs due to entirely independent things than our code.

Would certainly be great to have more people working on that.  Currently
it's primarily Ilaria (GSoC student), Ads and I.

Current repo is https://github.com/PGPerfFarm/pgperffarm if folks want
to look at it, but we're in the process of making some pretty serious
changes, so now might not be the best time to look at the code.  We're
coordinating on the 'Postgresteam' slack in #perffarm for anyone
interested tho.

> I only thought of this after reading [1]. If we went ahead with that,
> as of now, it feels like someone could quite easily break that
> optimisation and nobody would notice for a long time.

One of the goals with the perffarm is to be able to support different
types of benchmarks, beyond just pgbench, so that we'd be able to add a
benchmark for "numeric", perhaps, or maybe create a script with pgbench
that ends up being heavy on numerics or such.

> I admit to not having looked at the buildfarm code to determine how
> practical such a change would be. I've assumed there is a central
> database that stores all the results.

Yes, there's a central database where results are pushed and that's what
you see when you go to buildfarm.postgresql.org, there's also an archive
server which has the logs going much farther back (and is quite a bit
larger, of course..).

Thanks,

Stephen

signature.asc
Description: PGP signature

Recording test runtimes with the buildfarm

Hi,

I've just had some thoughts about the possible usefulness of having
the buildfarm record the run-time of each regression test to allow us
to have some sort of ability to track the run-time history of each
test.

I thought the usefulness might be two-fold:

1. We could quickly identify when someone adds some overly complex
test and slows down the regression tests too much.
2. We might get some faster insight into performance regressions.

I can think of about 3 reasons that a test might slow down.

a) Someone adds some new tests within the test file.
b) Actual performance regression in Postgres
c) Animal busy with other work.

We likely could do a semi-decent job of telling a) and b) apart by
just recording the latest commit that changed the .sql file for the
test. We could also likely see when c) is at play by the results
returning back to normal again a few runs after some spike. We'd only
want to pay attention to consistent slowdowns. Perhaps there would be
too much variability with the parallel tests, but maybe we could just
record it for the serial tests in make check-world.

I only thought of this after reading [1]. If we went ahead with that,
as of now, it feels like someone could quite easily break that
optimisation and nobody would notice for a long time.

I admit to not having looked at the buildfarm code to determine how
practical such a change would be. I've assumed there is a central
database that stores all the results.

David

[1]
https://www.postgresql.org/message-id/CAJ3gD9eEXJ2CHMSiOehvpTZu3Ap2GMi5jaXhoZuW=3xjlmz...@mail.gmail.com

Re: [PATCH] Add support for choosing huge page size

2020-06-10 Thread Odin Ugedal

Thanks again Thomas,

> Oh, so maybe we need a configure test for them?  And if you don't have
> it, a runtime error if you try to set the page size to something other
> than 0 (like we do for effective_io_concurrency if you don't have a
> posix_fadvise() function).

Ahh, yes, that sounds reasonable. Did some fiddling with the configure
script to add a check, and think I got it right (but not 100% sure
tho.). Added new v3 patch.

> If you set it to an unsupported size, that seems reasonable to me.  If
> you set it to an unsupported size and have huge_pages=try, do we fall
> back to using no huge pages?

Yes, the "fallback" with huge_pages=try is the same for both
huge_page_size=0 and huge_page_size=nMB, and is the same as without
this patch.

> For what it's worth, here's what I know about this on other operating systems:

Thanks for all the background info!

> 1.  AIX can do huge pages, but only if you use System V shared memory
> (not for mmap() anonymous shared).  In
> https://commitfest.postgresql.org/25/1960/ we got as far as adding
> support for shared_memory_type=sysv, but to go further we'll need
> someone willing to hack on the patch on an AIX system, preferably with
> root access so they can grant the postgres user wired memory
> privileges (or whatever they call that over there).  But at a glance,
> they don't have a way to ask for a specific page size, just "large".

Interesting. I might get access to some AIX systems at university this fall,
so maybe I will get some time to dive into the patch.


Odin
From 8cb876bf73258646044a6a99d72e7c12d1d03e3a Mon Sep 17 00:00:00 2001
From: Odin Ugedal 
Date: Sun, 7 Jun 2020 21:04:57 +0200
Subject: [PATCH v3] Add support for choosing huge page size

This adds support for using non-default huge page sizes for shared
memory. This is achived via the new "huge_page_size" config entry.
The config value defaults to 0, meaning it will use the system default.
---
 configure | 26 +++
 configure.in  |  4 ++
 doc/src/sgml/config.sgml  | 27 
 doc/src/sgml/runtime.sgml | 41 +++-
 src/backend/port/sysv_shmem.c | 67 ++-
 src/backend/utils/misc/guc.c  | 25 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/pg_config.h.in|  8 +++
 src/include/pg_config_manual.h|  6 ++
 src/include/storage/pg_shmem.h|  1 +
 10 files changed, 176 insertions(+), 31 deletions(-)

diff --git a/configure b/configure
index 2feff37fe3..11e3112ee4 100755
--- a/configure
+++ b/configure
@@ -15488,6 +15488,32 @@ _ACEOF
 
 fi # fi
 
+# Check if system supports mmap flags for allocating huge page memory with page sizes
+# other than the default
+ac_fn_c_check_decl "$LINENO" "MAP_HUGE_MASK" "ac_cv_have_decl_MAP_HUGE_MASK" "#include 
+"
+if test "x$ac_cv_have_decl_MAP_HUGE_MASK" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_MAP_HUGE_MASK $ac_have_decl
+_ACEOF
+ac_fn_c_check_decl "$LINENO" "MAP_HUGE_SHIFT" "ac_cv_have_decl_MAP_HUGE_SHIFT" "#include 
+"
+if test "x$ac_cv_have_decl_MAP_HUGE_SHIFT" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_MAP_HUGE_SHIFT $ac_have_decl
+_ACEOF
+
+
 ac_fn_c_check_decl "$LINENO" "fdatasync" "ac_cv_have_decl_fdatasync" "#include 
 "
 if test "x$ac_cv_have_decl_fdatasync" = xyes; then :
diff --git a/configure.in b/configure.in
index 0188c6ff07..f56c06eb3d 100644
--- a/configure.in
+++ b/configure.in
@@ -1687,6 +1687,10 @@ AC_CHECK_FUNCS(posix_fadvise)
 AC_CHECK_DECLS(posix_fadvise, [], [], [#include ])
 ]) # fi
 
+# Check if system supports mmap flags for allocating huge page memory with page sizes
+# other than the default
+AC_CHECK_DECLS([MAP_HUGE_MASK, MAP_HUGE_SHIFT], [], [], [#include ])
+
 AC_CHECK_DECLS(fdatasync, [], [], [#include ])
 AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 # This is probably only present on macOS, but may as well check always
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index aca8f73a50..42f06a41cb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1582,6 +1582,33 @@ include_dir 'conf.d'
   
  
 
+ 
+  huge_page_size (integer)
+  
+   huge_page_size configuration parameter
+  
+  
+  
+   
+Controls what size of huge pages is used in conjunction with
+.
+The default is zero (0).
+When set to 0, the default huge page size on the system will
+be used.
+   
+   
+Some commonly available page sizes on modern 64 bit server architectures include:
+2MB and 1GB (Intel and AMD), 16MB and
+16GB (IBM POWER), and 64kB, 2MB,
+32MB and 1GB (ARM). For more information
+about usage and support, see .
+   
+   
+

Re: Parallel Seq Scan vs kernel read ahead

On Wed, 10 Jun 2020 at 17:39, David Rowley  wrote:
>
> On Wed, 10 Jun 2020 at 17:21, Thomas Munro  wrote:
> > I also heard from Andres that he likes this patch with his AIO
> > prototype, because of the way request merging works.  So it seems like
> > there are several reasons to want it.
> >
> > But ... where should we get the maximum step size from?  A GUC?
>
> I guess we'd need to determine if other step sizes were better under
> any conditions.  I guess one condition would be if there was a LIMIT
> clause. I could check if setting it to 1024 makes any difference, but
> I'm thinking it won't since I got fairly consistent results on all
> worker settings with the patched version.

I did another round of testing on the same machine trying some step
sizes larger than 64 blocks. I can confirm that it does improve the
situation further going bigger than 64.

I got up as far as 16384, but made a couple of additional changes for
that run only. Instead of increasing the ramp-up 1 block at a time, I
initialised phsw_step_size to 1 and multiplied it by 2 until I reached
the chosen step size. With numbers that big, ramping up 1 block at a
time was slow enough that I'd never have reached the target step size

Here are the results of the testing:

Master:

max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481)
(706.2MB/sec)
max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556)
(320.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056)
(318.6MB/sec)

Patched stepsize = 64:

max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364)
(741.7MB/sec)
max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982)
(723.2MB/sec)
max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356)
(731.4MB/sec)

Patched stepsize = 1024:

max_parallel_workers_per_gather = 0: Time: 152599.159 ms (02:32.599)
(687.1MB/sec)
max_parallel_workers_per_gather = 1: Time: 104227.232 ms (01:44.227)
(1006.04MB/sec)
max_parallel_workers_per_gather = 2: Time: 97149.343 ms (01:37.149)
(1079.3MB/sec)

Patched stepsize = 8192:

max_parallel_workers_per_gather = 0: Time: 143524.038 ms (02:23.524)
(730.59MB/sec)
max_parallel_workers_per_gather = 1: Time: 102899.288 ms (01:42.899)
(1019.0MB/sec)
max_parallel_workers_per_gather = 2: Time: 91148.340 ms (01:31.148)
(1150.4MB/sec)

Patched stepsize = 16384 (power 2 ramp-up)

max_parallel_workers_per_gather = 0: Time: 144598.502 ms (02:24.599)
(725.16MB/sec)
max_parallel_workers_per_gather = 1: Time: 97344.160 ms (01:37.344)
(1077.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 88025.420 ms (01:28.025)
(1191.2MB/sec)

I thought about what you mentioned about a GUC, and I think it's a bad
idea to do that. I think it would be better to choose based on the
relation size. For smaller relations, we want to keep the step size
small. Someone may enable parallel query on such a small relation if
they're doing something like calling an expensive function on the
results, so we do need to avoid going large for small relations.

I considered something like:

create function nextpower2(a bigint) returns bigint as $$ declare n
bigint := 1; begin while n < a loop n := n * 2; end loop; return n;
end; $$ language plpgsql;
select pg_size_pretty(power(2,p)::numeric * 8192) rel_size,
nextpower2(power(2,p)::bigint / 1024) as stepsize from
generate_series(1,32) p;
 rel_size | stepsize
--+--
 16 kB|1
 32 kB|1
 64 kB|1
 128 kB   |1
 256 kB   |1
 512 kB   |1
 1024 kB  |1
 2048 kB  |1
 4096 kB  |1
 8192 kB  |1
 16 MB|2
 32 MB|4
 64 MB|8
 128 MB   |   16
 256 MB   |   32
 512 MB   |   64
 1024 MB  |  128
 2048 MB  |  256
 4096 MB  |  512
 8192 MB  | 1024
 16 GB| 2048
 32 GB| 4096
 64 GB| 8192
 128 GB   |16384
 256 GB   |32768
 512 GB   |65536
 1024 GB  |   131072
 2048 GB  |   262144
 4096 GB  |   524288
 8192 GB  |  1048576
 16 TB|  2097152
 32 TB|  4194304

So with that algorithm with this 100GB table that I've been using in
my test, we'd go with a step size of 16384. I think we'd want to avoid
going any more than that. The above code means we'll do between just
below 0.1% and 0.2% of the relation per step. If I divided the number
of blocks by say 128 instead of 1024, then that would be about 0.78%
and 1.56% of the relation each time. It's not unrealistic today that
someone might throw that many workers at a job, so, I'd say dividing
by 1024 or even 2048 would likely be about right.

David

Re: Auto-vectorization speeds up multiplication of large-precision numerics

2020-06-10 Thread Amit Khandekar

On Wed, 10 Jun 2020 at 04:20, Peter Eisentraut
 wrote:
>
> On 2020-06-09 13:50, Amit Khandekar wrote:
> > Also, the regress/sql/numeric_big test itself speeds up by 80%
>
> That's nice.  I can confirm the speedup:
>
> -O3 without the patch:
>
>   numeric  ... ok  737 ms
> test numeric_big  ... ok 1014 ms
>
> -O3 with the patch:
>
>   numeric  ... ok  680 ms
> test numeric_big  ... ok  580 ms
>
> Also:
>
> -O2 without the patch:
>
>   numeric  ... ok  693 ms
> test numeric_big  ... ok 1160 ms
>
> -O2 with the patch:
>
>   numeric  ... ok  677 ms
> test numeric_big  ... ok  917 ms
>
> So the patch helps either way.

Oh, I didn't observe that the patch helps numeric_big.sql to speed up
to some extent even with -O2. For me, it takes 805 on head and 732 ms
with patch. One possible reason that I can think of is : Because of
the forward loop traversal, pre-fetching might be helping. But this is
just a wild guess.

-O3 : HEAD
test numeric  ... ok  102 ms
test numeric_big  ... ok  803 ms

-O3 : patched :
test numeric  ... ok  100 ms
test numeric_big  ... ok  450 ms

-O2 : HEAD
test numeric  ... ok  100 ms
test numeric_big  ... ok  805 ms

-O2 patched :
test numeric  ... ok  103 ms
test numeric_big  ... ok  732 ms

> But it also seems that without the patch, -O3 might
> be a bit slower in some cases. This might need more testing.

For me, there is no observed change in the times with -O2 versus -O3,
on head. Are you getting a consistent slower numeric*.sql tests with
-O3 on current code ? Not sure what might be the reason.
But this is not related to the patch. Is it with the context of patch
that you are suggesting that it might need more testing ? There might
be existing cases that might be running a bit slower with O3, but
that's strange actually. Probably optimization in those cases might
not be working as thought by the compiler, and in fact they might be
working negatively. Probably that's one of the reasons -O2 is the
default choice.

>
> > For the for loop to be auto-vectorized, I had to simplify it to
> > something like this :
>
> Well, how do we make sure we keep it that way?  How do we prevent some
> random rearranging of the code or some random compiler change to break
> this again?

I believe the compiler rearranges the code segments w.r.t. one another
when those are independent of each other. I guess the compiler is able
to identify that. With our case, it's the for loop. There are some
variables used inside it, and that would prevent it from moving the
for loop. Even if the compiler finds it safe to move relative to
surrounding code, it would not spilt the for loop contents themselves,
so the for loop will remain intact, and so would the vectorization,
although it seems to keep an unrolled version of the loop in case it
is called with smaller iteration counts. But yes, if someone in the
future tries to change the for loop, it would possibly break the
auto-vectorization. So, we should have appropriate comments (patch has
those). Let me know if you find any possible reasons due to which the
compiler would in the future stop the vectorization even when there is
no change in the for loop.

It might look safer if we take the for loop out into an inline
function; just to play it safe ?

walkdir does not honor elevel because of call to AllocateDir, possibly causing issues in abort handler

2020-06-10 Thread Jelte Fennema

walkdir is used indirectly in the abort handler of SharedFileSetOnDetach, which 
has the following comment:

/*
* Callback function that will be invoked when this backend detaches from a
* DSM segment holding a SharedFileSet that it has created or attached to. If
* we are the last to detach, then try to remove the directories and
* everything in them. We can't raise an error on failures, because this runs
* in error cleanup paths.
*/

walkdir itself has elevel, which is set to LOG in that case, so it should ot 
throw an ERROR.

However, since walkdir calls AllocateDir this is not always true. AllocateDir 
throws an ERROR in two places:

  1.  
https://github.com/postgres/postgres/blob/5a4ada71a8f944600c348a6e4f5feb388ba8bd37/src/backend/storage/file/fd.c#L2590-L2593
  2.  and inderictly in reserveAllocatedDesc 
https://github.com/postgres/postgres/blob/5a4ada71a8f944600c348a6e4f5feb388ba8bd37/src/backend/storage/file/fd.c#L2266-L2268

The fix seems simple enough: AllocateDir and reserveAllocatedDesc should take 
an elevel argument and honor that. To not change the signature of AllocateDir 
and possibly break extions, it could simply become a wrapper of a new function 
like AllocateDirWithElevel(dirname, ERROR).

Re: calling procedures is slow and consumes extra much memory against calling function

2020-06-10 Thread Pavel Stehule

st 10. 6. 2020 v 12:26 odesílatel Amit Khandekar 
napsal:

> On Sat, 16 May 2020 at 00:07, Pavel Stehule 
> wrote:
> >
> > Hi
> >
> >>>
> >>> The problem is in plpgsql implementation of CALL statement
> >>>
> >>> In non atomic case -  case of using procedures from DO block, the
> expression plan is not cached, and plan is generating any time. This is
> reason why it is slow.
> >>>
> >>> Unfortunately, generated plans are not released until SPI_finish.
> Attached patch fixed this issue.
> >>
> >>
> >> But now, recursive calling doesn't work :-(. So this patch is not enough
> >
> >
> > Attached patch is working - all tests passed
>
> Could you show an example testcase that tests this recursive scenario,
> with which your earlier patch fails the test, and this v2 patch passes
> it ? I am trying to understand the recursive scenario and the re-use
> of expr->plan.
>

it hangs on plpgsql tests. So you can apply first version of patch

and "make check"


> >
> > It doesn't solve performance, and doesn't solve all memory problems, but
> significantly reduce memory requirements from 5007 bytes to 439 bytes per
> one CALL
>
> So now  this patch's intention is to reduce memory consumption, and it
> doesn't target slowness improvement, right ?
>

yes. There is a problem with planning every execution when the procedure
was called from not top context.



> --
> Thanks,
> -Amit Khandekar
> Huawei Technologies
>

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 10, 2020 at 5:02 PM Dilip Kumar  wrote:
>
> On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila  wrote:
> >
> > > 2. Files should not be closed at the end of the transaction:
> > > Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> > > registered to be closed on EOXACT.  Basically, we need to open the
> > > changes file on the stream start and keep it open until stream stop,
> > > so we can not afford to get it closed on the EOXACT.  I have added a
> > > flag for the same.
> > >
> >
> > But where do we end the transaction before the stream stop which can
> > lead to closure of this file?
>
> Currently, I am keeping the transaction only while creating/opening
> the files and closing immediately after that,  maybe we can keep the
> transaction until stream stop, then we can avoid this changes,  and we
> can also avoid creating extra resource owner?  What is your thought on
> this?
>

I would prefer to keep the transaction until the stream stop unless
there are good reasons for not doing so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-10 Thread Dilip Kumar

On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila  wrote:
>
> On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar  wrote:
> >
> >
> > Currently, I am done with a working prototype of using the BufFile
> > infrastructure for the tempfile.  Meanwhile, I want to discuss a few
> > interface changes required for the BufFIle infrastructure.
> >
> > 1. Support read-write mode for "BufFileOpenShared",  Basically, in
> > workers we will be opening the xid's changes and subxact files per
> > stream, so we need an RW mode even in the open.  I have passed a flag
> > for the same.
> >
>
> Generally file open APIs have mode as a parameter to indicate
> read_only or read_write.  Using flag here seems a bit odd to me.

Let me think about it, we can try to pass the mode.

> > 2. Files should not be closed at the end of the transaction:
> > Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> > registered to be closed on EOXACT.  Basically, we need to open the
> > changes file on the stream start and keep it open until stream stop,
> > so we can not afford to get it closed on the EOXACT.  I have added a
> > flag for the same.
> >
>
> But where do we end the transaction before the stream stop which can
> lead to closure of this file?

Currently, I am keeping the transaction only while creating/opening
the files and closing immediately after that,  maybe we can keep the
transaction until stream stop, then we can avoid this changes,  and we
can also avoid creating extra resource owner?  What is your thought on
this?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Atomic operations within spinlocks

On Fri, Jun 5, 2020 at 8:19 PM Andres Freund  wrote:
> Randomly noticed while looking at the code:
> uint64  flagbit = UINT64CONST(1) << (uint64) type;
>
> that shouldn't be 64bit, right?

I'm going to admit ignorance here. What's the proper coding rule?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: FailedAssertion at ReorderBufferCheckMemoryLimit()

On Wed, Jun 10, 2020 at 9:15 AM Fujii Masao  wrote:
>
> On 2020/06/10 12:00, Amit Kapila wrote:
> > On Tue, Jun 9, 2020 at 2:28 PM Amit Kapila  wrote:
> >>
> >> On Tue, Jun 9, 2020 at 1:56 PM Fujii Masao  
> >> wrote:
> >>>
> >>
> >>> I wonder if we may
> >>> need to keep evicting the transactions until we don't exceed
> >>> memory limit.
> >>>
> >>
> >> Yes, that should be the right fix here.
> >>
> >
> > Can you please check whether the attached patch fixes the problem for you?
>
> Thanks for the patch! The patch looks good to me.
>

Thanks, pushed!

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Inlining of couple of functions in pl_exec.c improves performance

2020-06-10 Thread Amit Khandekar

On Tue, 9 Jun 2020 at 21:49, Pavel Stehule  wrote:
> Is your patch in commitfest in commitfest application?

Thanks for reminding me. Just added.
https://commitfest.postgresql.org/28/2590/

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar  wrote:
>
>
> Currently, I am done with a working prototype of using the BufFile
> infrastructure for the tempfile.  Meanwhile, I want to discuss a few
> interface changes required for the BufFIle infrastructure.
>
> 1. Support read-write mode for "BufFileOpenShared",  Basically, in
> workers we will be opening the xid's changes and subxact files per
> stream, so we need an RW mode even in the open.  I have passed a flag
> for the same.
>

Generally file open APIs have mode as a parameter to indicate
read_only or read_write.  Using flag here seems a bit odd to me.

> 2. Files should not be closed at the end of the transaction:
> Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> registered to be closed on EOXACT.  Basically, we need to open the
> changes file on the stream start and keep it open until stream stop,
> so we can not afford to get it closed on the EOXACT.  I have added a
> flag for the same.
>

But where do we end the transaction before the stream stop which can
lead to closure of this file?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: calling procedures is slow and consumes extra much memory against calling function

2020-06-10 Thread Amit Khandekar

On Sat, 16 May 2020 at 00:07, Pavel Stehule  wrote:
>
> Hi
>
>>>
>>> The problem is in plpgsql implementation of CALL statement
>>>
>>> In non atomic case -  case of using procedures from DO block, the 
>>> expression plan is not cached, and plan is generating any time. This is 
>>> reason why it is slow.
>>>
>>> Unfortunately, generated plans are not released until SPI_finish. Attached 
>>> patch fixed this issue.
>>
>>
>> But now, recursive calling doesn't work :-(. So this patch is not enough
>
>
> Attached patch is working - all tests passed

Could you show an example testcase that tests this recursive scenario,
with which your earlier patch fails the test, and this v2 patch passes
it ? I am trying to understand the recursive scenario and the re-use
of expr->plan.

>
> It doesn't solve performance, and doesn't solve all memory problems, but 
> significantly reduce memory requirements from 5007 bytes to 439 bytes per one 
> CALL

So now  this patch's intention is to reduce memory consumption, and it
doesn't target slowness improvement, right ?

-- 
Thanks,
-Amit Khandekar
Huawei Technologies

Re: Relation wide 'LIKE' clause

2020-06-10 Thread Georgios




‐‐‐ Original Message ‐‐‐
On Wednesday, June 10, 2020 12:08 PM, Peter Eisentraut 
 wrote:

> On 2020-06-10 11:42, Georgios wrote:
>
> > Postgres create table statement supports `LIKE source_table [like_option... 
> > ]`
> > to specify `a table from which the new table automatically copies all 
> > column names, their data types, and their not-null constraints.` according 
> > to
> > documentation [1].
> > I am wondering if a similar clause would make sense to copy relation wide
> > settings. For example consider a relation created like this:
> > `CREATE TABLE source_table ([column, ...]) USING customam WITH 
> > (storage_parameter1 = value1, ... )`
>
> We already have LIKE INCLUDING STORAGE. Maybe that should just be
> updated to work like what you are asking for.

This is correct. However I do see some limitations there. Consider the valid 
scenario:

`CREATE TABLE target (LIKE source_am1 INCLUDING ALL, LIKE source_am2 INCLUDING 
ALL)`

Which source relation should be used for the access method and storage 
parameters?

Also I _think_ that the current `LIKE` clause specifically targets column 
definitions
in the SQL standard. I am a bit hesitant on the last part, yet this is my
current understanding.

Please, let me know what you think.

>
> --
>
> Peter Eisentraut http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Relation wide 'LIKE' clause

2020-06-10 Thread Peter Eisentraut


On 2020-06-10 11:42, Georgios wrote:

Postgres create table statement supports `LIKE source_table [like_option... ]`
to specify `a table from which the new table automatically copies all column
names, their data types, and their not-null constraints.` according to
documentation [1].

I am wondering if a similar clause would make sense to copy relation wide
settings. For example consider a relation created like this:

`CREATE TABLE source_table ([column, ...]) USING customam WITH 
(storage_parameter1 = value1, ... )`


We already have LIKE INCLUDING STORAGE.  Maybe that should just be 
updated to work like what you are asking for.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: typos in comments referring to macros