Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-25 Thread Dilip Kumar
On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila  wrote:
>
> On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar  wrote:
> >
> > Your changes look fine to me.  Additionally, I have changed a test
> > case of getting the streaming changes in 0002.  Instead of just
> > showing the count, I am showing that the transaction is actually
> > streaming.
> >
>
> If you want to show the changes, then there is no need to display 157
> rows; probably a few (10-15) should be sufficient.  If we can do that
> by increasing the size of the row then good; otherwise, I think it is
> better to retain the test to display the count.

I think existing test cases also display multiple lines; e.g. toast.out
shows 235 rows.  But maybe I will try to reduce it to fewer rows.

> Today, I have again looked at the first patch
> (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
> find any more problems with it so planning to commit the same unless
> you or someone else want to add more to it.   Just for ease of others,
> "the next patch extends the logical decoding output plugin API with
> stream methods".   It adds seven methods to the output plugin API,
> adding support for streaming changes for large in-progress
> transactions. The methods are stream_start, stream_stop, stream_abort,
> stream_commit, stream_change, stream_message, and stream_truncate.
> Most of this is a simple extension of the existing methods, with the
> semantic difference that the transaction (or subtransaction) is
> incomplete and may be aborted later (which is something the regular
> API does not really need to deal with).
>
> This also extends the 'test_decoding' plugin, implementing these new
> stream methods.  The stream_start/stream_stop are used to demarcate a
> chunk of changes streamed for a particular toplevel transaction.
>
> This commit simply adds these new APIs and the upcoming patch to
> "allow the streaming mode in ReorderBuffer" will use these APIs.

LGTM

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 5:31 PM Peter Geoghegan  wrote:
> I'm glad that this better principled approach is possible. It's hard
> to judge how much of a problem this really is, though. We'll need to
> think about this aspect some more.

BTW, your HLL patch ameliorates the problem with my extreme "sorted vs
random input" test case from this morning [1] (the thing that I just
discussed with Tomas). Without the HLL patch the sorted case had 2424
batches. With the HLL patch it has 20. That at least seems like a
notable improvement.

[1] 
https://postgr.es/m/CAH2-Wz=ur7MQKpaUZJP=Adtg0TPMx5M_WoNE=ke2vUU=amd...@mail.gmail.com
-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 4:56 PM Jeff Davis  wrote:
> I wrote a quick patch to use HyperLogLog to estimate the number of
> groups contained in a spill file. It seems to reduce the
> overpartitioning effect, and is a more principled approach than what I
> was doing before.

This pretty much fixes the issue that I observed with overpartitioning.
At least in the sense that the number of partitions grows more
predictably -- even when the number of partitions planned is reduced
the change in the number of batches seems smooth-ish. It "looks nice".

> It does seem to hurt the runtime slightly when spilling to disk in some
> cases. I haven't narrowed down whether this is because we end up
> recursing multiple times, or if it's just more efficient to
> overpartition, or if the cost of doing the HLL itself is significant.

I'm glad that this better principled approach is possible. It's hard
to judge how much of a problem this really is, though. We'll need to
think about this aspect some more.

Thanks
-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 5:05 PM Tomas Vondra
 wrote:
> I'm not sure what you mean by "reported memory usage doesn't reflect the
> space used for transition state"? Surely it does include that, we've
> built the memory accounting stuff pretty much exactly to do that.
>
> I think it's pretty clear what's happening - in the sorted case there's
> only a single group getting new values at any moment, so when we decide
> to spill we'll only add rows to that group and everything else will be
> spilled to disk.

Right.

> In the unsorted case, however, we manage to initialize all groups in the
> hash table, but at that point the groups are tiny and fit into work_mem.
> As we process more and more data the groups grow, but we can't evict
> them - at the moment we don't have that capability. So we end up
> processing everything in memory, but significantly exceeding work_mem.

work_mem was set to 200MB, which is more than the reported "Peak
Memory Usage: 1605334kB". So either the random case significantly
exceeds work_mem and the "Peak Memory Usage" accounting is wrong
(because it doesn't report this excess), or the random case really
doesn't exceed work_mem but has a surprising advantage over the sorted
case.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Tomas Vondra

On Sat, Jul 25, 2020 at 10:07:37AM -0700, Peter Geoghegan wrote:

>On Sat, Jul 25, 2020 at 9:39 AM Peter Geoghegan  wrote:
>
>>"Peak Memory Usage: 1605334kB"
>>
>>Hash agg avoids spilling entirely (so the planner gets it right this
>>time around). It even uses notably less memory.
>
>I guess that this is because the reported memory usage doesn't reflect
>the space used for transition state, which is presumably most of the
>total -- array_agg() is used in the query.



I'm not sure what you mean by "reported memory usage doesn't reflect the
space used for transition state"? Surely it does include that, we've
built the memory accounting stuff pretty much exactly to do that.

I think it's pretty clear what's happening - in the sorted case there's
only a single group getting new values at any moment, so when we decide
to spill we'll only add rows to that group and everything else will be
spilled to disk.

In the unsorted case, however, we manage to initialize all groups in the
hash table, but at that point the groups are tiny and fit into work_mem.
As we process more and more data the groups grow, but we can't evict
them - at the moment we don't have that capability. So we end up
processing everything in memory, but significantly exceeding work_mem.


FWIW all my tests are done on the same TPC-H data set clustered by
l_shipdate (so probably random with respect to other columns).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Sat, 2020-07-25 at 13:27 -0700, Peter Geoghegan wrote:
> > It's not clear to me that overpartitioning is a real problem in this
> > case -- but I think the fact that it's causing confusion is enough
> > reason to see if we can fix it.
> 
> I'm not sure about that either.
> 
> FWIW I notice that when I reduce work_mem a little further (to 3MB)
> with the same query, the number of partitions is still 128, while the
> number of run time batches is 16,512 (an increase from 11,456 from
> 6MB
> work_mem). I notice that 16512/128 is 129, which hints at the nature
> of what's going on with the recursion. I guess it would be ideal if
> the growth in batches was more gradual as I subtract memory.

I wrote a quick patch to use HyperLogLog to estimate the number of
groups contained in a spill file. It seems to reduce the
overpartitioning effect, and is a more principled approach than what I
was doing before.

It does seem to hurt the runtime slightly when spilling to disk in some
cases. I haven't narrowed down whether this is because we end up
recursing multiple times, or if it's just more efficient to
overpartition, or if the cost of doing the HLL itself is significant.

Regards,
Jeff Davis

From 424f611f442e36a91747fd39cd28acba42eb958b Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sat, 25 Jul 2020 14:29:59 -0700
Subject: [PATCH] HLL

---
 src/backend/executor/nodeAgg.c | 64 ++
 src/include/executor/nodeAgg.h |  2 +-
 2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b79c845a6b7..25771cb4869 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -245,9 +245,11 @@
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "common/hashfn.h"
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -295,6 +297,14 @@
 #define HASHAGG_READ_BUFFER_SIZE BLCKSZ
 #define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
 
+/*
+ * HyperLogLog is used for estimating the cardinality of the spilled tuples in
+ * a given partition. 5 bits corresponds to a size of about 32 bytes and a
+ * worst-case error of around 18%. That's effective enough to choose a
+ * reasonable number of partitions when recursing.
+ */
+#define HASHAGG_HLL_BIT_WIDTH 5
+
 /*
  * Estimate chunk overhead as a constant 16 bytes. XXX: should this be
  * improved?
@@ -339,6 +349,7 @@ typedef struct HashAggSpill
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
 	int			shift;			/* after masking, shift by this amount */
+	hyperLogLogState *hll_card;	/* cardinality estimate for contents */
 } HashAggSpill;
 
 /*
@@ -357,6 +368,7 @@ typedef struct HashAggBatch
 	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			input_tapenum;	/* input partition tape */
 	int64		input_tuples;	/* number of tuples in this batch */
+	double		input_card;		/* estimated group cardinality */
 } HashAggBatch;
 
 /* used to find referenced colnos */
@@ -409,7 +421,7 @@ static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 static long hash_choose_num_buckets(double hashentrysize,
 	long estimated_nbuckets,
 	Size memory);
-static int	hash_choose_num_partitions(uint64 input_groups,
+static int	hash_choose_num_partitions(double input_groups,
 	   double hashentrysize,
 	   int used_bits,
 	   int *log2_npartittions);
@@ -429,10 +441,11 @@ static void hashagg_finish_initial_spills(AggState *aggstate);
 static void hashagg_reset_spill_state(AggState *aggstate);
 static HashAggBatch *hashagg_batch_new(LogicalTapeSet *tapeset,
 	   int input_tapenum, int setno,
-	   int64 input_tuples, int used_bits);
+	   int64 input_tuples, double input_card,
+	   int used_bits);
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
-			   int used_bits, uint64 input_tuples,
+			   int used_bits, double input_groups,
 			   double hashentrysize);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 TupleTableSlot *slot, uint32 hash);
@@ -1775,7 +1788,7 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  * substantially larger than the initial value.
  */
 void
-hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	Size *mem_limit, uint64 *ngroups_limit,
 	int *num_partitions)
 {
@@ -1967,7 +1980,7 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  * *log2_npartitions to the 

Re: hashagg slowdown due to spill changes

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 12:41 PM Peter Geoghegan  wrote:
> I have added a new open item for this separate
> LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.

Attached is a rebased version of Andres' now-bitrot 2020-06-12 patch
("aggspeed.diff").

I find that Andres' original "SELECT cat, count(*) FROM
fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by
the patch. Without the patch, v13 takes ~11.46 seconds. With the
patch, it takes only ~10.64 seconds.

Didn't test it against v12 yet, but I have no reason to doubt Andres'
explanation. I gather that if we can get this patch committed, we can
close the relevant LookupTupleHashEntryHash() open item.

Can you take this off my hands, Jeff?

Thanks
-- 
Peter Geoghegan


0001-Fix-LookupTupleHashEntryHash-pipeline-stall-issue.patch
Description: Binary data


Re: [PATCH] Performance Improvement For Copy From Binary Files

2020-07-25 Thread Tom Lane
Amit Langote  writes:
> [ v7-0001-Improve-performance-of-binary-COPY-FROM-with-buff.patch ]

Pushed with cosmetic changes.

I'd always supposed that stdio does enough internal buffering that short
fread()s shouldn't be much worse than memcpy().  But I reproduced your
result of ~30% speedup for data with a lot of narrow text columns, using
RHEL 8.2.  Thinking this an indictment of glibc, I also tried it on
current macOS, and saw an even bigger speedup, approaching 50%.  So
there's definitely something to this.  I wonder if we ought to check
other I/O-constrained users of fread and fwrite, like pg_dump/pg_restore.
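
For anyone who wants to measure the stdio overhead outside of Postgres, a
standalone micro-benchmark along these lines shows the effect (this is not
part of the patch; the file name and the 2-byte field width are arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUFSZ 65536

int
main(void)
{
	static char buf[BUFSZ];
	uint16_t	v;
	clock_t		t0;
	FILE	   *f;

	/* pass 1: one short fread() per 2-byte field */
	if ((f = fopen("data.bin", "rb")) == NULL)
		return 1;
	t0 = clock();
	while (fread(&v, sizeof(v), 1, f) == 1)
		;
	printf("per-field fread: %.2fs\n",
		   (double) (clock() - t0) / CLOCKS_PER_SEC);
	fclose(f);

	/* pass 2: bulk fread() into a buffer, fields drained with memcpy() */
	if ((f = fopen("data.bin", "rb")) == NULL)
		return 1;
	t0 = clock();
	for (;;)
	{
		size_t		n = fread(buf, 1, BUFSZ, f);
		size_t		off;

		for (off = 0; off + sizeof(v) <= n; off += sizeof(v))
			memcpy(&v, buf + off, sizeof(v));
		if (n < BUFSZ)
			break;
	}
	printf("bulk fread + memcpy: %.2fs\n",
		   (double) (clock() - t0) / CLOCKS_PER_SEC);
	fclose(f);
	return 0;
}

The gap between the two timings is roughly the per-call overhead that the
patch avoids.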

A point that I did not see addressed in the thread is whether this
has any negative impact on the copy-from-frontend code path, where
there's no fread() to avoid; short reads from CopyGetData() are
already going to be satisfied by memcpy'ing from the fe_msgbuf.
However, a quick check suggests that this patch is still a small
win for that case too --- apparently the control overhead in
CopyGetData() is not negligible.

So the patch seems fine functionally, but there were some cosmetic
things I didn't like:

* Removing CopyGetInt32 and CopyGetInt16 seemed like a pretty bad
idea, because it made the callers much uglier and more error-prone.
This is a particularly bad example:

/* Header extension length */
-   if (!CopyGetInt32(cstate, &tmp) ||
-   tmp < 0)
+   if (CopyReadBinaryData(cstate, (char *) &tmp, sizeof(tmp)) !=
+   sizeof(tmp) || (tmp = (int32) pg_ntoh32(tmp)) < 0)

Putting side-effects into late stages of an if-condition is just
awful coding practice.  They're easy for a reader to miss and they
are magnets for bugs, because of the possibility that control doesn't
reach that part of the condition.

You can get the exact same speedup without any of those disadvantages
by marking these two functions "inline", so that's what I did.
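
For reference, an inline CopyGetInt32() along those lines looks roughly
like this (just a sketch here; see the committed code for the exact
version):

static inline bool
CopyGetInt32(CopyState cstate, int32 *val)
{
	uint32		buf;

	if (CopyReadBinaryData(cstate, (char *) &buf, sizeof(buf)) != sizeof(buf))
	{
		*val = 0;				/* suppress compiler warning */
		return false;
	}
	*val = (int32) pg_ntoh32(buf);
	return true;
}

so the callers can keep writing "if (!CopyGetInt32(cstate, &tmp) || tmp < 0)"
without any side-effects buried in the condition, and the compiler still
removes the call overhead.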

* I dropped the DRAIN_COPY_RAW_BUF macro too, as in my estimation it was
a net negative for readability.  With only two use-cases, having it made
the code longer not shorter; I was also pretty unconvinced about the
wisdom of having some of the loop's control logic inside the macro and
some outside.

* BTW, the macro definitions weren't particularly per project style
anyway.  We generally put at least one space before line-ending
backslashes.  I don't think pgindent will fix this for you; IME
it doesn't touch macro definitions at all.

* Did some more work on the comments.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 1:10 PM Jeff Davis  wrote:
> On Sat, 2020-07-25 at 11:05 -0700, Peter Geoghegan wrote:
> > What worries me a bit is the sharp discontinuities when spilling with
> > significantly less work_mem than the "optimal" amount. For example,
> > with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
> > that setting work_mem to 6MB looks like this:
>
> ...
>
> >  Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk
> > Usage: 2478080kB  HashAgg Batches: 128
>
> ...
>
> >  Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk
> > Usage: 2482152kB  HashAgg Batches: 11456

> It's not clear to me that overpartitioning is a real problem in this
> case -- but I think the fact that it's causing confusion is enough
> reason to see if we can fix it.

I'm not sure about that either.

FWIW I notice that when I reduce work_mem a little further (to 3MB)
with the same query, the number of partitions is still 128, while the
number of run time batches is 16,512 (an increase from 11,456 from 6MB
work_mem). I notice that 16512/128 is 129, which hints at the nature
of what's going on with the recursion. I guess it would be ideal if
the growth in batches was more gradual as I subtract memory.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Sat, 2020-07-25 at 11:05 -0700, Peter Geoghegan wrote:
> What worries me a bit is the sharp discontinuities when spilling with
> significantly less work_mem than the "optimal" amount. For example,
> with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
> that setting work_mem to 6MB looks like this:

...

>  Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk
> Usage: 2478080kB  HashAgg Batches: 128

...

>  Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk
> Usage: 2482152kB  HashAgg Batches: 11456

...

> My guess is that this is because the recursive hash aggregation
> misbehaves in a self-similar fashion once a certain tipping point has
> been reached.

It looks like it might be fairly easy to use HyperLogLog as an
estimator for the recursive step. That should reduce the
overpartitioning, which I believe is the cause of this discontinuity.

It's not clear to me that overpartitioning is a real problem in this
case -- but I think the fact that it's causing confusion is enough
reason to see if we can fix it.

Regards,
Jeff Davis






Re: hashagg slowdown due to spill changes

2020-07-25 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 4:51 PM Andres Freund  wrote:
> This is still not resolved. We're right now slower than 12. It's
> effectively not possible to do performance comparisons right now. This
> was nearly two months ago.

I have added a new open item for this separate
LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.

(For the record I mistakenly believed that commit 23023022 resolved
all of the concerns raised on this thread, which is why I closed out
the open item associated with this thread. Evidently work remains to
fix a remaining regression that affects simple in-memory hash
aggregation, though.)

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 10:23 AM Jeff Davis  wrote:
> There's also another effect at work that can cause the total number of
> batches to be higher for larger work_mem values: when we do recurse, we
> again need to estimate the number of partitions needed. Right now, we
> overestimate the number of partitions needed (to be conservative),
> which leads to a wider fan-out and lots of tiny partitions, and
> therefore more batches.

What worries me a bit is the sharp discontinuities when spilling with
significantly less work_mem than the "optimal" amount. For example,
with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
that setting work_mem to 6MB looks like this:

   ->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500
width=40) (actual time=21039.788..32278.703 rows=200 loops=1)
 Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
 Group Key: lineitem.l_partkey
 Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk
Usage: 2478080kB  HashAgg Batches: 128

(And we have a sensible looking number of batches that match the
number of planned partitions with higher work_mem settings, too.)

However, if I set work_mem to 5MB (or less), it looks like this:

->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500
width=40) (actual time=20849.490..37027.533 rows=200 loops=1)
 Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
 Group Key: lineitem.l_partkey
 Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk
Usage: 2482152kB  HashAgg Batches: 11456

So the number of partitions is still 128, but the number of batches
explodes to 11,456 all at once. My guess is that this is because the
recursive hash aggregation misbehaves in a self-similar fashion once a
certain tipping point has been reached. I expect that the exact nature
of that tipping point is very complicated, and generally dependent on
the dataset, clustering, etc. But I don't think that this kind of
effect will be uncommon.

(FWIW this example requires ~620MB work_mem to complete without
spilling at all -- so it's kind of extreme, though not quite as
extreme as many of the similar test results from Tomas.)

-- 
Peter Geoghegan




Re: Difference for Binary format vs Text format for client-server communication

2020-07-25 Thread Peter Eisentraut

On 2020-07-16 18:52, Andy Fan wrote:

> The reason I ask this is because I have a task to make numeric output
> similar to oracle.
> 
> Oracle:
> 
> SQL> select 2 / 1.0 from dual;
> 
>       2/1.0
> --
>           2
> 
> PG:
> 
> postgres=# select  2 / 1.0;
>        ?column?
> 
>   2.
> (1 row)
> 
> If the user uses text format, I can just hack some numeric_out function,
> but if they use binary format, it looks like I have to change the driver
> they used for it.  Am I understanding it correctly?


I think what you should be looking at is why the numeric division 
function produces that scale and possibly make changes there.  By the 
time the type's output or send function is invoked, that's already 
decided.  The output/send functions are not the place to make scale or 
other semantic adjustments.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Thu, 2020-07-23 at 19:33 -0700, Peter Geoghegan wrote:
> That does make it sound like the costs of the hash agg aren't being
> represented. I suppose it isn't clear if this is a costing issue
> because it isn't clear if the execution time performance itself is
> pathological or is instead something that must be accepted as the
> cost
> of spilling the hash agg in a general kind of way.

I have a feeling that this is mostly a costing problem. Sort uses its
memory in two different phases:

  1. when writing the sorted runs, it needs the memory to hold the run
before sorting it, and only a single buffer for the output tape; and
  2. when merging, it needs a lot of read buffers

But in HashAgg, it needs to hold all of the groups in memory *at the
same time* as it needs a lot of output buffers (one for each
partition). This doesn't matter a lot at high values of work_mem,
because the buffers will only be 8MB at most.

I did attempt to cost this properly: hash_agg_set_limits() takes into
account the memory the partitions will use, and the remaining memory is
what's used in cost_agg(). But there's a lot of room for error in
there.

If someone sees an obvious error in the costing, please let me know.
Otherwise, I think it will just take some time to make it better
reflect reality in a variety of cases. For v13, we will either need to
live with it or pessimize the costing for HashAgg until we get it
right.

Many costing issues can deal with a lot of slop -- e.g. HashJoin vs
MergeJoin -- because a small factor often doesn't make the difference
between plans. But HashAgg and Sort are more competitive with each
other, so costing needs to be more precise.

Regards,
Jeff Davis






Re: estimation problems for DISTINCT ON with FDW

2020-07-25 Thread Jeff Janes
On Fri, Jul 3, 2020 at 5:50 PM Tom Lane  wrote:

>
> OK, I'll go ahead and push the patch I proposed yesterday.
>

Thank you.  I tested 12_STABLE with my real queries on the real data set,
and the "hard coded" estimate of 200 distinct rows (when use_remote_estimte
is turned back on) is enough to get rid of the worst plans I was seeing in
12.3.

Cheers,

Jeff


Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Fri, 2020-07-24 at 10:40 +0200, Tomas Vondra wrote:
> FWIW one more suspicious thing that I forgot to mention is the
> behavior
> of the "planned partitions" depending on work_mem, which looks like
> this:
> 
>2MB   Planned Partitions:  64HashAgg Batches:  4160
>4MB   Planned Partitions: 128HashAgg Batches: 16512
>8MB   Planned Partitions: 256HashAgg Batches: 21488
>   64MB   Planned Partitions:  32HashAgg Batches:  2720
>  256MB   Planned Partitions:   8HashAgg Batches: 8
> 
> I'd expect the number of planned partitions to decrease (slowly) as
> work_mem increases, but it seems to increase initially. Seems a bit
> strange, but maybe it's expected.

The space for open-partition buffers is also limited to about 25% of
memory. Each open partition takes BLCKSZ memory, so those numbers are
exactly what I'd expect (64*8192 = 512kB).
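
Spelling that out for the other rows (assuming the default BLCKSZ of 8192
bytes and the ~25% cap):

   2MB * 25% =  512kB / 8kB per partition buffer =  64 partitions
   4MB * 25% = 1024kB / 8kB per partition buffer = 128 partitions
   8MB * 25% = 2048kB / 8kB per partition buffer = 256 partitions

At 64MB and 256MB the cap is no longer what limits us; presumably the
number of partitions needed for the estimated group count is already below
it, which is why the planned partition count drops again.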

There's also another effect at work that can cause the total number of
batches to be higher for larger work_mem values: when we do recurse, we
again need to estimate the number of partitions needed. Right now, we
overestimate the number of partitions needed (to be conservative),
which leads to a wider fan-out and lots of tiny partitions, and
therefore more batches.

I think we can improve this by using something like a HyperLogLog on
the hash values of the spilled tuples to get a better estimate for the
number of groups (and therefore the number of partitions) that we need
when we recurse, which would reduce the number of overall batches at
higher work_mem settings. But I didn't get a chance to implement that
yet.
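
To sketch the idea with the existing lib/hyperloglog.h API (the variable
names here are made up, and the real integration needs more care; for one
thing, the raw hash bits are also used to pick the partition, so they
probably need to be re-mixed before being added to the sketch):

	hyperLogLogState hll_card;		/* one per spill partition */
	double		input_card;

	initHyperLogLog(&hll_card, 5);	/* 5-bit bucket width, ~32 bytes */

	/* for every tuple written to this spill partition */
	addHyperLogLog(&hll_card, hash);	/* hash = the group's hash value */

	/* when setting up the batch that will be processed recursively */
	input_card = estimateHyperLogLog(&hll_card);
	freeHyperLogLog(&hll_card);

and then feed input_card, instead of the raw spilled-tuple count, into the
partition-count calculation for the recursive step.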

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 9:39 AM Peter Geoghegan  wrote:
> "Peak Memory Usage: 1605334kB"
>
> Hash agg avoids spilling entirely (so the planner gets it right this
> time around). It even uses notably less memory.

I guess that this is because the reported memory usage doesn't reflect
the space used for transition state, which is presumably most of the
total -- array_agg() is used in the query.

-- 
Peter Geoghegan




Re: Mark unconditionally-safe implicit coercions as leakproof

2020-07-25 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jul 24, 2020 at 12:17 PM Tom Lane  wrote:
>> I went through the system's built-in implicit coercions to see
>> which ones are unconditionally successful.  These could all be
>> marked leakproof, as per attached patch.

> IMHO, this is a nice improvement.

Thanks; pushed.  On second reading I found that there are a few
non-implicit coercions that could usefully be marked leakproof
as well --- notably float4_numeric and float8_numeric, which should
be error-free now that infinities can be converted.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 12:55 PM Peter Geoghegan  wrote:
> Could that be caused by clustering in the data?
>
> If the input data is in totally random order then we have a good
> chance of never having to spill skewed "common" values. That is, we're
> bound to encounter common values before entering spill mode, and so
> those common values will continue to be usefully aggregated until
> we're done with the initial groups (i.e. until the in-memory hash
> table is cleared in order to process spilled input tuples). This is
> great because the common values get aggregated without ever spilling,
> and most of the work is done before we even begin with spilled tuples.
>
> If, on the other hand, the common values are concentrated together in
> the input...

I still don't know if that was a factor in your example, but I can
clearly demonstrate that the clustering of data can matter a lot to
hash aggs in Postgres 13. I attach a contrived example where it makes
a *huge* difference.

I find that the sorted version of the aggregate query takes
significantly longer to finish, and has the following spill
characteristics:

"Peak Memory Usage: 205086kB  Disk Usage: 2353920kB  HashAgg Batches: 2424"

Note that the planner doesn't expect any partitions here, but we still
get 2424 batches -- so the planner seems to get it totally wrong.
OTOH, the same query against a randomized version of the same data (no
longer in sorted order, no clustering) works perfectly with the same
work_mem (200MB):

"Peak Memory Usage: 1605334kB"

Hash agg avoids spilling entirely (so the planner gets it right this
time around). It even uses notably less memory.

-- 
Peter Geoghegan


test-agg-sorted.sql
Description: Binary data


Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Fri, 2020-07-24 at 21:16 +0200, Tomas Vondra wrote:
> Surely more recursive spilling should do more I/O, but the Disk Usage
> reported by explain analyze does not show anything like ...

I suspect that's because of disk reuse in logtape.c. 

Regards,
Jeff Davis






Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-07-25 Thread Tom Lane
Amit Kapila  writes:
> On Fri, Jul 24, 2020 at 7:36 PM Tom Lane  wrote:
>> Yeah, the proposed comment changes don't actually add much.  Also
>> please try to avoid inserting non-ASCII  into the source code;
>> at least in my mail reader, that attachment has some.

> I don't see any non-ASCII characters in the patch.  I have applied and
> checked (via vi editor) the patch as well but don't see any non-ASCII
> characters.  How can I verify that?

They're definitely there:

$ od -c 0001-Fix-comments-in-heapam.c.patch
...
0002740   h   e  \n   +  \t   *   l   e   a   d   e   r   c
0002760   a   n   p   e   r   f   o   r   m   t   h   e   i
0003000   n   s   e   r   t   . 302 240   T   h   i   s   r   e
0003020   s   t   r   i   c   t   i   o   n   c   a   n   b   e
0003040   u   p   l   i   f   t   e   d   o   n   c   e   w
0003060   e  \n   +  \t   *   a   l   l   o   w   t   h   e
0003100 302 240   p   l   a   n   n   e   r   t   o   g   e   n
0003120   e   r   a   t   e   p   a   r   a   l   l   e   l   p
0003140   l   a   n   s   f   o   r   i   n   s   e   r   t   s
0003160   .  \n  \t   *   /  \n  \t   i   f   (   I   s
...

(Octal 302 240 is the UTF-8 encoding of U+00A0, a no-break space.)

I'm not sure if "git diff --check" would whine about this.

regards, tom lane




Re: Add header support to text format and matching feature

2020-07-25 Thread Daniel Verite
Rémi Lapeyre wrote:

> Here’s a new version that fix all the issues.

Here's a review of v4. 

The patch improves COPY in two ways:

- COPY TO and COPY FROM now accept "HEADER ON" for the TEXT format
(previously it was only for CSV)

- COPY FROM also accepts "HEADER match" to tell that there's a header
and that its contents must match the columns of the destination table.
This works for both the CSV and TEXT formats. The syntax for the
columns is the same as for the data and the match is case-sensitive.

The first improvement when submitted alone (in 2018 by Simon Muller)
has been judged not useful enough or even hazardous without any
"match" feature. It was returned with feedback in 2018-10 and
resubmitted by Rémi in 2020-02 with the match feature.

The patches  apply cleanly, "make check" and "make check-world" pass.

In my tests it works fine except for one crash that I can reproduce
on a fresh build and default configuration with:

$ cat >file.txt
i
1

$ psql
postgres=# create table x(i int);
CREATE TABLE
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
PANIC:  ERRORDATA_STACK_SIZE exceeded
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I suspect the reason is the way that PG_TRY/PG_CATCH is used, see below.

Code comments:


+/*
+ * Represents whether the header must be absent, present or present and match.
+ */
+typedef enum CopyHeader
+{
+	COPY_HEADER_ABSENT,
+	COPY_HEADER_PRESENT,
+	COPY_HEADER_MATCH
+} CopyHeader;
+
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -136,7 +146,7 @@ typedef struct CopyStateData
 	bool		binary;			/* binary format? */
 	bool		freeze;			/* freeze rows on loading? */
 	bool		csv_mode;		/* Comma Separated Value format? */
-	bool		header_line;	/* CSV or text header line? */
+	CopyHeader	header_line;	/* CSV or text header line? */


After the redefinition into this enum type, there are still a
bunch of references to header_line that treat it like a boolean:

1190:   if (cstate->header_line)
1398:   if (cstate->binary && cstate->header_line)
2119:   if (cstate->header_line)
3635:   if (cstate->cur_lineno == 0 && cstate->header_line)

It works fine since COPY_HEADER_ABSENT is 0 as the first value of the enum,
but maybe it's not good style to count on that.
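
For instance, comparing against the enum values explicitly would avoid
relying on that:

	if (cstate->header_line != COPY_HEADER_ABSENT)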



+			PG_TRY();
+			{
+				if (defGetBoolean(defel))
+					cstate->header_line = COPY_HEADER_PRESENT;
+				else
+					cstate->header_line = COPY_HEADER_ABSENT;
+			}
+			PG_CATCH();
+			{
+				char	   *sval = defGetString(defel);
+
+				if (!cstate->is_copy_from)
+					PG_RE_THROW();
+
+				if (pg_strcasecmp(sval, "match") == 0)
+					cstate->header_line = COPY_HEADER_MATCH;
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("header requires a boolean or \"match\"")));
+			}
+			PG_END_TRY();

It seems wrong to use a PG_CATCH block for this. I understand that
it's because defGetBoolean() calls ereport() on non-booleans, but then
it should be split into an error-throwing function and a
non-error-throwing lexical analysis of the boolean, the above code
calling the latter.
Besides the comments in elog.h above PG_TRY say that
 "the error recovery code
  can either do PG_RE_THROW to propagate the error outwards, or do a
  (sub)transaction abort. Failure to do so may leave the system in an
  inconsistent state for further processing."
Maybe this is what happens with the repeated uses of "match"
eventually failing with ERRORDATA_STACK_SIZE exceeded.
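
One way to do that split without any PG_TRY at all would be to use the
non-throwing parse_bool() from utils/builtins.h, along these lines; this
is only an untested sketch, and the bare HEADER case still has to mean
true, as it does with defGetBoolean():

	if (defel->arg == NULL)
		cstate->header_line = COPY_HEADER_PRESENT;	/* bare HEADER */
	else
	{
		char	   *sval = defGetString(defel);
		bool		bval;

		if (parse_bool(sval, &bval))
			cstate->header_line = bval ? COPY_HEADER_PRESENT : COPY_HEADER_ABSENT;
		else if (cstate->is_copy_from &&
				 pg_strcasecmp(sval, "match") == 0)
			cstate->header_line = COPY_HEADER_MATCH;
		else
			ereport(ERROR,
					(errcode(ERRCODE_SYNTAX_ERROR),
					 errmsg("header requires a Boolean value or \"match\"")));
	}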


-HEADER [ boolean ]
+HEADER { match | true | false }

This should be enclosed in square brackets because HEADER
with no argument is still accepted.




+  names from the table. On input, the first line is discarded when set
+  to 

Re: proposal: possibility to read dumped table's name from file

2020-07-25 Thread vignesh C
On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule  wrote:
>> I meant can this:
>> printf(_("  --filter=FILENAMEread object name filter
>> expressions from file\n"));
>> be changed to:
>> printf(_("  --filter=FILENAMEdump objects and data based
>> on the filter expressions from the filter file\n"));
>
> done in today patch
>

Thanks for fixing the  comments.
Few comments:
+ /* use "-" as symbol for stdin */
+ if (strcmp(filename, "-") != 0)
+ {
+ fp = fopen(filename, "r");
+ if (!fp)
+ fatal("could not open the input file \"%s\": %m",
+   filename);
+ }
+ else
+ fp = stdin;

We could use STDIN itself instead of "-"; it would be an easier
option to understand.

+ /* when first char is hash, ignore whole line */
+ if (*line == '#')
+ continue;

If a line starts with #, we ignore that line; I feel this should be
included in the documentation.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Improving connection scalability: GetSnapshotData()

2020-07-25 Thread Ranier Vilela
On Fri, Jul 24, 2020 at 21:00, Andres Freund  wrote:

> On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:
> > On Fri, Jul 24, 2020 at 14:16, Andres Freund  wrote:
> >
> > > On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > > > Latest Postgres
> > > > Windows 64 bits
> > > > msvc 2019 64 bits
> > > >
> > > > Patches applied v12-0001 to v12-0007:
> > > >
> > > >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
> > > > 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres
> > > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> > > > C4013: 'GetOldestXmin' undefined; assuming extern returning int
> > > > [C:\dll\postgres\pg_visibility.vcxproj]
> > > >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> > > > 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > > > [C:\dll\postgres\pgstattuple.vcxproj]
> > > >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> > > > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > > > [C:\dll\postgres\pg_visibility.vcxproj]
> > > >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> > > > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > > > [C:\dll\postgres\pg_visibility.vcxproj]
> > >
> > > I don't know what that's about - there's no call to GetOldestXmin() in
> > > pgstatapprox and pg_visibility after patch 0002? And similarly, the
> > > PROCARRAY_* references are also removed in the same patch?
> > >
> > Maybe need to remove them from these places, not?
> > C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
> > File pgstattuple\pgstatapprox.c:
> > OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
> > File pg_visibility\pg_visibility.c:
> > OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
> >  * deadlocks, because surely
> > GetOldestXmin() should never take
> > RecomputedOldestXmin =
> GetOldestXmin(NULL,
> > PROCARRAY_FLAGS_VACUUM);
>
> The 0002 patch changed those files:
>
> diff --git a/contrib/pg_visibility/pg_visibility.c
> b/contrib/pg_visibility/pg_visibility.c
> index 68d580ed1e0..37206c50a21 100644
> --- a/contrib/pg_visibility/pg_visibility.c
> +++ b/contrib/pg_visibility/pg_visibility.c
> @@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible,
> bool all_frozen)
> BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
> TransactionId OldestXmin = InvalidTransactionId;
>
> -   if (all_visible)
> -   {
> -   /* Don't pass rel; that will fail in recovery. */
> -   OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
> -   }
> -
> rel = relation_open(relid, AccessShareLock);
>
> /* Only some relkinds have a visibility map */
> check_relation_relkind(rel);
>
> +   if (all_visible)
> +   OldestXmin = GetOldestNonRemovableTransactionId(rel);
> +
> nblocks = RelationGetNumberOfBlocks(rel);
>
> /*
> @@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible,
> bool all_frozen)
>  * From a concurrency point of view, it
> sort of sucks to
>  * retake ProcArrayLock here while we're
> holding the buffer
>  * exclusively locked, but it should be
> safe against
> -* deadlocks, because surely
> GetOldestXmin() should never take
> -* a buffer lock. And this shouldn't
> happen often, so it's
> -* worth being careful so as to avoid
> false positives.
> +* deadlocks, because surely
> GetOldestNonRemovableTransactionId()
> +* should never take a buffer lock. And
> this shouldn't happen
> +* often, so it's worth being careful so
> as to avoid false
> +* positives.
>  */
> -   RecomputedOldestXmin = GetOldestXmin(NULL,
> PROCARRAY_FLAGS_VACUUM);
> +   RecomputedOldestXmin =
> GetOldestNonRemovableTransactionId(rel);
>
> if (!TransactionIdPrecedes(OldestXmin,
> RecomputedOldestXmin))
> record_corrupt_item(items,
> _self);
>
> diff --git a/contrib/pgstattuple/pgstatapprox.c
> b/contrib/pgstattuple/pgstatapprox.c
> index dbc0fa11f61..3a99333d443 100644
> --- a/contrib/pgstattuple/pgstatapprox.c
> +++ b/contrib/pgstattuple/pgstatapprox.c
> @@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
> BufferAccessStrategy bstrategy;
> TransactionId OldestXmin;
>
> -   OldestXmin = GetOldestXmin(rel, 

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-25 Thread Amit Kapila
On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar  wrote:
>
> Your changes look fine to me.  Additionally, I have changed a test
> case of getting the streaming changes in 0002.  Instead of just
> showing the count, I am showing that the transaction is actually
> streaming.
>

If you want to show the changes, then there is no need to display 157
rows; probably a few (10-15) should be sufficient.  If we can do that
by increasing the size of the row then good; otherwise, I think it is
better to retain the test to display the count.

Today, I have again looked at the first patch
(v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
find any more problems with it so planning to commit the same unless
you or someone else want to add more to it.   Just for ease of others,
"the next patch extends the logical decoding output plugin API with
stream methods".   It adds seven methods to the output plugin API,
adding support for streaming changes for large in-progress
transactions. The methods are stream_start, stream_stop, stream_abort,
stream_commit, stream_change, stream_message, and stream_truncate.
Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction (or subtransaction) is
incomplete and may be aborted later (which is something the regular
API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these new
stream methods.  The stream_start/stream_stop are used to demarcate a
chunk of changes streamed for a particular toplevel transaction.
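
For output plugin authors, wiring these up should just be a matter of
filling the new callback fields in _PG_output_plugin_init(), along these
lines (a sketch that assumes the new fields follow the existing *_cb
naming; the pg_decode_* functions are whatever the plugin implements):

	void
	_PG_output_plugin_init(OutputPluginCallbacks *cb)
	{
		/* ... existing callbacks ... */
		cb->stream_start_cb = pg_decode_stream_start;
		cb->stream_stop_cb = pg_decode_stream_stop;
		cb->stream_abort_cb = pg_decode_stream_abort;
		cb->stream_commit_cb = pg_decode_stream_commit;
		cb->stream_change_cb = pg_decode_stream_change;
		cb->stream_message_cb = pg_decode_stream_message;
		cb->stream_truncate_cb = pg_decode_stream_truncate;
	}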

This commit simply adds these new APIs and the upcoming patch to
"allow the streaming mode in ReorderBuffer" will use these APIs.

-- 
With Regards,
Amit Kapila.




Re: display offset along with block number in vacuum errors

2020-07-25 Thread Michael Paquier
On Fri, Jul 24, 2020 at 11:18:43PM +0530, Mahendra Singh Thalor wrote:
> In commit b61d161(Introduce vacuum errcontext to display additional
> information), we added vacuum errcontext to display additional
> information(block number) so that in case of vacuum error, we can identify
> which block we are getting error.  Addition to block number, if we can
> display offset, then it will be more helpful for users. So to display
> offset, here proposing two different methods(Thanks Robert for suggesting
> these 2 methods):

new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
vacrelstats->blkno = new_rel_pages;
+   vacrelstats->offnum = InvalidOffsetNumber;

Adding more context would be interesting for some cases, but not all,
contrary to what your patch does in some code paths like
lazy_truncate_heap(), where you would show an offset of 0 for an
invalid value.  This could confuse some users.  Note that we are
careful enough to not print a context message if the block number is
invalid.
--
Michael


signature.asc
Description: PGP signature


Re: handle a ECPG_bytea typo

2020-07-25 Thread Michael Paquier
On Sat, Jul 25, 2020 at 07:22:15AM +0530, vignesh C wrote:
> I felt the changes look correct. The reason it might be working
> earlier is because the structure members are the same for both the
> data structures ECPGgeneric_bytea & ECPGgeneric_varchar.

ECPGset_noind_null() and ECPGis_noind_null() in misc.c show that
ECPGgeneric_bytea is attached to ECPGt_bytea.  The two structures may
be the same now, but if a bug fix or a code change involves a change
in the structure definition we could run into problems.  So let's fix
and back-patch this change.  I am not spotting other areas impacted,
and I'll try to take care of it at the beginning of next week.
--
Michael


signature.asc
Description: PGP signature