Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila wrote:
>
> On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar wrote:
> >
> > Your changes look fine to me. Additionally, I have changed a test
> > case of getting the streaming changes in 0002. Instead of just
> > showing the count, I am showing that the transaction is actually
> > streaming.
> >
>
> If you want to show the changes then there is no need to display 157
> rows; probably a few (10-15) should be sufficient. If we can do that
> by increasing the size of the row then good; otherwise, I think it is
> better to retain the test to display the count.

I think existing test cases also display multiple lines, e.g. toast.out shows 235 rows. But maybe I will try to reduce it to a smaller number of rows.

> Today, I have again looked at the first patch
> (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
> find any more problems with it so planning to commit the same unless
> you or someone else want to add more to it. Just for ease of others,
> "the next patch extends the logical decoding output plugin API with
> stream methods". It adds seven methods to the output plugin API,
> adding support for streaming changes for large in-progress
> transactions. The methods are stream_start, stream_stop, stream_abort,
> stream_commit, stream_change, stream_message, and stream_truncate.
> Most of this is a simple extension of the existing methods, with the
> semantic difference that the transaction (or subtransaction) is
> incomplete and may be aborted later (which is something the regular
> API does not really need to deal with).
>
> This also extends the 'test_decoding' plugin, implementing these new
> stream methods. The stream_start/stream_stop methods are used to
> demarcate a chunk of changes streamed for a particular toplevel
> transaction.
>
> This commit simply adds these new APIs and the upcoming patch to
> "allow the streaming mode in ReorderBuffer" will use these APIs.
LGTM -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
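[Editor's note: for readers following along, the seven streaming callbacks described in the quoted commit message can be pictured as a struct of function pointers. This is only an illustrative sketch with hypothetical, simplified signatures; the real output-plugin API passes LogicalDecodingContext and ReorderBufferTXN pointers, not a bare xid.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the seven streaming callbacks named in the commit message.
 * ToyXid and these signatures are stand-ins, not the committed API.
 */
typedef uint32_t ToyXid;

typedef struct ToyStreamCallbacks
{
	void		(*stream_start) (ToyXid xid);
	void		(*stream_stop) (ToyXid xid);
	void		(*stream_abort) (ToyXid xid);
	void		(*stream_commit) (ToyXid xid);
	void		(*stream_change) (ToyXid xid, const char *change);
	void		(*stream_message) (ToyXid xid, const char *msg);
	void		(*stream_truncate) (ToyXid xid);
} ToyStreamCallbacks;

static int	calls;				/* counts callback invocations */

static void count_start(ToyXid xid) { (void) xid; calls++; }
static void count_stop(ToyXid xid) { (void) xid; calls++; }
static void count_change(ToyXid xid, const char *c) { (void) xid; (void) c; calls++; }

/*
 * As the commit message says, stream_start/stream_stop demarcate one chunk
 * of changes streamed for a toplevel transaction.
 */
static void
stream_one_chunk(const ToyStreamCallbacks *cb, ToyXid xid,
				 const char *const *changes, int nchanges)
{
	cb->stream_start(xid);
	for (int i = 0; i < nchanges; i++)
		cb->stream_change(xid, changes[i]);
	cb->stream_stop(xid);
}
```

The semantic twist relative to the regular (non-streaming) callbacks is that a streamed chunk belongs to a transaction that may still abort later, which is why stream_abort exists at all.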
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 5:31 PM Peter Geoghegan wrote:
> I'm glad that this better principled approach is possible. It's hard
> to judge how much of a problem this really is, though. We'll need to
> think about this aspect some more.

BTW, your HLL patch ameliorates the problem with my extreme "sorted vs random input" test case from this morning [1] (the thing that I just discussed with Tomas). Without the HLL patch the sorted case had 2424 batches. With the HLL patch it has 20. That at least seems like a notable improvement.

[1] https://postgr.es/m/CAH2-Wz=ur7MQKpaUZJP=Adtg0TPMx5M_WoNE=ke2vUU=amd...@mail.gmail.com

-- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 4:56 PM Jeff Davis wrote:
> I wrote a quick patch to use HyperLogLog to estimate the number of
> groups contained in a spill file. It seems to reduce the
> overpartitioning effect, and is a more principled approach than what I
> was doing before.

This pretty much fixes the issue that I observed with overpartitioning. At least in the sense that the number of partitions grows more predictably -- even when the number of partitions planned is reduced the change in the number of batches seems smooth-ish. It "looks nice".

> It does seem to hurt the runtime slightly when spilling to disk in some
> cases. I haven't narrowed down whether this is because we end up
> recursing multiple times, or if it's just more efficient to
> overpartition, or if the cost of doing the HLL itself is significant.

I'm glad that this better principled approach is possible. It's hard to judge how much of a problem this really is, though. We'll need to think about this aspect some more.

Thanks
-- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 5:05 PM Tomas Vondra wrote:
> I'm not sure what you mean by "reported memory usage doesn't reflect the
> space used for transition state"? Surely it does include that, we've
> built the memory accounting stuff pretty much exactly to do that.
>
> I think it's pretty clear what's happening - in the sorted case there's
> only a single group getting new values at any moment, so when we decide
> to spill we'll only add rows to that group and everything else will be
> spilled to disk.

Right.

> In the unsorted case however we manage to initialize all groups in the
> hash table, but at that point the groups are tiny and fit into work_mem.
> As we process more and more data the groups grow, but we can't evict
> them - at the moment we don't have that capability. So we end up
> processing everything in memory, but significantly exceeding work_mem.

work_mem was set to 200MB, which is more than the reported "Peak Memory Usage: 1605334kB". So either the random case significantly exceeds work_mem and the "Peak Memory Usage" accounting is wrong (because it doesn't report this excess), or the random case really doesn't exceed work_mem but has a surprising advantage over the sorted case.

-- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 10:07:37AM -0700, Peter Geoghegan wrote:
>On Sat, Jul 25, 2020 at 9:39 AM Peter Geoghegan wrote:
>> "Peak Memory Usage: 1605334kB"
>>
>> Hash agg avoids spilling entirely (so the planner gets it right this
>> time around). It even uses notably less memory.
>
>I guess that this is because the reported memory usage doesn't reflect
>the space used for transition state, which is presumably most of the
>total -- array_agg() is used in the query.

I'm not sure what you mean by "reported memory usage doesn't reflect the space used for transition state"? Surely it does include that, we've built the memory accounting stuff pretty much exactly to do that.

I think it's pretty clear what's happening - in the sorted case there's only a single group getting new values at any moment, so when we decide to spill we'll only add rows to that group and everything else will be spilled to disk.

In the unsorted case however we manage to initialize all groups in the hash table, but at that point the groups are tiny and fit into work_mem. As we process more and more data the groups grow, but we can't evict them - at the moment we don't have that capability. So we end up processing everything in memory, but significantly exceeding work_mem.

FWIW all my tests are done on the same TPC-H data set clustered by l_shipdate (so probably random with respect to other columns).

regards

-- Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Sat, 2020-07-25 at 13:27 -0700, Peter Geoghegan wrote:
> > It's not clear to me that overpartitioning is a real problem in this
> > case -- but I think the fact that it's causing confusion is enough
> > reason to see if we can fix it.
>
> I'm not sure about that either.
>
> FWIW I notice that when I reduce work_mem a little further (to 3MB)
> with the same query, the number of partitions is still 128, while the
> number of run time batches is 16,512 (an increase from 11,456 from 6MB
> work_mem). I notice that 16512/128 is 129, which hints at the nature
> of what's going on with the recursion. I guess it would be ideal if
> the growth in batches was more gradual as I subtract memory.

I wrote a quick patch to use HyperLogLog to estimate the number of groups contained in a spill file. It seems to reduce the overpartitioning effect, and is a more principled approach than what I was doing before.

It does seem to hurt the runtime slightly when spilling to disk in some cases. I haven't narrowed down whether this is because we end up recursing multiple times, or if it's just more efficient to overpartition, or if the cost of doing the HLL itself is significant.
Regards, Jeff Davis

From 424f611f442e36a91747fd39cd28acba42eb958b Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Sat, 25 Jul 2020 14:29:59 -0700
Subject: [PATCH] HLL

---
 src/backend/executor/nodeAgg.c | 64 ++
 src/include/executor/nodeAgg.h |  2 +-
 2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b79c845a6b7..25771cb4869 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -245,9 +245,11 @@
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "common/hashfn.h"
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -295,6 +297,14 @@
 #define HASHAGG_READ_BUFFER_SIZE BLCKSZ
 #define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ

+/*
+ * HyperLogLog is used for estimating the cardinality of the spilled tuples in
+ * a given partition. 5 bits corresponds to a size of about 32 bytes and a
+ * worst-case error of around 18%. That's effective enough to choose a
+ * reasonable number of partitions when recursing.
+ */
+#define HASHAGG_HLL_BIT_WIDTH 5
+
 /*
  * Estimate chunk overhead as a constant 16 bytes. XXX: should this be
  * improved?
@@ -339,6 +349,7 @@ typedef struct HashAggSpill
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
 	int			shift;			/* after masking, shift by this amount */
+	hyperLogLogState *hll_card;	/* cardinality estimate for contents */
 } HashAggSpill;

 /*
@@ -357,6 +368,7 @@ typedef struct HashAggBatch
 	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			input_tapenum;	/* input partition tape */
 	int64		input_tuples;	/* number of tuples in this batch */
+	double		input_card;		/* estimated group cardinality */
 } HashAggBatch;

 /* used to find referenced colnos */
@@ -409,7 +421,7 @@ static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 static long hash_choose_num_buckets(double hashentrysize,
 									long estimated_nbuckets,
 									Size memory);
-static int	hash_choose_num_partitions(uint64 input_groups,
+static int	hash_choose_num_partitions(double input_groups,
 									   double hashentrysize,
 									   int used_bits,
 									   int *log2_npartittions);
@@ -429,10 +441,11 @@ static void hashagg_finish_initial_spills(AggState *aggstate);
 static void hashagg_reset_spill_state(AggState *aggstate);
 static HashAggBatch *hashagg_batch_new(LogicalTapeSet *tapeset,
 									   int input_tapenum, int setno,
-									   int64 input_tuples, int used_bits);
+									   int64 input_tuples, double input_card,
+									   int used_bits);
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
-							   int used_bits, uint64 input_tuples,
+							   int used_bits, double input_groups,
 							   double hashentrysize);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 								TupleTableSlot *slot, uint32 hash);
@@ -1775,7 +1788,7 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  * substantially larger than the initial value.
  */
 void
-hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 					Size *mem_limit, uint64 *ngroups_limit,
 					int *num_partitions)
 {
@@ -1967,7 +1980,7 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  * *log2_npartitions to the
Re: hashagg slowdown due to spill changes
On Sat, Jul 25, 2020 at 12:41 PM Peter Geoghegan wrote:
> I have added a new open item for this separate
> LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.

Attached is a rebased version of Andres' now-bitrot 2020-06-12 patch ("aggspeed.diff").

I find that Andres' original "SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by the patch. Without the patch, v13 takes ~11.46 seconds. With the patch, it takes only ~10.64 seconds.

Didn't test it against v12 yet, but I have no reason to doubt Andres' explanation. I gather that if we can get this patch committed, we can close the relevant LookupTupleHashEntryHash() open item.

Can you take this off my hands, Jeff?

Thanks
-- Peter Geoghegan

0001-Fix-LookupTupleHashEntryHash-pipeline-stall-issue.patch
Description: Binary data
Re: [PATCH] Performance Improvement For Copy From Binary Files
Amit Langote writes:
> [ v7-0001-Improve-performance-of-binary-COPY-FROM-with-buff.patch ]

Pushed with cosmetic changes.

I'd always supposed that stdio does enough internal buffering that short fread()s shouldn't be much worse than memcpy(). But I reproduced your result of ~30% speedup for data with a lot of narrow text columns, using RHEL 8.2. Thinking this an indictment of glibc, I also tried it on current macOS, and saw an even bigger speedup, approaching 50%. So there's definitely something to this. I wonder if we ought to check other I/O-constrained users of fread and fwrite, like pg_dump/pg_restore.

A point that I did not see addressed in the thread is whether this has any negative impact on the copy-from-frontend code path, where there's no fread() to avoid; short reads from CopyGetData() are already going to be satisfied by memcpy'ing from the fe_msgbuf. However, a quick check suggests that this patch is still a small win for that case too --- apparently the control overhead in CopyGetData() is not negligible.

So the patch seems fine functionally, but there were some cosmetic things I didn't like:

* Removing CopyGetInt32 and CopyGetInt16 seemed like a pretty bad idea, because it made the callers much uglier and more error-prone. This is a particularly bad example:

 	/* Header extension length */
-	if (!CopyGetInt32(cstate, &tmp) ||
-		tmp < 0)
+	if (CopyReadBinaryData(cstate, (char *) &tmp, sizeof(tmp)) !=
+		sizeof(tmp) || (tmp = (int32) pg_ntoh32(tmp)) < 0)

Putting side-effects into late stages of an if-condition is just awful coding practice. They're easy for a reader to miss and they are magnets for bugs, because of the possibility that control doesn't reach that part of the condition. You can get the exact same speedup without any of those disadvantages by marking these two functions "inline", so that's what I did.

* I dropped the DRAIN_COPY_RAW_BUF macro too, as in my estimation it was a net negative for readability. With only two use-cases, having it made the code longer not shorter; I was also pretty unconvinced about the wisdom of having some of the loop's control logic inside the macro and some outside.

* BTW, the macro definitions weren't particularly per project style anyway. We generally put at least one space before line-ending backslashes. I don't think pgindent will fix this for you; IME it doesn't touch macro definitions at all.

* Did some more work on the comments.

regards, tom lane
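[Editor's note: the inline-helper pattern Tom describes can be illustrated outside the COPY code. FakeCopyState and fake_read below are hypothetical stand-ins for CopyStateData and CopyReadBinaryData, but copy_get_int32 mirrors the point: the side effects (the read and the byte-swap) live inside the inline function, so the caller's if-condition stays side-effect free.]

```c
#include <arpa/inet.h>			/* ntohl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the COPY state: reads come from an in-memory
 * buffer, returning the number of bytes actually copied (a short read
 * models hitting end-of-input mid-field).
 */
typedef struct
{
	const char *data;
	int			len;
	int			pos;
} FakeCopyState;

static int
fake_read(FakeCopyState *cs, char *dst, int n)
{
	int			avail = cs->len - cs->pos;

	if (n > avail)
		n = avail;
	memcpy(dst, cs->data + cs->pos, n);
	cs->pos += n;
	return n;
}

/*
 * Inline helper in the style Tom restored: callers can simply write
 * "if (!copy_get_int32(cs, &val) || val < 0)" with no assignments buried
 * late in the condition, and "inline" keeps the call overhead away.
 */
static inline int
copy_get_int32(FakeCopyState *cs, int32_t *val)
{
	uint32_t	buf;

	if (fake_read(cs, (char *) &buf, sizeof(buf)) != sizeof(buf))
		return 0;
	*val = (int32_t) ntohl(buf);	/* network to host byte order */
	return 1;
}
```

The short-read case returns 0 before the byte-swap ever runs, which is exactly the control-flow hazard the original one-big-condition version obscured.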
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 1:10 PM Jeff Davis wrote:
> On Sat, 2020-07-25 at 11:05 -0700, Peter Geoghegan wrote:
> > What worries me a bit is the sharp discontinuities when spilling with
> > significantly less work_mem than the "optimal" amount. For example,
> > with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
> > that setting work_mem to 6MB looks like this:
> > ...
> > Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk Usage: 2478080kB  HashAgg Batches: 128
> > ...
> > Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk Usage: 2482152kB  HashAgg Batches: 11456
>
> It's not clear to me that overpartitioning is a real problem in this
> case -- but I think the fact that it's causing confusion is enough
> reason to see if we can fix it.

I'm not sure about that either.

FWIW I notice that when I reduce work_mem a little further (to 3MB) with the same query, the number of partitions is still 128, while the number of run time batches is 16,512 (an increase from 11,456 from 6MB work_mem). I notice that 16512/128 is 129, which hints at the nature of what's going on with the recursion. I guess it would be ideal if the growth in batches was more gradual as I subtract memory.

-- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, 2020-07-25 at 11:05 -0700, Peter Geoghegan wrote:
> What worries me a bit is the sharp discontinuities when spilling with
> significantly less work_mem than the "optimal" amount. For example,
> with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
> that setting work_mem to 6MB looks like this:
> ...
> Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk Usage: 2478080kB  HashAgg Batches: 128
> ...
> Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk Usage: 2482152kB  HashAgg Batches: 11456
> ...
> My guess is that this is because the recursive hash aggregation
> misbehaves in a self-similar fashion once a certain tipping point has
> been reached.

It looks like it might be fairly easy to use HyperLogLog as an estimator for the recursive step. That should reduce the overpartitioning, which I believe is the cause of this discontinuity.

It's not clear to me that overpartitioning is a real problem in this case -- but I think the fact that it's causing confusion is enough reason to see if we can fix it.

Regards, Jeff Davis
Re: hashagg slowdown due to spill changes
On Fri, Jul 24, 2020 at 4:51 PM Andres Freund wrote:
> This is still not resolved. We're right now slower than 12. It's
> effectively not possible to do performance comparisons right now. This
> was nearly two months ago.

I have added a new open item for this separate LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue.

(For the record I mistakenly believed that commit 23023022 resolved all of the concerns raised on this thread, which is why I closed out the open item associated with this thread. Evidently work remains to fix a remaining regression that affects simple in-memory hash aggregation, though.)

-- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 10:23 AM Jeff Davis wrote:
> There's also another effect at work that can cause the total number of
> batches to be higher for larger work_mem values: when we do recurse, we
> again need to estimate the number of partitions needed. Right now, we
> overestimate the number of partitions needed (to be conservative),
> which leads to a wider fan-out and lots of tiny partitions, and
> therefore more batches.

What worries me a bit is the sharp discontinuities when spilling with significantly less work_mem than the "optimal" amount. For example, with Tomas' TPC-H query (against my smaller TPC-H dataset), I find that setting work_mem to 6MB looks like this:

  ->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500 width=40)
                     (actual time=21039.788..32278.703 rows=200 loops=1)
        Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
        Group Key: lineitem.l_partkey
        Planned Partitions: 128
        Peak Memory Usage: 6161kB
        Disk Usage: 2478080kB
        HashAgg Batches: 128

(And we have a sensible looking number of batches that match the number of planned partitions with higher work_mem settings, too.)

However, if I set work_mem to 5MB (or less), it looks like this:

  ->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500 width=40)
                     (actual time=20849.490..37027.533 rows=200 loops=1)
        Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
        Group Key: lineitem.l_partkey
        Planned Partitions: 128
        Peak Memory Usage: 5393kB
        Disk Usage: 2482152kB
        HashAgg Batches: 11456

So the number of partitions is still 128, but the number of batches explodes to 11,456 all at once. My guess is that this is because the recursive hash aggregation misbehaves in a self-similar fashion once a certain tipping point has been reached. I expect that the exact nature of that tipping point is very complicated, and generally dependent on the dataset, clustering, etc. But I don't think that this kind of effect will be uncommon.
(FWIW this example requires ~620MB work_mem to complete without spilling at all -- so it's kind of extreme, though not quite as extreme as many of the similar test results from Tomas.) -- Peter Geoghegan
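[Editor's note: the "16512/128 is 129" hint from later in the thread has a simple reading. If each of the p = 128 planned partitions itself overflows and is re-spilled into another p sub-partitions, the batch count jumps from p to p*(p+1) in one step, which is exactly the self-similar explosion Peter describes. This is my back-of-the-envelope arithmetic, not the executor's actual recursion logic:]

```c
#include <assert.h>

/*
 * Batches if every one of p initial partitions overflows and respills
 * into p sub-partitions: p first-level batches plus p*p second-level
 * ones, i.e. p * (p + 1).
 */
static long
batches_if_each_partition_respills(long p)
{
	return p * (p + 1);
}
```

With p = 128 that is 128 * 129 = 16,512, matching the 3MB work_mem observation; the 11,456 seen at 5MB would correspond to only a subset of partitions recursing.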
Re: Difference for Binary format vs Text format for client-server communication
On 2020-07-16 18:52, Andy Fan wrote:
> The reason I ask this is because I have a task to make numeric output
> similar to oracle.
>
> Oracle:
>
> SQL> select 2 / 1.0 from dual;
> 2/1.0
> --
> 2
>
> PG:
>
> postgres=# select 2 / 1.0;
> ?column?
> 2.
> (1 row)
>
> If the user uses text format, I can just hack some numeric_out
> function, but if they use binary format, looks like I have to change
> the driver they used for it. Do I understand it correctly?

I think what you should be looking at is why the numeric division function produces that scale and possibly make changes there. By the time the type's output or send function is invoked, that's already decided. The output/send functions are not the place to make scale or other semantic adjustments.

-- Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Thu, 2020-07-23 at 19:33 -0700, Peter Geoghegan wrote:
> That does make it sound like the costs of the hash agg aren't being
> represented. I suppose it isn't clear if this is a costing issue
> because it isn't clear if the execution time performance itself is
> pathological or is instead something that must be accepted as the cost
> of spilling the hash agg in a general kind of way.

I have a feeling that this is mostly a costing problem. Sort uses its memory in two different phases:

1. when writing the sorted runs, it needs the memory to hold the run before sorting it, and only a single buffer for the output tape; and
2. when merging, it needs a lot of read buffers

But in HashAgg, it needs to hold all of the groups in memory *at the same time* as it needs a lot of output buffers (one for each partition). This doesn't matter a lot at high values of work_mem, because the buffers will only be 8MB at most.

I did attempt to cost this properly: hash_agg_set_limits() takes into account the memory the partitions will use, and the remaining memory is what's used in cost_agg(). But there's a lot of room for error in there.

If someone sees an obvious error in the costing, please let me know. Otherwise, I think it will just take some time to make it better reflect reality in a variety of cases. For v13, we will either need to live with it, or pessimize the costing for HashAgg until we get it right.

Many costing issues can deal with a lot of slop -- e.g. HashJoin vs MergeJoin -- because a small factor often doesn't make the difference between plans. But HashAgg and Sort are more competitive with each other, so costing needs to be more precise.

Regards, Jeff Davis
Re: estimation problems for DISTINCT ON with FDW
On Fri, Jul 3, 2020 at 5:50 PM Tom Lane wrote:
>
> OK, I'll go ahead and push the patch I proposed yesterday.
>

Thank you. I tested 12_STABLE with my real queries on the real data set, and the "hard coded" estimate of 200 distinct rows (when use_remote_estimate is turned back on) is enough to get rid of the worst plans I was seeing in 12.3.

Cheers, Jeff
Re: Default setting for enable_hashagg_disk
On Fri, 2020-07-24 at 10:40 +0200, Tomas Vondra wrote:
> FWIW one more suspicious thing that I forgot to mention is the behavior
> of the "planned partitions" depending on work_mem, which looks like
> this:
>
>     2MB    Planned Partitions: 64     HashAgg Batches: 4160
>     4MB    Planned Partitions: 128    HashAgg Batches: 16512
>     8MB    Planned Partitions: 256    HashAgg Batches: 21488
>    64MB    Planned Partitions: 32     HashAgg Batches: 2720
>   256MB    Planned Partitions: 8      HashAgg Batches: 8
>
> I'd expect the number of planned partitions to decrease (slowly) as
> work_mem increases, but it seems to increase initially. Seems a bit
> strange, but maybe it's expected.

The space for open-partition buffers is also limited to about 25% of memory. Each open partition takes BLCKSZ memory, so those numbers are exactly what I'd expect (64*8192 = 512kB).

There's also another effect at work that can cause the total number of batches to be higher for larger work_mem values: when we do recurse, we again need to estimate the number of partitions needed. Right now, we overestimate the number of partitions needed (to be conservative), which leads to a wider fan-out and lots of tiny partitions, and therefore more batches.

I think we can improve this by using something like a HyperLogLog on the hash values of the spilled tuples to get a better estimate for the number of groups (and therefore the number of partitions) that we need when we recurse, which would reduce the number of overall batches at higher work_mem settings. But I didn't get a chance to implement that yet.

Regards, Jeff Davis
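[Editor's note: Jeff's 25%-of-work_mem explanation for the low-work_mem rows of the table can be checked directly. This sketch assumes one BLCKSZ (8kB) write buffer per open partition and is my reading of the numbers above, not the actual hash_choose_num_partitions() code:]

```c
#include <assert.h>

#define TOY_BLCKSZ 8192L

/*
 * Cap on open partitions if their write buffers may use about 25% of
 * work_mem and each open partition needs one TOY_BLCKSZ buffer.
 */
static long
max_open_partitions(long work_mem_bytes)
{
	return (work_mem_bytes / 4) / TOY_BLCKSZ;
}
```

That reproduces the 64/128/256 planned partitions for 2/4/8MB. At 64MB and above, this cap no longer binds; the estimated number of groups drives the (smaller) partition counts instead, which is why the column stops growing.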
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 9:39 AM Peter Geoghegan wrote:
> "Peak Memory Usage: 1605334kB"
>
> Hash agg avoids spilling entirely (so the planner gets it right this
> time around). It even uses notably less memory.

I guess that this is because the reported memory usage doesn't reflect the space used for transition state, which is presumably most of the total -- array_agg() is used in the query.

-- Peter Geoghegan
Re: Mark unconditionally-safe implicit coercions as leakproof
Robert Haas writes:
> On Fri, Jul 24, 2020 at 12:17 PM Tom Lane wrote:
>> I went through the system's built-in implicit coercions to see
>> which ones are unconditionally successful. These could all be
>> marked leakproof, as per attached patch.

> IMHO, this is a nice improvement.

Thanks; pushed.

On second reading I found that there are a few non-implicit coercions that could usefully be marked leakproof as well --- notably float4_numeric and float8_numeric, which should be error-free now that infinities can be converted.

regards, tom lane
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 12:55 PM Peter Geoghegan wrote:
> Could that be caused by clustering in the data?
>
> If the input data is in totally random order then we have a good
> chance of never having to spill skewed "common" values. That is, we're
> bound to encounter common values before entering spill mode, and so
> those common values will continue to be usefully aggregated until
> we're done with the initial groups (i.e. until the in-memory hash
> table is cleared in order to process spilled input tuples). This is
> great because the common values get aggregated without ever spilling,
> and most of the work is done before we even begin with spilled tuples.
>
> If, on the other hand, the common values are concentrated together in
> the input...

I still don't know if that was a factor in your example, but I can clearly demonstrate that the clustering of data can matter a lot to hash aggs in Postgres 13. I attach a contrived example where it makes a *huge* difference.

I find that the sorted version of the aggregate query takes significantly longer to finish, and has the following spill characteristics:

"Peak Memory Usage: 205086kB  Disk Usage: 2353920kB  HashAgg Batches: 2424"

Note that the planner doesn't expect any partitions here, but we still get 2424 batches -- so the planner seems to get it totally wrong. OTOH, the same query against a randomized version of the same data (no longer in sorted order, no clustering) works perfectly with the same work_mem (200MB):

"Peak Memory Usage: 1605334kB"

Hash agg avoids spilling entirely (so the planner gets it right this time around). It even uses notably less memory.

-- Peter Geoghegan

test-agg-sorted.sql
Description: Binary data
Re: Default setting for enable_hashagg_disk
On Fri, 2020-07-24 at 21:16 +0200, Tomas Vondra wrote:
> Surely more recursive spilling should do more I/O, but the Disk Usage
> reported by explain analyze does not show anything like ...

I suspect that's because of disk reuse in logtape.c.

Regards, Jeff Davis
Re: INSERT INTO SELECT, Why Parallelism is not selected?
Amit Kapila writes:
> On Fri, Jul 24, 2020 at 7:36 PM Tom Lane wrote:
>> Yeah, the proposed comment changes don't actually add much. Also
>> please try to avoid inserting non-ASCII into the source code;
>> at least in my mail reader, that attachment has some.

> I don't see any non-ASCII characters in the patch. I have applied and
> checked (via vi editor) the patch as well but don't see any non-ASCII
> characters. How can I verify that?

They're definitely there:

$ od -c 0001-Fix-comments-in-heapam.c.patch
...
0002740   h   e  \n   +  \t   *       l   e   a   d   e   r       c
0002760   a   n       p   e   r   f   o   r   m       t   h   e       i
0003000   n   s   e   r   t   .  302 240   T   h   i   s       r   e
0003020   s   t   r   i   c   t   i   o   n       c   a   n       b   e
0003040           u   p   l   i   f   t   e   d       o   n   c   e       w
0003060   e  \n   +  \t   *       a   l   l   o   w       t   h   e
0003100  302 240   p   l   a   n   n   e   r       t   o       g   e   n
0003120   e   r   a   t   e       p   a   r   a   l   l   e   l       p
0003140   l   a   n   s       f   o   r       i   n   s   e   r   t   s
0003160   .  \n  \t   *   /  \n  \t   i   f       (   I   s
...

I'm not sure if "git diff --check" would whine about this.

regards, tom lane
Re: Add header support to text format and matching feature
Rémi Lapeyre wrote:
> Here’s a new version that fixes all the issues.

Here's a review of v4.

The patch improves COPY in two ways:

- COPY TO and COPY FROM now accept "HEADER ON" for the TEXT format (previously it was only for CSV)

- COPY FROM also accepts "HEADER match" to tell that there's a header and that its contents must match the columns of the destination table. This works for both the CSV and TEXT formats. The syntax for the columns is the same as for the data and the match is case-sensitive.

The first improvement when submitted alone (in 2018 by Simon Muller) has been judged not useful enough or even hazardous without any "match" feature. It was returned with feedback in 2018-10 and resubmitted by Rémi in 2020-02 with the match feature.

The patches apply cleanly, "make check" and "make check-world" pass.

In my tests it works fine except for one crash that I can reproduce on a fresh build and default configuration with:

$ cat >file.txt
i
1
$ psql
postgres=# create table x(i int);
CREATE TABLE
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
PANIC:  ERRORDATA_STACK_SIZE exceeded
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I suspect the reason is the way that PG_TRY/PG_CATCH is used, see below.

Code comments:

+/*
+ * Represents whether the header must be absent, present or present and match.
+ */
+typedef enum CopyHeader
+{
+	COPY_HEADER_ABSENT,
+	COPY_HEADER_PRESENT,
+	COPY_HEADER_MATCH
+} CopyHeader;
+
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,

@@ -136,7 +146,7 @@
 	bool		binary;			/* binary format? */
 	bool		freeze;			/* freeze rows on loading? */
 	bool		csv_mode;		/* Comma Separated Value format? */
-	bool		header_line;	/* CSV or text header line? */
+	CopyHeader	header_line;	/* CSV or text header line? */

After the redefinition into this enum type, there are still a bunch of references to header_line that treat it like a boolean:

1190:	if (cstate->header_line)
1398:	if (cstate->binary && cstate->header_line)
2119:	if (cstate->header_line)
3635:	if (cstate->cur_lineno == 0 && cstate->header_line)

It works fine since COPY_HEADER_ABSENT is 0 as the first value of the enum, but maybe it's not good style to count on that.

+			PG_TRY();
+			{
+				if (defGetBoolean(defel))
+					cstate->header_line = COPY_HEADER_PRESENT;
+				else
+					cstate->header_line = COPY_HEADER_ABSENT;
+			}
+			PG_CATCH();
+			{
+				char	   *sval = defGetString(defel);
+
+				if (!cstate->is_copy_from)
+					PG_RE_THROW();
+
+				if (pg_strcasecmp(sval, "match") == 0)
+					cstate->header_line = COPY_HEADER_MATCH;
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("header requires a boolean or \"match\"")));
+			}
+			PG_END_TRY();

It seems wrong to use a PG_CATCH block for this. I understand that it's because defGetBoolean() calls ereport() on non-booleans, but then it should be split into an error-throwing function and a non-error-throwing lexical analysis of the boolean, the above code calling the latter.

Besides, the comments in elog.h above PG_TRY say that "the error recovery code can either do PG_RE_THROW to propagate the error outwards, or do a (sub)transaction abort. Failure to do so may leave the system in an inconsistent state for further processing." Maybe this is what happens with the repeated uses of "match" eventually failing with ERRORDATA_STACK_SIZE exceeded.
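[Editor's note: to make the suspected failure mode concrete, here is a toy model of the error stack using plain setjmp/longjmp, not elog.c. The premise, matching the elog.h comment quoted above, is that an error caught by PG_CATCH stays on the errordata stack until the catch block re-throws or aborts a (sub)transaction; swallow enough errors without doing either and you exceed the stack size, which is the PANIC the reviewer reproduces. The names and the limit of 5 are all illustrative:]

```c
#include <assert.h>
#include <setjmp.h>

#define TOY_ERRORDATA_STACK_SIZE 5

static int	errordata_depth = 0;	/* entries left on the error stack */
static jmp_buf *current_catch;

/* Toy ereport(): push an errordata entry, then jump to the catch block. */
static void
toy_ereport(void)
{
	errordata_depth++;
	longjmp(*current_catch, 1);
}

/*
 * One PG_TRY/PG_CATCH round where the catch block swallows the error
 * without re-throwing or flushing it.  Returns 1 once the accumulated
 * entries exceed the toy stack size (the "PANIC").
 */
static int
run_one_bad_catch(void)
{
	jmp_buf		buf;

	current_catch = &buf;
	if (setjmp(buf) == 0)
		toy_ereport();			/* the "defGetBoolean" failure */
	/* catch path: error swallowed, entry never popped */
	return errordata_depth >= TOY_ERRORDATA_STACK_SIZE;
}
```

Each `\copy ... with (header match)` run in the transcript above would correspond to one swallowed error, which is consistent with the crash appearing only after several repetitions.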
-HEADER [ boolean ]
+HEADER { match | true | false }

This should be enclosed in square brackets, because HEADER with no
argument is still accepted.

+       names from the table. On input, the first line is discarded when set
+       to
Re: proposal: possibility to read dumped table's name from file
On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule wrote:
>> I meant can this:
>> printf(_(" --filter=FILENAME read object name filter
>> expressions from file\n"));
>> be changed to:
>> printf(_(" --filter=FILENAME dump objects and data based
>> on the filter expressions from the filter file\n"));
>
> done in today patch
>

Thanks for fixing the comments. A few comments:

+	/* use "-" as symbol for stdin */
+	if (strcmp(filename, "-") != 0)
+	{
+		fp = fopen(filename, "r");
+		if (!fp)
+			fatal("could not open the input file \"%s\": %m",
+				  filename);
+	}
+	else
+		fp = stdin;

We could use STDIN itself instead of "-"; it would be an easier
option to understand.

+	/* when first char is hash, ignore whole line */
+	if (*line == '#')
+		continue;

If a line starts with #, we ignore that line. I feel this behavior
should be included in the documentation.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
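For reference, the two behaviors under review can be sketched as a pair of small helpers. This is a self-contained approximation of the patch's logic, not the patch itself; open_filter_file() and read_filter_line() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the filter-file conventions being reviewed: "-" selects
 * stdin, and lines whose first character is '#' are treated as
 * comments and skipped.
 */
static FILE *
open_filter_file(const char *filename)
{
	/* "-" is the conventional spelling for "read from stdin" */
	if (strcmp(filename, "-") == 0)
		return stdin;
	return fopen(filename, "r");	/* caller reports the error on NULL */
}

/* Return the next non-comment line, or NULL at end of file. */
static char *
read_filter_line(FILE *fp, char *buf, size_t buflen)
{
	while (fgets(buf, (int) buflen, fp) != NULL)
	{
		if (buf[0] == '#')		/* whole line is a comment */
			continue;
		return buf;
	}
	return NULL;
}
```

Vignesh's two points map directly onto these helpers: whether "-" is clearer than a literal STDIN keyword, and that the '#'-comment rule deserves a mention in the pg_dump documentation.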
Re: Improving connection scalability: GetSnapshotData()
On Fri, Jul 24, 2020 at 9:00 PM, Andres Freund wrote:
> On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:
> > On Fri, Jul 24, 2020 at 2:16 PM, Andres Freund wrote:
> > > On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > > > Latest Postgres
> > > > Windows 64 bits
> > > > msvc 2019 64 bits
> > > >
> > > > Patches applied v12-0001 to v12-0007:
> > > >
> > > > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
> > > > 'GetOldestXmin' undefined; assuming extern returning int
> > > > [C:\dll\postgres
> > > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> > > > C4013: 'GetOldestXmin' undefined; assuming extern returning int
> > > > [C:\dll\postgres\pg_visibility.vcxproj]
> > > > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> > > > 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > > > [C:\dll\postgres\pgstattuple.vcxproj]
> > > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> > > > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > > > [C:\dll\postgres\pg_visibility.vcxproj]
> > > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> > > > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > > > [C:\dll\postgres\pg_visibility.vcxproj]
> > >
> > > I don't know what that's about - there's no call to GetOldestXmin() in
> > > pgstatapprox and pg_visibility after patch 0002? And similarly, the
> > > PROCARRAY_* references are also removed in the same patch?
> > >
> > Maybe need to remove them from these places, not?
> > C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
> > File pgstattuple\pgstatapprox.c:
> > OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
> > File pg_visibility\pg_visibility.c:
> > OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
> > * deadlocks, because surely GetOldestXmin() should never take
> > RecomputedOldestXmin = GetOldestXmin(NULL,
> > PROCARRAY_FLAGS_VACUUM);
>
> The 0002 patch changed those files:
>
> diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
> index 68d580ed1e0..37206c50a21 100644
> --- a/contrib/pg_visibility/pg_visibility.c
> +++ b/contrib/pg_visibility/pg_visibility.c
> @@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
> 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
> 	TransactionId OldestXmin = InvalidTransactionId;
>
> -	if (all_visible)
> -	{
> -		/* Don't pass rel; that will fail in recovery. */
> -		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
> -	}
> -
> 	rel = relation_open(relid, AccessShareLock);
>
> 	/* Only some relkinds have a visibility map */
> 	check_relation_relkind(rel);
>
> +	if (all_visible)
> +		OldestXmin = GetOldestNonRemovableTransactionId(rel);
> +
> 	nblocks = RelationGetNumberOfBlocks(rel);
>
> 	/*
> @@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
> 				 * From a concurrency point of view, it sort of sucks to
> 				 * retake ProcArrayLock here while we're holding the buffer
> 				 * exclusively locked, but it should be safe against
> -				 * deadlocks, because surely GetOldestXmin() should never take
> -				 * a buffer lock. And this shouldn't happen often, so it's
> -				 * worth being careful so as to avoid false positives.
> +				 * deadlocks, because surely GetOldestNonRemovableTransactionId()
> +				 * should never take a buffer lock. And this shouldn't happen
> +				 * often, so it's worth being careful so as to avoid false
> +				 * positives.
> 				 */
> -				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
> +				RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel);
>
> 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
> 					record_corrupt_item(items, &tuple.t_self);
>
> diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
> index dbc0fa11f61..3a99333d443 100644
> --- a/contrib/pgstattuple/pgstatapprox.c
> +++ b/contrib/pgstattuple/pgstatapprox.c
> @@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
> 	BufferAccessStrategy bstrategy;
> 	TransactionId OldestXmin;
>
> -	OldestXmin = GetOldestXmin(rel,
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar wrote:
>
> Your changes look fine to me. Additionally, I have changed a test
> case of getting the streaming changes in 0002. Instead of just
> showing the count, I am showing that the transaction is actually
> streaming.
>

If you want to show the changes then there is no need to display 157
rows; a few (10-15) should be sufficient. If we can do that by
increasing the size of the rows then good; otherwise, I think it is
better to retain the test that displays the count.

Today, I have again looked at the first patch
(v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
find any more problems with it, so I am planning to commit it unless
you or someone else wants to add more to it. Just for the ease of
others: the next patch extends the logical decoding output plugin API
with stream methods. It adds seven methods to the output plugin API,
adding support for streaming changes for large in-progress
transactions. The methods are stream_start, stream_stop, stream_abort,
stream_commit, stream_change, stream_message, and stream_truncate.
Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction (or subtransaction) is
incomplete and may be aborted later (which is something the regular
API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these new
stream methods. The stream_start/stream_stop methods are used to
demarcate a chunk of changes streamed for a particular toplevel
transaction.

This commit simply adds these new APIs, and the upcoming patch to
"allow the streaming mode in ReorderBuffer" will use these APIs.

--
With Regards,
Amit Kapila.
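The shape of the seven-callback extension described above can be sketched as follows. The types here are deliberately simplified stand-ins (the real signatures live in the output plugin headers), and for simplicity the sketch treats all seven callbacks as required for streaming; exactly which of them may be left unset is defined by the patch itself, not by this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for the stream callback signatures; the real
 * ones take a decoding context, the transaction, change data, etc.
 */
typedef void (*StreamCallback) (void);

/* The seven stream methods the commit adds to the output plugin API. */
typedef struct StreamCallbacks
{
	StreamCallback stream_start_cb;		/* begin a chunk of streamed changes */
	StreamCallback stream_stop_cb;		/* end a chunk of streamed changes */
	StreamCallback stream_abort_cb;		/* streamed (sub)xact aborted */
	StreamCallback stream_commit_cb;	/* streamed xact committed */
	StreamCallback stream_change_cb;	/* one streamed change */
	StreamCallback stream_message_cb;	/* streamed logical message */
	StreamCallback stream_truncate_cb;	/* streamed TRUNCATE */
} StreamCallbacks;

/*
 * Illustrative rule: enable streaming only when the plugin provides
 * the full set of stream methods; a partial set means "no streaming".
 */
static int
streaming_supported(const StreamCallbacks *cb)
{
	return cb->stream_start_cb && cb->stream_stop_cb &&
		cb->stream_abort_cb && cb->stream_commit_cb &&
		cb->stream_change_cb && cb->stream_message_cb &&
		cb->stream_truncate_cb;
}

static void noop(void) {}
```

The stream_start/stream_stop pair bracketing each chunk is what lets a subscriber apply or discard streamed changes for a transaction that may still abort later.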
Re: display offset along with block number in vacuum errors
On Fri, Jul 24, 2020 at 11:18:43PM +0530, Mahendra Singh Thalor wrote:
> In commit b61d161 (Introduce vacuum errcontext to display additional
> information), we added vacuum errcontext to display additional
> information (block number) so that in case of a vacuum error, we can
> identify at which block we are getting the error. In addition to the
> block number, if we can display the offset, then it will be more
> helpful for users. So to display the offset, here I am proposing two
> different methods (thanks Robert for suggesting these 2 methods):

	new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
	vacrelstats->blkno = new_rel_pages;
+	vacrelstats->offnum = InvalidOffsetNumber;

Adding more context would be interesting for some cases, but not all,
contrary to what your patch does in some code paths like
lazy_truncate_heap(), where you would show an offset of 0 for an
invalid value. This could confuse some users. Note that we are careful
enough to not print a context message if the block number is invalid.
--
Michael
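Michael's point can be sketched as follows: the context message should mention the offset only when it is valid, mirroring how an invalid block number already suppresses the context entirely. The function and message strings below are hypothetical stand-ins, not the patch's actual errcontext callback; the InvalidOffsetNumber/InvalidBlockNumber values match the conventions in the PostgreSQL headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned int BlockNumber;
typedef unsigned short OffsetNumber;

#define InvalidBlockNumber	((BlockNumber) 0xFFFFFFFF)
#define InvalidOffsetNumber ((OffsetNumber) 0)

/*
 * Hypothetical sketch of a vacuum errcontext message builder: an
 * invalid block number suppresses the context altogether, and an
 * invalid offset (0) is simply left out rather than printed.
 */
static void
vacuum_error_context(char *buf, size_t buflen,
					 const char *relname, BlockNumber blkno,
					 OffsetNumber offnum)
{
	if (blkno == InvalidBlockNumber)
	{
		buf[0] = '\0';			/* no context message at all */
		return;
	}
	if (offnum == InvalidOffsetNumber)
		snprintf(buf, buflen, "while scanning block %u of relation \"%s\"",
				 blkno, relname);
	else
		snprintf(buf, buflen,
				 "while scanning block %u, offset %u of relation \"%s\"",
				 blkno, (unsigned) offnum, relname);
}
```

This is exactly the distinction drawn above: printing "offset 0" in paths like lazy_truncate_heap(), where the offset carries no meaning, would mislead users, so the invalid value must suppress that part of the message.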
Re: handle a ECPG_bytea typo
On Sat, Jul 25, 2020 at 07:22:15AM +0530, vignesh C wrote:
> I felt the changes look correct. The reason it might be working
> earlier is because the structure members are the same for both the
> data structures ECPGgeneric_bytea & ECPGgeneric_varchar.

ECPGset_noind_null() and ECPGis_noind_null() in misc.c show that
ECPGgeneric_bytea is attached to ECPGt_bytea. The two structures may
be the same now, but if a bug fix or a code change involves a change
in the structure definition, we could run into problems. So let's fix
and back-patch this change. I am not spotting other areas impacted,
and I'll try to take care of it at the beginning of next week.
--
Michael
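A minimal illustration of why the typo went unnoticed: when two struct types happen to share the same member layout, accessing one through a pointer to the other behaves as expected, so nothing fails until one of the definitions changes. The structs below are simplified mirrors of ECPG's, not the actual ecpglib definitions.

```c
#include <assert.h>

/*
 * Simplified mirrors of ECPG's generic structs.  Today both happen to
 * have identical members, which is why using the wrong one "worked".
 */
struct generic_varchar
{
	int		len;
	char	arr[32];
};

struct generic_bytea
{
	int		len;
	char	arr[32];
};

/*
 * Reading through the "wrong" struct type keeps working only as long
 * as the two layouts coincide; any change to either definition would
 * silently break callers like this one.
 */
static int
read_len_via_bytea(void *p)
{
	return ((struct generic_bytea *) p)->len;
}
```

This is exactly the hazard Michael describes: the code is wrong by type even though it is currently right by layout, so fixing and back-patching it now avoids a latent breakage.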