Re: Code cleanup patch submission for extended_stats.c

2017-11-25 Thread Michael Paquier
On Sun, Nov 26, 2017 at 11:07 AM, Mark Dilger  wrote:
>
>> On Nov 25, 2017, at 2:05 PM, Tom Lane  wrote:
>>
>> Mark Dilger  writes:
>>> It looks to me like Alvaro introduced this in the original version of the
>>> file which was created in commit 7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b.
>>> Grep'ing through the code base, it seems the following would be more
>>> consistent with how these initializations are handled elsewhere:
>>
>>>memset(nulls, 1, sizeof(nulls));
>>>memset(replaces, 0, sizeof(replaces));
>>>memset(values, 0, sizeof(values));
>>
>> +1.  I'd be inclined to use "false" and "true" for the init values of
>> the boolean arrays, too.
>
> Done.

That's better practice. More places deserve the same treatment if you
grep for them:
$ git grep memset | grep nulls | grep "[1|0]"
contrib/dblink/dblink.c:memset(nulls, 0, sizeof(nulls));
contrib/pageinspect/ginfuncs.c:memset(nulls, 0, sizeof(nulls));
contrib/pageinspect/ginfuncs.c:memset(nulls, 0, sizeof(nulls));
contrib/pageinspect/ginfuncs.c:memset(nulls, 0, sizeof(nulls));
contrib/pageinspect/heapfuncs.c:memset(nulls, 0, sizeof(nulls));
contrib/pageinspect/rawpage.c:memset(nulls, 0, sizeof(nulls));
contrib/pg_stat_statements/pg_stat_statements.c:memset(nulls, 0, sizeof(nulls));
contrib/pgstattuple/pgstatapprox.c:memset(nulls, 0, sizeof(nulls));
src/backend/access/heap/heapam.c:memset(nulls, 1, sizeof(nulls));
src/backend/catalog/pg_collation.c:memset(nulls, 0, sizeof(nulls));
src/backend/catalog/pg_range.c:memset(nulls, 0, sizeof(nulls));
src/backend/commands/extension.c:memset(nulls, 0, sizeof(nulls));
src/backend/commands/extension.c:memset(nulls, 0, sizeof(nulls));
src/backend/commands/extension.c:memset(nulls, 0, sizeof(nulls));
src/backend/commands/extension.c:memset(nulls, 0, sizeof(nulls));
src/backend/commands/extension.c:memset(nulls, 0, sizeof(nulls));
src/backend/commands/sequence.c:memset(pgs_nulls, 0, sizeof(pgs_nulls));
src/backend/commands/tablecmds.c:memset(nulls, 0, sizeof(nulls));
src/backend/libpq/hba.c:memset(nulls, 0, sizeof(nulls));
src/backend/libpq/hba.c:memset(&nulls[1], true, (NUM_PG_HBA_FILE_RULES_ATTS - 2) * sizeof(bool));
src/backend/replication/logical/logicalfuncs.c:memset(nulls, 0, sizeof(nulls));
src/backend/replication/logical/origin.c:memset(&nulls, 0, sizeof(nulls));
src/backend/replication/logical/origin.c:memset(nulls, 1, sizeof(nulls));
src/backend/replication/slotfuncs.c:memset(nulls, 0, sizeof(nulls));
src/backend/replication/slotfuncs.c:memset(nulls, 0, sizeof(nulls));
src/backend/replication/walsender.c:memset(nulls, 0, sizeof(nulls));
src/backend/statistics/extended_stats.c:memset(nulls, 1, Natts_pg_statistic_ext * sizeof(bool));
src/backend/utils/adt/genfile.c:memset(nulls, 0, sizeof(nulls));
src/backend/utils/misc/guc.c:memset(nulls, 0, sizeof(nulls));
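
For reference, the end state being suggested upthread for the
extended_stats.c case -- letting sizeof() track the arrays and using
boolean literals for the bool arrays -- would look roughly like this
(a sketch; the array names come from the patch under discussion):

    Datum       values[Natts_pg_statistic_ext];
    bool        nulls[Natts_pg_statistic_ext];
    bool        replaces[Natts_pg_statistic_ext];

    memset(nulls, true, sizeof(nulls));
    memset(replaces, false, sizeof(replaces));
    memset(values, 0, sizeof(values));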
-- 
Michael



Re: no library dependency in Makefile?

2017-11-25 Thread 高增琦
Hi, all

Update version:
1. Re-base with head of master
2. Add some basic support for PGXS

After replacing submake-libpgport/submake-libpgfeutils with
$(stlib_pgport)/$(stlib_pgfeutils) in the Makefiles of the client programs,
I think we should also add static library dependencies for PGXS.

I can think of two ways to do that: first, add a new PG_STLIBS variable to
which users add their static libraries; second, generate the static library
dependencies automatically from the PG_LIBS variable. Which one is better?

Thanks

2017-11-20 17:00 GMT+08:00 高增琦 :

> The attached patch use normal dependency instead of order-only dependency
> for static libraries.
>
> 2017-11-20 12:58 GMT+08:00 高增琦 :
>
>>
>>
>> 2017-11-20 2:25 GMT+08:00 Tom Lane :
>>
>>> 高增琦  writes:
>>> > I very much look forward to hearing everyone's views on this issue.
>>> > If the solution mentioned before is ok, I will start to complete it.
>>>
>>> Please don't top-post, it makes the flow of the conversation very hard
>>> to follow.
>>>
>>
>> Sorry, my fault; I was too hasty.
>>
>>
>>> > Recently, I found 'psql' is not rebuilt automatically after
>>> > editing 'fe_utils/psqlscan.l'.
>>> > ...
>>> > It's OK for 'libpq' since 'libpq' is a dynamic library.
>>> > For a static library such as 'libpgfeutils.a', should we
>>> > add dependency rule in Makefile?
>>>
>>> Hm.  I think what you're saying is that when we copied the makefile
>>> patterns used for libpq.so to use for libpgport and libpgfeutils,
>>> we did the wrong thing because those are static not dynamic libraries.
>>> We don't have to relink psql if libpq.so gets some non-API-relevant
>>> changes, but we do need to relink it if libpgport.a does, so I suppose
>>> you are right.  However, this doesn't seem like the right way to
>>> fix it:
>>>
>>> > # add STLIBS as normal prerequisite
>>> > psql: $(OBJS) $(STLIBS) | submake-libpq submake-libpgport submake-libpgfeutils
>>> >   $(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
>>>
>>> Your point is that the static libraries should be treated as normal
>>> dependencies not order-only ones, so building that behavior like
>>> this seems pretty bizarre.
>>>
>>> I think what we want is something more like
>>>
>>> ../../src/port/libpgport.a: submake-libpgport
>>>
>>> ../../src/fe_utils/libpgfeutils.a: submake-libpgfeutils
>>>
>>> psql: $(OBJS) ../../src/port/libpgport.a ../../src/fe_utils/libpgfeutils.a | submake-libpq
>>>         $(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
>>>
>>> where of course the library file names need to be wrapped up in macros,
>>> but this is what it'd look like after macro expansion.  (I'm not sure
>>> this is right in detail, but my point is that we don't want order-only
>>> dependencies for these libraries.)
>>>
>>> regards, tom lane
>>>
>>
>> Thank you, I will try it this way.
>>
>>
>> --
>> GaoZengqi
>> pgf...@gmail.com
>> zengqi...@gmail.com
>>
>
>
>
> --
> GaoZengqi
> pgf...@gmail.com
> zengqi...@gmail.com
>



-- 
GaoZengqi
pgf...@gmail.com
zengqi...@gmail.com


0001-Add-build-dependency-between-client-and-static-lib.patch
Description: Binary data


Re: [PATCH] Fix crash in int8_avg_combine().

2017-11-25 Thread Tom Lane
Hadi Moshayedi  writes:
> While doing some tests on REL_10_STABLE, I was getting run-time exceptions
> at int8_avg_combine() at the following line:
> state1->sumX = state2->sumX;
> After some debugging, I noticed that palloc()'s alignment is 8 bytes, while
> this statement (which moves a __int128 from one memory location to another)
> expects 16-byte alignment. So when either state1 or state2 is not 16-byte
> aligned, this crashes.

I believe we already dealt with this:

Author: Tom Lane 
Branch: REL_10_STABLE [619a8c47d] 2017-11-14 17:49:49 -0500
Branch: REL9_6_STABLE [4a15f87d2] 2017-11-14 17:49:49 -0500
Branch: REL9_5_STABLE [d4e38489f] 2017-11-14 17:49:49 -0500

Prevent int128 from requiring more than MAXALIGN alignment.

Our initial work with int128 neglected alignment considerations, an
oversight that came back to bite us in bug #14897 from Vincent Lachenal.
It is unsurprising that int128 might have a 16-byte alignment requirement;
what's slightly more surprising is that even notoriously lax Intel chips
sometimes enforce that.

Raising MAXALIGN seems out of the question: the costs in wasted disk and
memory space would be significant, and there would also be an on-disk
compatibility break.  Nor does it seem very practical to try to allow some
data structures to have more-than-MAXALIGN alignment requirement, as we'd
have to push knowledge of that throughout various code that copies data
structures around.

The only way out of the box is to make type int128 conform to the system's
alignment assumptions.  Fortunately, gcc supports that via its
__attribute__(aligned()) pragma; and since we don't currently support
int128 on non-gcc-workalike compilers, we shouldn't be losing any platform
support this way.

Although we could have just done pg_attribute_aligned(MAXIMUM_ALIGNOF) and
called it a day, I did a little bit of extra work to make the code more
portable than that: it will also support int128 on compilers without
__attribute__(aligned()), if the native alignment of their 128-bit-int
type is no more than that of int64.

Add a regression test case that exercises the one known instance of the
problem, in parallel aggregation over a bigint column.

Back-patch of commit 751804998.  The code known to be affected only exists
in 9.6 and later, but we do have some stuff using int128 in 9.5, so patch
back to 9.5.

Discussion: 
https://postgr.es/m/20171110185747.31519.28...@wrigleys.postgresql.org

Does that solution not work for you?
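
For readers following along, the gist of that fix is to cap the 128-bit
type's alignment at MAXALIGN via a typedef attribute. A minimal standalone
sketch (not the actual c.h code; the typedef name and the hard-coded 8-byte
figure stand in for MAXIMUM_ALIGNOF and are assumptions for illustration):

    #include <stdio.h>
    #include <stdalign.h>

    /*
     * Sketch only: declare a 128-bit integer type whose assumed alignment
     * is capped at 8 bytes, so that 8-byte-aligned palloc chunks remain
     * acceptable when the compiler copies such values.  Requires a
     * gcc-workalike compiler that provides __int128.
     */
    typedef __int128 pg_int128_capped __attribute__((aligned(8)));

    int
    main(void)
    {
        printf("native __int128 alignment: %zu\n", alignof(__int128));
        printf("capped typedef alignment:  %zu\n", alignof(pg_int128_capped));
        return 0;
    }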

regards, tom lane



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Alvaro Herrera
Mark Dilger wrote:


> I understand project policy to allow elog for error conditions that will be
> reported in "can't happen" type situations, similar to how an Assert would
> be used.  For conditions that can happen through (mis)use by the user,
> ereport is appropriate.  Not knowing whether you thought these elogs were
> reporting conditions that a user could cause, I did not know if you should
> change them to ereports, or if you should just add a brief comment along
> the lines of /* should not be possible */.

Two things dictate that policy:

1. messages are translated by default for ereport but not for elog.
Both things can be overridden, but we tend not to do it unless there's
no choice.

2. you can assign SQLSTATE only with ereport.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Tom Lane
Mark Dilger  writes:
>> On Nov 25, 2017, at 3:33 PM, Tomas Vondra  
>> wrote:
>> I might be missing something, but why would ereport be more appropriate
>> than elog? Ultimately, there's not much difference between elog(ERROR)
>> and ereport(ERROR) - both will cause a failure.

The core technical differences are (1) an ereport message is exposed for
translation, normally, while an elog is not; and (2) with ereport you can
set the errcode, whereas with elog it's always going to be XX000
(ERRCODE_INTERNAL_ERROR).

> I understand project policy to allow elog for error conditions that will be
> reported in "can't happen" type situations, similar to how an Assert would
> be used.  For conditions that can happen through (mis)use by the user,
> ereport is appropriate.

The project policy about this is basically that elog should only be used
for things that are legitimately "internal errors", ie not user-facing.
If there's a deterministic way for a user to trigger the error, or if
it can reasonably be expected to occur during normal operation, it should
definitely have an ereport (and a non-default errcode).
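
To make the distinction concrete, a hypothetical example (the function,
variable names, and messages are invented for illustration; elog, ereport,
errcode, and errmsg are the real backend facilities):

    #include "postgres.h"

    /* Hypothetical helper, for illustration only. */
    static void
    check_stats_kind(int kind, const char *stxname)
    {
        /* internal "can't happen" case: elog, SQLSTATE is implicitly XX000 */
        if (kind < 0)
            elog(ERROR, "unexpected statistics kind %d", kind);

        /* user-triggerable, deterministic error: ereport with an errcode */
        if (kind > 4)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("statistics object \"%s\" does not support kind %d",
                            stxname, kind)));
    }

The ereport message is also picked up for translation by default, per
Alvaro's point upthread.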

regards, tom lane



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Tomas Vondra


On 11/26/2017 02:17 AM, Mark Dilger wrote:
> 
>> On Nov 25, 2017, at 3:33 PM, Tomas Vondra  
>> wrote:
>>
>>
>>
>> On 11/25/2017 10:01 PM, Mark Dilger wrote:
>>>
 On Nov 18, 2017, at 12:28 PM, Tomas Vondra  
 wrote:

 Hi,

 Attached is an updated version of the patch, adopting the psql describe
 changes introduced by 471d55859c11b.
>>>
>>> Hi Tomas,
>>>
>>> In src/backend/statistics/dependencies.c, you have introduced a comment:
>>>
>>> +   /*
>>> +* build an array of SortItem(s) sorted using the multi-sort support
>>> +*
>>> +* XXX This relies on all stats entries pointing to the same tuple
>>> +* descriptor. Not sure if that might not be the case.
>>> +*/
>>>
>>> Would you mind explaining that a bit more for me?  I don't understand
>>> exactly what you mean here, but it sounds like the sort of thing that
>>> needs to be clarified/fixed before it can be committed.  Am I
>>> misunderstanding this?
>>>
>>
>> The call right after that comment is
>>
>>items = build_sorted_items(numrows, rows, stats[0]->tupDesc,
>>   mss, k, attnums_dep);
>>
>> That method processes an array of tuples, and the structure is defined
>> by "tuple descriptor" (essentially a list of attribute info - data type,
>> length, ...). We get that from stats[0] and assume all the entries point
>> to the same tuple descriptor. That's generally safe assumption, I think,
>> because all the stats entries relate to columns from the same table.
> 
> Right, I got that, and tried mocking up some code to test that in an Assert.
> I did not pursue that far enough to reach any conclusion, however.  You
> seem to be indicating in the comment some uncertainty about whether the
> assumption is safe.  Do we need to dig into that further?
> 

I don't think it's worth the effort, really. I don't think we can really
get mismatching tuple descriptors here - that could only happen with
columns coming from different tables, or something similarly obscure.
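
A sanity check along the lines Mark describes could look roughly like this
(a sketch; "nstats" and the exact placement are assumptions, not part of
the patch, while stats[0]->tupDesc comes from the quoted code):

    /*
     * Sketch: assert that every stats entry shares the tuple descriptor
     * passed to build_sorted_items(); compiled only in assert builds.
     */
    #ifdef USE_ASSERT_CHECKING
        {
            int         i;

            for (i = 1; i < nstats; i++)
                Assert(stats[i]->tupDesc == stats[0]->tupDesc);
        }
    #endif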

>>>
>>> In src/backend/statistics/mcv.c, you have comments:
>>>
>>> + * FIXME: Single-dimensional MCV is sorted by frequency (descending). We
>>> + * should do that too, because when walking through the list we want to
>>> + * check the most frequent items first.
>>> + *
>>> + * TODO: We're using Datum (8B), even for data types (e.g. int4 or float4).
>>> + * Maybe we could save some space here, but the bytea compression should
>>> + * handle it just fine.
>>> + *
>>> + * TODO: This probably should not use the ndistinct directly (as computed 
>>> from
>>> + * the table, but rather estimate the number of distinct values in the
>>> + * table), no?
>>>
>>> Do you intend these to be fixed/implemented prior to committing this patch?
>>>
>>
>> Actually, the first FIXME is obsolete, as build_distinct_groups returns
>> the groups sorted by frequency. I'll remove that.
> 
> Ok, good.  That's the one I understood least.
> 
>> I think the rest is more a subject for discussion, so I'd need to hear
>> some feedback.
> 
> In terms of storage efficiency, you are using float8 for the frequency,
> which is consistent with what other stats work uses, but may be overkill.
> A float4 seems sufficient to me.  The extra four bytes for a float8 may be
> pretty small compared to the size of the arrays being stored, so I'm not
> sure it matters.  Also, this might have been discussed before, and I am
> not asking for a reversal of decisions the members of this mailing list
> may already have reached.
> 
> As for using arrays of something smaller than Datum, you'd need some logic
> to specify what the size is in each instance, and that probably
> complicates the code rather a lot.  Maybe someone else has a technique for
> doing that cleanly?
> 

Note that this is not about storage efficiency. The comment is before
statext_mcv_build, so it's actually related to in-memory representation.
If you look into statext_mcv_serialize, it does use typlen to only copy
the number of bytes needed for each column.

>>>
>>> Further down in function statext_mcv_build, you have two loops, the first
>>> allocating memory and the second initializing the memory.  There is no
>>> clear reason why this must be done in two loops.  I tried combining the
>>> two loops into one, and it worked just fine, but did not look any cleaner
>>> to me.  Feel free to disregard this paragraph if you like it better the
>>> way you currently have it organized.
>>>
>>
>> I did it this way because of readability. I don't think this is a major
>> efficiency issue, as the maximum number of items is fairly limited, and
>> it happens only once at the end of the MCV list build (and the sorts and
>> comparisons are likely much more CPU expensive).
> 
> I defer to your judgement here.  It seems fine the way you did it.
> 
>>> Further down in statext_mcv_deserialize, you have some elogs which might
>>> need to be ereports.  It is unclear to me whether you consider these
>>> deserialize error cases to be "can't happen" type errors.  If so, you
>>> might add that fact to the comments rather than changing the elogs to
>>> ereports.

Re: Code cleanup patch submission for extended_stats.c

2017-11-25 Thread Mark Dilger

> On Nov 25, 2017, at 2:05 PM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>> It looks to me like Alvaro introduced this in the original version of the
>> file which was created in commit 7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b.
>> Grep'ing through the code base, it seems the following would be more
>> consistent with how these initializations are handled elsewhere:
> 
>>memset(nulls, 1, sizeof(nulls));
>>memset(replaces, 0, sizeof(replaces));
>>memset(values, 0, sizeof(values));
> 
> +1.  I'd be inclined to use "false" and "true" for the init values of
> the boolean arrays, too.

Done.

mark



0001_extended_stats_sizeof.patch.2
Description: Binary data




Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Mark Dilger

> On Nov 25, 2017, at 3:33 PM, Tomas Vondra  
> wrote:
> 
> 
> 
> On 11/25/2017 10:01 PM, Mark Dilger wrote:
>> 
>>> On Nov 18, 2017, at 12:28 PM, Tomas Vondra  
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> Attached is an updated version of the patch, adopting the psql describe
>>> changes introduced by 471d55859c11b.
>> 
>> Hi Tomas,
>> 
>> In src/backend/statistics/dependencies.c, you have introduced a comment:
>> 
>> +   /*
>> +* build an array of SortItem(s) sorted using the multi-sort support
>> +*
>> +* XXX This relies on all stats entries pointing to the same tuple
>> +* descriptor. Not sure if that might not be the case.
>> +*/
>> 
>> Would you mind explaining that a bit more for me?  I don't understand
>> exactly what you mean here, but it sounds like the sort of thing that
>> needs to be clarified/fixed before it can be committed.  Am I
>> misunderstanding this?
>> 
> 
> The call right after that comment is
> 
>items = build_sorted_items(numrows, rows, stats[0]->tupDesc,
>   mss, k, attnums_dep);
> 
> That method processes an array of tuples, and the structure is defined
> by "tuple descriptor" (essentially a list of attribute info - data type,
> length, ...). We get that from stats[0] and assume all the entries point
> to the same tuple descriptor. That's generally safe assumption, I think,
> because all the stats entries relate to columns from the same table.

Right, I got that, and tried mocking up some code to test that in an Assert.
I did not pursue that far enough to reach any conclusion, however.  You
seem to be indicating in the comment some uncertainty about whether the
assumption is safe.  Do we need to dig into that further?

>> 
>> In src/backend/statistics/mcv.c, you have comments:
>> 
>> + * FIXME: Single-dimensional MCV is sorted by frequency (descending). We
>> + * should do that too, because when walking through the list we want to
>> + * check the most frequent items first.
>> + *
>> + * TODO: We're using Datum (8B), even for data types (e.g. int4 or float4).
>> + * Maybe we could save some space here, but the bytea compression should
>> + * handle it just fine.
>> + *
>> + * TODO: This probably should not use the ndistinct directly (as computed 
>> from
>> + * the table, but rather estimate the number of distinct values in the
>> + * table), no?
>> 
>> Do you intend these to be fixed/implemented prior to committing this patch?
>> 
> 
> Actually, the first FIXME is obsolete, as build_distinct_groups returns
> the groups sorted by frequency. I'll remove that.

Ok, good.  That's the one I understood least.

> I think the rest is more a subject for discussion, so I'd need to hear
> some feedback.

In terms of storage efficiency, you are using float8 for the frequency, which
is consistent with what other stats work uses, but may be overkill.  A float4
seems sufficient to me.  The extra four bytes for a float8 may be pretty small
compared to the size of the arrays being stored, so I'm not sure it matters.
Also, this might have been discussed before, and I am not asking for a
reversal of decisions the members of this mailing list may already have
reached.

As for using arrays of something smaller than Datum, you'd need some logic to
specify what the size is in each instance, and that probably complicates the
code rather a lot.  Maybe someone else has a technique for doing that cleanly?

>> 
>> Further down in function statext_mcv_build, you have two loops, the first
>> allocating memory and the second initializing the memory.  There is no
>> clear reason why this must be done in two loops.  I tried combining the two
>> loops into one, and it worked just fine, but did not look any cleaner to
>> me.  Feel free to disregard this paragraph if you like it better the way
>> you currently have it organized.
>> 
> 
> I did it this way because of readability. I don't think this is a major
> efficiency issue, as the maximum number of items is fairly limited, and
> it happens only once at the end of the MCV list build (and the sorts and
> comparisons are likely much more CPU expensive).

I defer to your judgement here.  It seems fine the way you did it.

>> Further down in statext_mcv_deserialize, you have some elogs which might
>> need to be ereports.  It is unclear to me whether you consider these
>> deserialize error cases to be "can't happen" type errors.  If so, you might
>> add that fact to the comments rather than changing the elogs to ereports.
>> 
> 
> I might be missing something, but why would ereport be more appropriate
> than elog? Ultimately, there's not much difference between elog(ERROR)
> and ereport(ERROR) - both will cause a failure.

I understand project policy to allow elog for error conditions that will be
reported in "can't happen" type situations, similar to how an Assert would be
used.  For conditions that can happen through (mis)use by the user, ereport is
appropriate.  Not knowing whether you thought these elogs were reporting
conditions that a user could cause, I did not know if you should change them
to ereports, or if you should just add a brief comment along the lines of
/* should not be possible */.

Re: [HACKERS] More stats about skipped vacuums

2017-11-25 Thread Michael Paquier
On Sun, Nov 26, 2017 at 12:34 AM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Sat, Nov 25, 2017 at 12:55 AM, Robert Haas  wrote:
>>> I think that's a good thing to worry about.   In the past, Tom has
>>> expressed reluctance to make stats tables that have a row per table
>>> any wider at all, IIRC.
>
>> Tom, any opinions to offer here? pg_stat_all_tables is currently at 22
>> columns (this takes two full lines on my terminal with a font size at
>> 13). With the first patch of what's proposed on this thread there
>> would be 24 columns. Perhaps it would be time to split the vacuum
>> statistics into a new view like pg_stat_tables_vacuum or similar?
>
> My concern is not with the width of any view --- you can always select a
> subset of columns if a view is too wide for your screen.  The issue is the
> number of counters in the stats collector's state.  We know, without any
> doubt, that widening PgStat_StatTabEntry causes serious pain to people
> with large numbers of tables; and they have no recourse when we do it.
> So the bar to adding more counters is very high IMO.  I won't say never,
> but I do doubt that something like skipped vacuums should make the cut.

I am not arguing about skipped vacuum data here (I don't think much of
it, by the way), but about the number of index scans done by the last
vacuum or autovacuum. That helps in tuning autovacuum_work_mem and
maintenance_work_mem. Is the bar still too high for that?
-- 
Michael



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Tomas Vondra
Hi,

On 11/25/2017 09:23 PM, Mark Dilger wrote:
> 
>> On Nov 18, 2017, at 12:28 PM, Tomas Vondra  
>> wrote:
>>
>> Hi,
>>
>> Attached is an updated version of the patch, adopting the psql describe
>> changes introduced by 471d55859c11b.
>>
>> regards
>>
>> -- 
>> Tomas Vondra  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> <0001-multivariate-MCV-lists.patch.gz><0002-multivariate-histograms.patch.gz>
> 
> Hello Tomas,
> 
> After applying both your patches, I get a warning:
> 
> histogram.c:1284:10: warning: taking the absolute value of unsigned type 
> 'uint32' (aka 'unsigned int') has no effect [-Wabsolute-value]
> delta = fabs(data->numrows);
> ^
> histogram.c:1284:10: note: remove the call to 'fabs' since unsigned values 
> cannot be negative
> delta = fabs(data->numrows);
> ^~~~
> 1 warning generated.
> 

Hmm, yeah. The fabs() call is unnecessary, and probably a remnant from
some previous version where the field was not uint32.

I wonder why you're getting the warning and I don't, though. What
compiler are you using?

> 
> Looking closer at this section, there is some odd integer vs. floating point
> arithmetic happening that is not necessarily wrong, but might be needlessly
> inefficient:
> 
> delta = fabs(data->numrows);
> split_value = values[0].value;
> 
> for (i = 1; i < data->numrows; i++)
> {
> if (values[i].value != values[i - 1].value)
> {
> /* are we closer to splitting the bucket in half? */
> if (fabs(i - data->numrows / 2.0) < delta)
> {
> /* let's assume we'll use this value for the split */
> split_value = values[i].value;
> delta = fabs(i - data->numrows / 2.0);
> nrows = i;
> }
> }
> }
> 
> I'm not sure the compiler will be able to optimize out the recomputation of
> data->numrows / 2.0 each time through the loop, since the compiler might not
> be able to prove to itself that data->numrows does not get changed.  Perhaps
> you should compute it just once prior to entering the outer loop, store it
> in a variable of integer type, round 'delta' off and store in an integer,
> and do integer comparisons within the loop?  Just a thought
> 

Yeah, that's probably right. But I wonder if the loop is needed at all,
or whether we should start at i=(data->numrows/2.0) instead, and walk to
the closest change of value in both directions. That would probably save
more CPU than computing numrows/2.0 only once.

The other issue in that block of code seems to be that we compare the
values using simple inequality. That probably works for passbyval data
types, but we should use proper comparator (e.g. compare_datums_simple).
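
For reference, the variant Mark suggests above, with the midpoint hoisted
out of the loop, would read roughly like this (untested sketch; the names
follow the quoted snippet, and Tomas's point about using a proper
comparator instead of != still applies):

    double      half = data->numrows / 2.0;     /* computed once */
    double      delta = data->numrows;          /* numrows >= 0, no fabs() needed */
    int         i;

    split_value = values[0].value;

    for (i = 1; i < data->numrows; i++)
    {
        if (values[i].value != values[i - 1].value &&
            fabs(i - half) < delta)
        {
            /* this value splits the bucket closer to the middle */
            split_value = values[i].value;
            delta = fabs(i - half);
            nrows = i;
        }
    }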

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Code cleanup patch submission for extended_stats.c

2017-11-25 Thread Tom Lane
Mark Dilger  writes:
> It looks to me like Alvaro introduced this in the original version of the
> file which was created in commit 7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b.
> Grep'ing through the code base, it seems the following would be more
> consistent with how these initializations are handled elsewhere:

> memset(nulls, 1, sizeof(nulls));
> memset(replaces, 0, sizeof(replaces));
> memset(values, 0, sizeof(values));

+1.  I'd be inclined to use "false" and "true" for the init values of
the boolean arrays, too.

regards, tom lane



pgbench - add \if support

2017-11-25 Thread Fabien COELHO


This patch adds \if support to pgbench, similar to psql's version added 
in March.


This patch brings a consistent set of features especially when combined 
with two other patches already in the (slow) CF process:


 - https://commitfest.postgresql.org/10/596/ .. /15/985/
   adds support for booleans expressions (comparisons, logical
   operators, ...). This enhanced expression engine would be useful
   to allow client-side expression in psql.

 - https://commitfest.postgresql.org/10/669/ .. /15/669/
   adds support for \gset, so that pgbench can interact with a database
   and extract something into a variable, instead of discarding it.

This patch adds a \if construct so that an expression on variables, 
possibly with data coming from the database, can change the behavior of a 
script.


For instance, the following script, which uses features from the three 
patches, would implement TPC-B per spec (not "tpcb-like", but really as 
specified).


  \set tbid  random(1, :scale)
  \set tid  10 * (:tbid - 1) + random(1, 10)
  -- client in same branch as teller at 85%
  \if :scale = 1 OR random(0, 99) < 85
-- same branch
\set bid  :tbid
  \else
-- different branch
\set bid  1 + (:tbid + random(1, :scale - 1)) % :scale
  \endif
  \set aid  :bid * 10 + random(1, 10)
  \set delta  random(-99, 99)
  BEGIN;
  UPDATE pgbench_accounts
SET abalance = abalance + :delta WHERE aid = :aid
RETURNING abalance AS balance \gset
  UPDATE pgbench_tellers
SET tbalance = tbalance + :delta WHERE tid = :tid;
  UPDATE pgbench_branches
SET bbalance = bbalance + :delta WHERE bid = :bid;
  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
  END;

The patch moves the conditional stack infrastructure from psql to 
fe_utils, so that it is available to both psql & pgbench.


A partial evaluation is performed to detect structural errors (eg missing 
endif, else after else...) when the script is parsed, so that such errors 
cannot occur when a script is running.


A new automaton state is added to quickly step over false branches.

TAP tests ensure reasonable coverage of the feature.
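
To illustrate the kind of parse-time structural check described above
(missing \endif, \else after \else, ...), here is a minimal standalone
sketch; it is not the fe_utils conditional-stack API, just the idea:

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { M_IF, M_ELIF, M_ELSE, M_ENDIF, M_OTHER } Meta;

    /* Return true when \if/\elif/\else/\endif nest correctly. */
    static bool
    check_structure(const Meta *cmds, int n)
    {
        bool    seen_else[64];      /* arbitrary nesting limit for the sketch */
        int     depth = 0;

        for (int i = 0; i < n; i++)
        {
            switch (cmds[i])
            {
                case M_IF:
                    if (depth >= 64)
                        return false;
                    seen_else[depth++] = false;
                    break;
                case M_ELIF:
                    if (depth == 0 || seen_else[depth - 1])
                        return false;   /* \elif without \if, or after \else */
                    break;
                case M_ELSE:
                    if (depth == 0 || seen_else[depth - 1])
                        return false;   /* \else without \if, or duplicated */
                    seen_else[depth - 1] = true;
                    break;
                case M_ENDIF:
                    if (depth == 0)
                        return false;   /* \endif without \if */
                    depth--;
                    break;
                default:
                    break;
            }
        }
        return depth == 0;              /* no missing \endif */
    }

    int
    main(void)
    {
        Meta    ok[] = {M_IF, M_OTHER, M_ELSE, M_OTHER, M_ENDIF};
        Meta    bad[] = {M_IF, M_ELSE, M_ELSE, M_ENDIF};

        printf("ok:  %d\n", check_structure(ok, 5));
        printf("bad: %d\n", check_structure(bad, 4));
        return 0;
    }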

--
Fabien

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 94b495e..715d096 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -895,6 +895,21 @@ pgbench  options  d
   
 
   
+   
+\if expression
+\elif expression
+\else
+\endif
+
+  
+  This group of commands implements nestable conditional blocks,
+  similarly to psql's .
+  Conditional expressions are identical to those with \set,
+  with non-zero values interpreted as true.
+ 
+
+   
+

 
  \set varname expression
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index fce7e3a..e88f782 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2148,7 +2148,7 @@ hello 10
   
 
 
-  
+  
 \if expression
 \elif expression
 \else
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index bd96eae..9d1f90b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -32,6 +32,7 @@
 #endif			/* ! WIN32 */
 
 #include "postgres_fe.h"
+#include "fe_utils/conditional.h"
 
 #include "getopt_long.h"
 #include "libpq-fe.h"
@@ -270,6 +271,9 @@ typedef enum
 	 * and we enter the CSTATE_SLEEP state to wait for it to expire. Other
 	 * meta-commands are executed immediately.
 	 *
+	 * CSTATE_SKIP_COMMAND for conditional branches which are not executed,
+	 * quickly skip commands that do not need any evaluation.
+	 *
 	 * CSTATE_WAIT_RESULT waits until we get a result set back from the server
 	 * for the current command.
 	 *
@@ -279,6 +283,7 @@ typedef enum
 	 * command counter, and loops back to CSTATE_START_COMMAND state.
 	 */
 	CSTATE_START_COMMAND,
+	CSTATE_SKIP_COMMAND,
 	CSTATE_WAIT_RESULT,
 	CSTATE_SLEEP,
 	CSTATE_END_COMMAND,
@@ -308,6 +313,7 @@ typedef struct
 	PGconn	   *con;			/* connection handle to DB */
 	int			id;/* client No. */
 	ConnectionStateEnum state;	/* state machine's current state. */
+	ConditionalStack cstack;	/* enclosing conditionals state */
 
 	int			use_file;		/* index in sql_script for this client */
 	int			command;		/* command number in script */
@@ -365,7 +371,11 @@ typedef enum MetaCommand
 	META_SET,	/* \set */
 	META_SETSHELL,/* \setshell */
 	META_SHELL,	/* \shell */
-	META_SLEEP	/* \sleep */
+	META_SLEEP,	/* \sleep */
+	META_IF,	/* \if */
+	META_ELIF,	/* \elif */
+	META_ELSE,	/* \else */
+	META_ENDIF	/* \endif */
 } MetaCommand;
 
 typedef enum QueryMode
@@ -1304,6 +1314,26 @@ coerceToDouble(PgBenchValue *pval, double *dval)
 	}
 }
 
+/*
+ * Return true or false for conditional purposes.
+ * Non zero numerical values are true.
+ */
+static bool
+valueTruth(PgBenchValue *pval)
+{
+	switch 

Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Mark Dilger

> On Nov 18, 2017, at 12:28 PM, Tomas Vondra  
> wrote:
> 
> Hi,
> 
> Attached is an updated version of the patch, adopting the psql describe
> changes introduced by 471d55859c11b.

Hi Tomas,

In src/backend/statistics/dependencies.c, you have introduced a comment:

+   /*
+* build an array of SortItem(s) sorted using the multi-sort support
+*
+* XXX This relies on all stats entries pointing to the same tuple
+* descriptor. Not sure if that might not be the case.
+*/

Would you mind explaining that a bit more for me?  I don't understand exactly
what you mean here, but it sounds like the sort of thing that needs to be
clarified/fixed before it can be committed.  Am I misunderstanding this?


In src/backend/statistics/mcv.c, you have comments:

+ * FIXME: Single-dimensional MCV is sorted by frequency (descending). We
+ * should do that too, because when walking through the list we want to
+ * check the most frequent items first.
+ *
+ * TODO: We're using Datum (8B), even for data types (e.g. int4 or float4).
+ * Maybe we could save some space here, but the bytea compression should
+ * handle it just fine.
+ *
+ * TODO: This probably should not use the ndistinct directly (as computed from
+ * the table, but rather estimate the number of distinct values in the
+ * table), no?

Do you intend these to be fixed/implemented prior to committing this patch?


Further down in function statext_mcv_build, you have two loops, the first
allocating memory and the second initializing the memory.  There is no clear
reason why this must be done in two loops.  I tried combining the two loops
into one, and it worked just fine, but did not look any cleaner to me.  Feel
free to disregard this paragraph if you like it better the way you currently
have it organized.


Further down in statext_mcv_deserialize, you have some elogs which might need
to be ereports.  It is unclear to me whether you consider these deserialize
error cases to be "can't happen" type errors.  If so, you might add that fact
to the comments rather than changing the elogs to ereports.


mark




Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Mark Dilger

> On Nov 18, 2017, at 12:28 PM, Tomas Vondra  
> wrote:
> 
> Hi,
> 
> Attached is an updated version of the patch, adopting the psql describe
> changes introduced by 471d55859c11b.
> 
> regards
> 
> -- 
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> <0001-multivariate-MCV-lists.patch.gz><0002-multivariate-histograms.patch.gz>

In src/backend/statistics/mcv.c, you have a few typos:

+ * there bo be a lot of duplicate values. But perhaps that's not true and we

+   /* Now it's safe to access the dimention info. */

+* Nowe we know the total expected MCV size, including all the pieces

+   /* pased by reference, but fixed length (name, tid, ...) */


In src/include/statistics/statistics.h, there is some extraneous whitespace
that needs removing.


mark


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Mark Dilger

> On Nov 18, 2017, at 12:28 PM, Tomas Vondra  
> wrote:
> 
> Hi,
> 
> Attached is an updated version of the patch, adopting the psql describe
> changes introduced by 471d55859c11b.
> 
> regards
> 
> -- 
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> <0001-multivariate-MCV-lists.patch.gz><0002-multivariate-histograms.patch.gz>

Hello Tomas,

After applying both your patches, I get a warning:

histogram.c:1284:10: warning: taking the absolute value of unsigned type 
'uint32' (aka 'unsigned int') has no effect [-Wabsolute-value]
delta = fabs(data->numrows);
^
histogram.c:1284:10: note: remove the call to 'fabs' since unsigned values 
cannot be negative
delta = fabs(data->numrows);
^~~~
1 warning generated.


Looking closer at this section, there is some odd integer vs. floating point
arithmetic happening that is not necessarily wrong, but might be needlessly
inefficient:

delta = fabs(data->numrows);
split_value = values[0].value;

for (i = 1; i < data->numrows; i++)
{
if (values[i].value != values[i - 1].value)
{
/* are we closer to splitting the bucket in half? */
if (fabs(i - data->numrows / 2.0) < delta)
{
/* let's assume we'll use this value for the split */
split_value = values[i].value;
delta = fabs(i - data->numrows / 2.0);
nrows = i;
}
}
}

I'm not sure the compiler will be able to optimize out the recomputation of
data->numrows / 2.0 each time through the loop, since the compiler might not
be able to prove to itself that data->numrows does not get changed.  Perhaps
you should compute it just once prior to entering the outer loop, store it in
a variable of integer type, round 'delta' off and store in an integer, and do
integer comparisons within the loop?  Just a thought


mark


Code cleanup patch submission for extended_stats.c

2017-11-25 Thread Mark Dilger
Hackers, Alvaro,

In src/backend/statistics/extended_stats.c, in statext_store, there is a 
section:

Datum   values[Natts_pg_statistic_ext];
bool    nulls[Natts_pg_statistic_ext];
bool    replaces[Natts_pg_statistic_ext];

memset(nulls, 1, Natts_pg_statistic_ext * sizeof(bool));
memset(replaces, 0, Natts_pg_statistic_ext * sizeof(bool));
memset(values, 0, Natts_pg_statistic_ext * sizeof(Datum));

It looks to me like Alvaro introduced this in the original version of the file
which was created in commit 7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b.
Grep'ing through the code base, it seems the following would be more
consistent with how these initializations are handled elsewhere:

Datum   values[Natts_pg_statistic_ext];
bool    nulls[Natts_pg_statistic_ext];
bool    replaces[Natts_pg_statistic_ext];

memset(nulls, 1, sizeof(nulls));
memset(replaces, 0, sizeof(replaces));
memset(values, 0, sizeof(values));

Patch attached as 0001_extended_stats_sizeof.patch.1


mark



0001_extended_stats_sizeof.patch.1
Description: Binary data


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Mark Dilger

> On Nov 18, 2017, at 12:28 PM, Tomas Vondra  
> wrote:
> 
> Hi,
> 
> Attached is an updated version of the patch, adopting the psql describe
> changes introduced by 471d55859c11b.
> 
> regards
> 
> -- 
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> <0001-multivariate-MCV-lists.patch.gz><0002-multivariate-histograms.patch.gz>

Hello Tomas,

In 0002-multivariate-histograms.patch, src/include/nodes/relation.h,
struct StatisticExtInfo, you change:

-   char        kind;       /* statistic kind of this entry */
+   int         kinds;      /* statistic kinds of this entry */

to have 'kinds' apparently be a bitmask, based on reading how you use
this in the code.  The #defines just below the struct give the four bits
to be used,

#define STATS_EXT_INFO_NDISTINCT1
#define STATS_EXT_INFO_DEPENDENCIES 2
#define STATS_EXT_INFO_MCV  4
#define STATS_EXT_INFO_HISTOGRAM8

except that nothing in the file indicates that this is so.  Perhaps a comment
could be added here mentioning that 'kinds' is a bitmask, and that these
#defines are related?
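
To make the suggestion concrete, code reading the proposed field would
presumably look something like this (a sketch; "info" is an assumed
StatisticExtInfo pointer, and the macro names come from the quoted patch):

    /* 'kinds' is a bitmask of STATS_EXT_INFO_* flags */
    bool        has_mcv = (info->kinds & STATS_EXT_INFO_MCV) != 0;
    bool        has_histogram = (info->kinds & STATS_EXT_INFO_HISTOGRAM) != 0;

    if (has_mcv && has_histogram)
    {
        /* both statistics kinds were built for this object */
    }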

mark


Re: [HACKERS] More stats about skipped vacuums

2017-11-25 Thread Tom Lane
Robert Haas  writes:
> Of course, the other obvious question is whether we really need a
> consistent snapshot, because that's bound to be pretty expensive even
> if you eliminate the I/O cost.  Taking a consistent snapshot across
> all 100,000 tables in the database even if we're only ever going to
> access 5 of those tables doesn't seem like a good or scalable design.

Mumble.  It's a property I'm pretty hesitant to give up, especially
since the stats views have worked like that since day one.  It's
inevitable that weakening that guarantee would break peoples' queries,
probably subtly.

What we need is a way to have a consistent snapshot without implementing
it by copying 100,000 tables' worth of data for every query.  Hmm, I heard
of a technique called MVCC once ...

regards, tom lane



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Mark Dilger

> On Nov 18, 2017, at 12:28 PM, Tomas Vondra  
> wrote:
> 
> Hi,
> 
> Attached is an updated version of the patch, adopting the psql describe
> changes introduced by 471d55859c11b.
> 
> regards
> 
> -- 
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> <0001-multivariate-MCV-lists.patch.gz><0002-multivariate-histograms.patch.gz>

Thanks, Tomas, again for your work on this feature.

Applying just the 0001-multivariate-MCV-lists.patch to the current master, and
then extending the stats_ext.sql test as follows, I am able to trigger an error,
"ERROR:  operator 4294934272 is not a valid ordering operator".

 

diff --git a/src/test/regress/sql/stats_ext.sql 
b/src/test/regress/sql/stats_ext.sql
index e9902ced5c..5083dc05e6 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -402,4 +402,22 @@ EXPLAIN (COSTS OFF)
 EXPLAIN (COSTS OFF)
  SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL AND c IS NULL;
 
-RESET random_page_cost;
+DROP TABLE mcv_lists; 
+ 
+CREATE TABLE mcv_lists (
+a NUMERIC[],
+   b NUMERIC[]
+);
+CREATE STATISTICS mcv_lists_stats (mcv) ON a, b FROM mcv_lists;
+INSERT INTO mcv_lists (a, b)
+   (SELECT array_agg(gs::numeric) AS a, array_agg(gs::numeric) AS b
+   FROM generate_series(1,1000) gs
+   );
+ANALYZE mcv_lists;
+INSERT INTO mcv_lists (a, b)
+   (SELECT array_agg(gs::numeric) AS a, array_agg(gs::numeric) AS b
+   FROM generate_series(1,1000) gs
+   );
+ANALYZE mcv_lists;
+
+DROP TABLE mcv_lists;

Which gives me the following regression.diffs:

*** /Users/mark/master/postgresql/src/test/regress/expected/stats_ext.out	2017-11-25 08:06:37.0 -0800
--- /Users/mark/master/postgresql/src/test/regress/results/stats_ext.out	2017-11-25 08:10:18.0 -0800
***
*** 721,724 
   Index Cond: ((a IS NULL) AND (b IS NULL))
  (5 rows)
  
! RESET random_page_cost;
--- 721,741 
   Index Cond: ((a IS NULL) AND (b IS NULL))
  (5 rows)
  
! DROP TABLE mcv_lists;
! CREATE TABLE mcv_lists (
! a NUMERIC[],
!   b NUMERIC[]
! );
! CREATE STATISTICS mcv_lists_stats (mcv) ON a, b FROM mcv_lists;
! INSERT INTO mcv_lists (a, b)
!   (SELECT array_agg(gs::numeric) AS a, array_agg(gs::numeric) AS b
!   FROM generate_series(1,1000) gs
!   );
! ANALYZE mcv_lists;
! INSERT INTO mcv_lists (a, b)
!   (SELECT array_agg(gs::numeric) AS a, array_agg(gs::numeric) AS b
!   FROM generate_series(1,1000) gs
!   );
! ANALYZE mcv_lists;
! ERROR:  operator 4294934272 is not a valid ordering operator
! DROP TABLE mcv_lists;

==




Re: [HACKERS] More stats about skipped vacuums

2017-11-25 Thread Robert Haas
On Sat, Nov 25, 2017 at 10:34 AM, Tom Lane  wrote:
> If we could get rid of the copy-to-a-temporary-file technology for
> transferring the stats collector's data to backends, then this problem
> would probably vanish or at least get a lot less severe.  But that seems
> like a nontrivial project.  With the infrastructure we have today, we
> could probably keep the stats tables in a DSM segment; but how would
> a backend get a consistent snapshot of them?

I suppose the obvious approach is to have a big lock around the
statistics data proper; this could be taken in shared mode to take a
snapshot or in exclusive mode to update statistics.  In addition,
create one or more queues where statistics messages can be enqueued in
lieu of updating the main statistics data directly.  If that doesn't
perform well enough, you could keep two copies of the statistics, A
and B.  At any given time, one copy is quiescent and the other copy is
being updated.  Periodically, at a time when we know that nobody is
taking a snapshot of the statistics, they reverse roles.
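
As a toy illustration of that A/B idea (not PostgreSQL code; all names are
invented, and the real PgStat_StatTabEntry is much wider), the role swap
could look like:

    #include <stdatomic.h>

    /* Toy per-table counters, standing in for the stats entries. */
    typedef struct StatsCopy
    {
        long        n_tup_ins;
        long        n_dead_tup;
    } StatsCopy;

    typedef struct StatsPair
    {
        StatsCopy   copies[2];
        atomic_int  quiescent;      /* index of the copy readers may snapshot */
    } StatsPair;

    /* Readers take their consistent snapshot from the quiescent copy. */
    const StatsCopy *
    stats_snapshot(StatsPair *p)
    {
        return &p->copies[atomic_load(&p->quiescent)];
    }

    /* Writers update the other copy; roles flip only when no snapshot is in flight. */
    void
    stats_swap_roles(StatsPair *p)
    {
        atomic_store(&p->quiescent, 1 - atomic_load(&p->quiescent));
    }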

Of course, the other obvious question is whether we really need a
consistent snapshot, because that's bound to be pretty expensive even
if you eliminate the I/O cost.  Taking a consistent snapshot across
all 100,000 tables in the database even if we're only ever going to
access 5 of those tables doesn't seem like a good or scalable design.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-25 Thread Robert Haas
On Wed, Nov 22, 2017 at 8:36 AM, Amit Kapila  wrote:
>> remove-memory-leak-protection-v1.patch removes the memory leak
>> protection that Tom installed upon discovering that the original
>> version of tqueue.c leaked memory like crazy.  I think that it
>> shouldn't do that any more, courtesy of
>> 6b65a7fe62e129d5c2b85cd74d6a91d8f7564608.  Assuming that's correct, we
>> can avoid a whole lot of tuple copying in Gather Merge and a much more
>> modest amount of overhead in Gather.  Since my test case exercised
>> Gather Merge, this bought ~400 ms or so.
>
> I think Tom didn't installed memory protection in nodeGatherMerge.c
> related to an additional copy of tuple.  I could see it is present in
> the original commit of Gather Merge
> (355d3993c53ed62c5b53d020648e4fbcfbf5f155).  Tom just avoided applying
> heap_copytuple to a null tuple in his commit
> 04e9678614ec64ad9043174ac99a25b1dc45233a.  I am not sure whether you
> are just referring to the protection Tom added in nodeGather.c,  If
> so, it is not clear from your mail.

Yes, that's what I mean.  What got done for Gather Merge was motivated
by what Tom did for Gather.  Sorry for not expressing the idea more
precisely.

> I think we don't need the additional copy of tuple in
> nodeGatherMerge.c and your patch seem to be doing the right thing.
> However, after your changes, it looks slightly odd that
> gather_merge_clear_tuples() is explicitly calling heap_freetuple for
> the tuples allocated by tqueue.c, maybe we can add some comment.  It
> was much clear before this patch as nodeGatherMerge.c was explicitly
> copying the tuples that it is freeing.

OK, I can add a comment about that.

> Isn't it better to explicitly call gather_merge_clear_tuples in
> ExecEndGatherMerge so that we can free the memory for tuples allocated
> in this node rather than relying on reset/free of MemoryContext in
> which those tuples are allocated?

Generally relying on reset/free of a MemoryContext is cheaper.
Typically you only want to free manually if the freeing would
otherwise not happen soon enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] More stats about skipped vacuums

2017-11-25 Thread Tom Lane
Michael Paquier  writes:
> On Sat, Nov 25, 2017 at 12:55 AM, Robert Haas  wrote:
>> I think that's a good thing to worry about.   In the past, Tom has
>> expressed reluctance to make stats tables that have a row per table
>> any wider at all, IIRC.

> Tom, any opinions to offer here? pg_stat_all_tables is currently at 22
> columns (this takes two full lines on my terminal with a font size at
> 13). With the first patch of what's proposed on this thread there
> would be 24 columns. Perhaps it would be time to split the vacuum
> statistics into a new view like pg_stat_tables_vacuum or similar?

My concern is not with the width of any view --- you can always select a
subset of columns if a view is too wide for your screen.  The issue is the
number of counters in the stats collector's state.  We know, without any
doubt, that widening PgStat_StatTabEntry causes serious pain to people
with large numbers of tables; and they have no recourse when we do it.
So the bar to adding more counters is very high IMO.  I won't say never,
but I do doubt that something like skipped vacuums should make the cut.

If we could get rid of the copy-to-a-temporary-file technology for
transferring the stats collector's data to backends, then this problem
would probably vanish or at least get a lot less severe.  But that seems
like a nontrivial project.  With the infrastructure we have today, we
could probably keep the stats tables in a DSM segment; but how would
a backend get a consistent snapshot of them?

regards, tom lane



Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

2017-11-25 Thread Robert Haas
On Tue, Nov 21, 2017 at 5:38 PM, Peter Geoghegan  wrote:
>> That having been said, I think the place where our plans most commonly
>> go wrong is where we incorrectly estimate the number of tuples by
>> multiple orders of magnitude - 100x is common, 1000x is common, a
>> million x is not uncommon, even a billion x is not unheard-of.  And I
>> don't think there's any way to make a hash join happy if it thinks
>> it's going to need 1 batch and it ends up needing a million batches.
>
> What about dynamic role reversal? That could make a big difference.

In the best case it's great, but it looks to me like there are a lot
of thorny problems.  For example, imagine giant_table INNER JOIN
bigger_than_we_thought.  The latter table will be chosen as the inner
table and that won't work out very well, but there's no way to know
whether switching the sides will be any better except to try reading a
bunch of rows from giant_table and seeing whether it turns out to be a
lot smaller than we thought.  To do that, we'll need to dump the hash
table we started to build on the original inner side out to disk so
that we can free up enough work_mem to try building a hash table on
the other side.  When the giant table turns out to actually be giant,
we'll need to go back to the original plan, which means dumping out
the tuples from the second hash table and reloading the tuples from
the first one.  So we end up just doing a bunch of extra work for
nothing.  I think that this scenario - wasting effort trying to switch
the sides only to give up - will happen frequently.

In the multi-batch case, there seems to be a little more hope of doing
something clever.  We're anyway writing out most of both inputs out to
tapes.  If we were willing to write ALL of both inputs out to tapes,
then we could decide - perhaps even separately for each batch - which
side to load into the hash table.  Of course, that adds a lot of
incremental I/O unless the number of batches is large (e.g. if we had
only 4 batches, writing 4/4 of the data instead of 3/4 is a 33%
increase, but if we had 64 batches, writing 64/64 of the data instead
of 63/64 doesn't matter a lot, probably).  And it leaves out a few
important details, like the fact that what fits in the hash table is
used to choose the number of batches in the first place, and that we
write the whole of one side to tapes before starting on the other
side.  I don't know how to handle those problems but it seems like it
might be possible to come up with something clever, at least for
certain cases.

> I agree that it would be enormously valuable if we could make
> estimates much better, so I think that I understand why you emphasize
> it. But, I don't think that there are any good ideas for improving
> join selectivity that don't involve expert DBA knowledge, or
> novel/risky techniques for feedback to the system about column
> redundancy/correlation, etc. These do not seem like scalable
> approaches, and so they don't particularly appeal to me as projects.
> I'd be happy to be shown to be wrong about this.

Yeah, I agree that it's a hard problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] More stats about skipped vacuums

2017-11-25 Thread Michael Paquier
On Sat, Nov 25, 2017 at 12:55 AM, Robert Haas  wrote:
> On Tue, Nov 21, 2017 at 2:09 AM, Kyotaro HORIGUCHI
>  wrote:
>> Yes, my concern here is how many columns we can allow in a stats
>> view. I think I'm a bit too worried about that.
>
> I think that's a good thing to worry about.   In the past, Tom has
> expressed reluctance to make stats tables that have a row per table
> any wider at all, IIRC.

Tom, any opinions to offer here? pg_stat_all_tables is currently at 22
columns (this takes two full lines on my terminal with a font size at
13). With the first patch of what's proposed on this thread there
would be 24 columns. Perhaps it would be time to split the vacuum
statistics into a new view like pg_stat_tables_vacuum or similar?
-- 
Michael