Re: [HACKERS] Custom compression methods

2017-11-27 Thread Tomas Vondra
Hi,

On 11/27/2017 04:52 PM, Ildus Kurbangaliev wrote:
> ...
>
> Hi. This looks like a serious bug, but I couldn't reproduce it yet.
> Did you upgrade an old database, or did this bug happen after
> inserting all the data into a new database? I tried using your
> 'archie' tool to download mailing lists and insert them into a
> database, but couldn't catch any errors.
> 

I can trigger it pretty reliably with these steps:

git checkout f65d21b258085bdc8ef2cc282ab1ff12da9c595c
patch -p1 < ~/custom_compression_methods_v6.patch
./configure --enable-debug --enable-cassert \
 CFLAGS="-fno-omit-frame-pointer -O0 -DRANDOMIZE_ALLOCATED_MEMORY" \
 --prefix=/home/postgres/pg-compress
make -s clean && make -s -j4 install
cd contrib/
make -s clean && make -s -j4 install

export PATH=/home/postgres/pg-compress/bin:$PATH
pg_ctl -D /mnt/raid/pg-compress init
pg_ctl -D /mnt/raid/pg-compress -l compress.log start
createdb archie
cd ~/archie/sql/
psql archie < create.sql

~/archie/bin/load.py --workers 4 --db archie */* > load.log 2>&1


I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first tried
without it, and it seemed to work fine). If that's the case, I bet there
is a palloc somewhere that should have been palloc0, or something like that.
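
For illustration, this is the kind of bug that flag tends to expose (a
made-up sketch, not the actual code from the patch):

#include "postgres.h"

/* Made-up struct, just to illustrate the failure mode. */
typedef struct CompressionOptions
{
    int         nopts;
    char      **values;
} CompressionOptions;

static CompressionOptions *
make_options(void)
{
    /*
     * palloc() leaves the chunk uninitialized. With
     * -DRANDOMIZE_ALLOCATED_MEMORY it gets filled with garbage, so any
     * code reading "values" before it is assigned sees a random pointer
     * and crashes. palloc0() would zero the field and hide the bug.
     */
    CompressionOptions *opts = palloc(sizeof(CompressionOptions));

    opts->nopts = 0;
    /* BUG: opts->values is never initialized */
    return opts;
}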

If you still can't reproduce that, I may give you access to this machine
so that you can debug it there.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-27 Thread Tomas Vondra
Hi,

Attached is an updated version of the patch series, fixing the issues
reported by Mark Dilger:

1) Fix fabs() issue in histogram.c.

2) Do not rely on extra_data being StdAnalyzeData, and instead look up
the LT operator explicitly. This also adds a simple regression test to
make sure ANALYZE on arrays works fine, but perhaps we should invent
some simple queries too.

3) I've removed / clarified some of the comments mentioned by Mark.

4) I haven't changed how the statistics kinds are defined in relation.h,
but I agree there should be a comment explaining how STATS_EXT_INFO_*
relate to StatisticExtInfo.kinds.

5) The most significant change happened in histograms. There used to be
two structures for histograms:

  - MVHistogram - expanded (no deduplication etc.), result of histogram
build and never used for estimation

  - MVSerializedHistogram - deduplicated to save space, produced from
MVHistogram before storing in pg_statistic_ext, and then used for
estimation

So there wasn't really any reason to expose the "non-serialized" version
outside histogram.c. It was just confusing and unnecessary, so I've
moved MVHistogram to histogram.c (and renamed it to MVHistogramBuild),
and renamed MVSerializedHistogram to MVHistogram. And the same for the
MVBucket stuff.

So now we only deal with MVHistogram everywhere, except in histogram.c.

6) I've also made MVHistogram include a varlena header directly (and be
packed as a bytea), which allows us to store it without having to call
any serialization functions.

I guess we should do (5) and (6) for the MCV lists too, as it seems more
convenient than the current approach. And perhaps even for the
statistics added to 9.6 (it does not change the storage format).
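
To make (6) a bit more concrete, the struct now starts with the varlena
length word itself, roughly like this (the fields after vl_len_ are
illustrative, not the exact definition):

typedef struct MVHistogram
{
    int32       vl_len_;        /* varlena header (do not touch directly!) */
    uint32      magic;          /* magic constant marker */
    uint32      type;           /* type of histogram */
    uint32      nbuckets;       /* number of buckets */
    uint32      ndimensions;    /* number of dimensions */
    /* deduplicated values and buckets follow, packed into the bytea */
} MVHistogram;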


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-multivariate-MCV-lists.patch.gz
Description: application/gzip


0002-multivariate-histograms.patch.gz
Description: application/gzip


simplehash: tb->sizemask = 0

2017-11-27 Thread Tomas Vondra
Hi,

I'm a bit puzzled by this code in SH_COMPUTE_PARAMETERS:

if (tb->size == SH_MAX_SIZE)
tb->sizemask = 0;
else
tb->sizemask = tb->size - 1;

Doesn't that mean that with SH_MAX_SIZE we end up with sizemask being 0
(i.e. no bits set)? At least that's what I get from

printf("%#x\n", (unsigned int)0);

That would mean SH_INITIAL_BUCKET/SH_NEXT/SH_PREV can only ever return
bucket 0, no?

I don't think we're building hash tables with 2^32 buckets, though.
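
For context, as far as I can tell the bucket selection boils down to
masking the hash with sizemask, so a zero mask pins everything to
bucket 0. A standalone sketch of that arithmetic:

#include <stdint.h>
#include <stdio.h>

/*
 * Minimal sketch of the bucket arithmetic: SH_INITIAL_BUCKET (and the
 * SH_NEXT / SH_PREV wrap-around) effectively compute (hash & sizemask),
 * so a sizemask of 0 maps every hash to bucket 0.
 */
static uint32_t
initial_bucket(uint64_t hash, uint32_t sizemask)
{
    return (uint32_t) (hash & sizemask);
}

int
main(void)
{
    printf("%u\n", initial_bucket(0xdeadbeefcafebabeULL, 0));           /* always 0 */
    printf("%u\n", initial_bucket(0xdeadbeefcafebabeULL, 0xffffffffU)); /* low 32 bits */
    return 0;
}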


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Tomas Vondra
Hi,

On 11/25/2017 09:23 PM, Mark Dilger wrote:
> 
>> On Nov 18, 2017, at 12:28 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> 
>> wrote:
>>
>> Hi,
>>
>> Attached is an updated version of the patch, adopting the psql describe
>> changes introduced by 471d55859c11b.
>>
>> regards
>>
>> -- 
>> Tomas Vondra  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> <0001-multivariate-MCV-lists.patch.gz><0002-multivariate-histograms.patch.gz>
> 
> Hello Tomas,
> 
> After applying both your patches, I get a warning:
> 
> histogram.c:1284:10: warning: taking the absolute value of unsigned type 
> 'uint32' (aka 'unsigned int') has no effect [-Wabsolute-value]
> delta = fabs(data->numrows);
> ^
> histogram.c:1284:10: note: remove the call to 'fabs' since unsigned values 
> cannot be negative
> delta = fabs(data->numrows);
> ^~~~
> 1 warning generated.
> 

Hmm, yeah. The fabs() call is unnecessary, and probably a remnant from
some previous version where the field was not uint32.

I wonder why you're getting the warning and I don't, though. What
compiler are you using?

> 
> Looking closer at this section, there is some odd integer vs. floating point 
> arithmetic happening
> that is not necessarily wrong, but might be needlessly inefficient:
> 
> delta = fabs(data->numrows);
> split_value = values[0].value;
> 
> for (i = 1; i < data->numrows; i++)
> {
> if (values[i].value != values[i - 1].value)
> {
> /* are we closer to splitting the bucket in half? */
> if (fabs(i - data->numrows / 2.0) < delta)
> {
> /* let's assume we'll use this value for the split */
> split_value = values[i].value;
> delta = fabs(i - data->numrows / 2.0);
> nrows = i;
> }
> }
> }
> 
> I'm not sure the compiler will be able to optimize out the recomputation of 
> data->numrows / 2.0
> each time through the loop, since the compiler might not be able to prove to 
> itself that data->numrows
> does not get changed.  Perhaps you should compute it just once prior to 
> entering the outer loop,
> store it in a variable of integer type, round 'delta' off and store in an 
> integer, and do integer comparisons
> within the loop?  Just a thought
> 

Yeah, that's probably right. But I wonder if the loop is needed at all,
or whether we should start at i=(data->numrows/2.0) instead, and walk to
the closest change of value in both directions. That would probably save
more CPU than computing numrows/2.0 only once.
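
Something like this untested sketch is what I have in mind (it keeps the
simple pass-by-value comparison from the current code, so the comparator
issue below still applies):

/* Minimal stand-in for the sort items in the snippet above. */
typedef struct SortValue
{
    long        value;          /* stands in for the Datum being compared */
} SortValue;

/*
 * Start at the midpoint and walk outwards until the closest change of
 * value, i.e. the split point nearest to numrows/2. Returns -1 if all
 * values are equal and no split is possible.
 */
static int
find_split_index(const SortValue *values, int numrows)
{
    int         mid = numrows / 2;

    for (int d = 0; d <= mid; d++)
    {
        if (mid - d > 0 && values[mid - d].value != values[mid - d - 1].value)
            return mid - d;
        if (mid + d < numrows && values[mid + d].value != values[mid + d - 1].value)
            return mid + d;
    }
    return -1;
}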

The other issue in that block of code seems to be that we compare the
values using simple inequality. That probably works for passbyval data
types, but we should use proper comparator (e.g. compare_datums_simple).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-25 Thread Tomas Vondra


On 11/26/2017 02:17 AM, Mark Dilger wrote:
> 
>> On Nov 25, 2017, at 3:33 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> 
>> wrote:
>>
>>
>>
>> On 11/25/2017 10:01 PM, Mark Dilger wrote:
>>>
>>>> On Nov 18, 2017, at 12:28 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> 
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Attached is an updated version of the patch, adopting the psql describe
>>>> changes introduced by 471d55859c11b.
>>>
>>> Hi Tomas,
>>>
>>> In src/backend/statistics/dependencies.c, you have introduced a comment:
>>>
>>> +   /*
>>> +* build an array of SortItem(s) sorted using the multi-sort support
>>> +*
>>> +* XXX This relies on all stats entries pointing to the same tuple
>>> +* descriptor. Not sure if that might not be the case.
>>> +*/
>>>
>>> Would you mind explaining that a bit more for me?  I don't understand 
>>> exactly what
>>> you mean here, but it sounds like the sort of thing that needs to be 
>>> clarified/fixed
>>> before it can be committed.  Am I misunderstanding this?
>>>
>>
>> The call right after that comment is
>>
>>items = build_sorted_items(numrows, rows, stats[0]->tupDesc,
>>   mss, k, attnums_dep);
>>
>> That method processes an array of tuples, and the structure is defined
>> by "tuple descriptor" (essentially a list of attribute info - data type,
>> length, ...). We get that from stats[0] and assume all the entries point
>> to the same tuple descriptor. That's generally safe assumption, I think,
>> because all the stats entries relate to columns from the same table.
> 
> Right, I got that, and tried mocking up some code to test that in an Assert.
> I did not pursue that far enough to reach any conclusion, however.  You
> seem to be indicating in the comment some uncertainty about whether the
> assumption is safe.  Do we need to dig into that further?
> 

I don't think it's worth the effort, really. I don't think we can really
get mismatching tuple descriptors here - that could only happen with
columns coming from different tables, or something similarly obscure.

>>>
>>> In src/backend/statistics/mcv.c, you have comments:
>>>
>>> + * FIXME: Single-dimensional MCV is sorted by frequency (descending). We
>>> + * should do that too, because when walking through the list we want to
>>> + * check the most frequent items first.
>>> + *
>>> + * TODO: We're using Datum (8B), even for data types (e.g. int4 or float4).
>>> + * Maybe we could save some space here, but the bytea compression should
>>> + * handle it just fine.
>>> + *
>>> + * TODO: This probably should not use the ndistinct directly (as computed 
>>> from
>>> + * the table, but rather estimate the number of distinct values in the
>>> + * table), no?
>>>
>>> Do you intend these to be fixed/implemented prior to committing this patch?
>>>
>>
>> Actually, the first FIXME is obsolete, as build_distinct_groups returns
>> the groups sorted by frequency. I'll remove that.
> 
> Ok, good.  That's the one I understood least.
> 
>> I think the rest is more a subject for discussion, so I'd need to hear
>> some feedback.
> 
> In terms of storage efficiency, you are using float8 for the frequency, which 
> is consistent
> with what other stats work uses, but may be overkill.  A float4 seems 
> sufficient to me.
> The extra four bytes for a float8 may be pretty small compared to the size of 
> the arrays
> being stored, so I'm not sure it matters.  Also, this might have been 
> discussed before,
> and I am not asking for a reversal of decisions the members of this mailing 
> list may
> already have reached.
> 
> As for using arrays of something smaller than Datum, you'd need some logic to 
> specify
> what the size is in each instance, and that probably complicates the code 
> rather a lot.
> Maybe someone else has a technique for doing that cleanly? 
> 

Note that this is not about storage efficiency. The comment is before
statext_mcv_build, so it's actually related to in-memory representation.
If you look into statext_mcv_serialize, it does use typlen to only copy
the number of bytes needed for each column.

>>>
>>> Further down in function statext_mcv_build, you have two loops, the first 
>>> allocating
>>> memory and the second initializing the memory.  There is no cle

Re: [HACKERS] CUBE seems a bit confused about ORDER BY

2017-11-29 Thread Tomas Vondra


On 11/29/2017 06:13 AM, Michael Paquier wrote:
> On Tue, Nov 21, 2017 at 7:07 AM, Alexander Korotkov
> <a.korot...@postgrespro.ru> wrote:
>> On Mon, Nov 20, 2017 at 1:59 PM, Alexander Korotkov
>> <a.korot...@postgrespro.ru> wrote:
>>>
>>> On Mon, Nov 20, 2017 at 4:09 AM, Tomas Vondra
>>> <tomas.von...@2ndquadrant.com> wrote:
>>>>
>>>> Seems fine to me, although perhaps it should be split into two parts.
>>>> One with the cube_coord_llur fixes, and then g_cube_distance changes
>>>> adding support for negative coordinates.
>>>
>>>
>>> Thank you for your feedback.  I'll split this patch into two.
>>
>>
>> Please, find two patches attached.
> 
> This got no reviews, so moved to next CF.
> 

That is somewhat inaccurate, I believe. I won't claim it received enough
reviews, but it certainly received some. And splitting it into two does
not invalidate that, I guess.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-30 Thread Tomas Vondra


On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
> On Thu, 30 Nov 2017 00:30:37 +0100
> Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>
> ...
>
>> I can imagine other interesting use cases - for example values in
>> JSONB columns often use the same "schema" (keys, nesting, ...), so
>> can I imagine building a "dictionary" of JSON keys for the whole
>> column ...
>>
>> Ildus, is this a use case you've been aiming for, or were you aiming
>> to use the new API in a different way?
> 
> Thank you for such a good overview. I agree that pglz is pretty good as
> a general compression method, and there's no point in changing it, at
> least for now.
> 
> I see a few useful use cases for compression methods: special
> compression methods for int[] and timestamp[] for time series, and yes,
> dictionaries for jsonb, for which I have already created an
> extension (https://github.com/postgrespro/jsonbd). It's working and
> giving promising results.
> 

I understand the reluctance to put everything into core, particularly
for complex patches that evolve quickly. Also, not having to put
everything into core is kinda why we have extensions.

But perhaps some of the simpler cases would be good candidates for core,
making it possible to test the feature?

>>
>> I wonder if the patch can be improved to handle this use case better.
>> For example, it requires knowledge the actual data type, instead of
>> treating it as opaque varlena / byte array. I see tsvector compression
>> does that by checking typeid in the handler.
>>
>> But that fails for example with this example
>>
>> db=# create domain x as tsvector;
>> CREATE DOMAIN
>> db=# create table t (a x compressed ts1);
>> ERROR:  unexpected type 28198672 for tsvector compression handler
>>
>> which means it's a few bricks shy of properly supporting domains. But I
>> wonder if this should instead be specified in CREATE COMPRESSION
>> METHOD. I mean, something like
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>> TYPE tsvector;
>>
>> When the type is not specified, it applies to all varlena values. Otherwise
>> only to that type. Also, why not allow setting the compression as
>> the default method for a data type, e.g.
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>> TYPE tsvector DEFAULT;
>>
>> would automatically add 'COMPRESSED ts1' to all tsvector columns in
>> new CREATE TABLE commands.
> 
> The initial version of the patch contained ALTER syntax that changed the
> compression method for whole types, but I have decided to remove
> that functionality for now because the patch is already quite complex
> and it could be added later as a separate patch.
> 
> The syntax was:
> ALTER TYPE <type> SET COMPRESSION <method>;
> 
> Specifying the supported type for the compression method is a good idea.
> Maybe the following syntax would be better?
> 
> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
> tsvector_compression_handler;
> 

Understood. Good to know you've considered it, and I agree it doesn't
need to be there from the start (which makes the patch simpler).

>>
>> BTW do you expect the tsvector compression to be generally useful, or
>> is it meant to be used only by the tests? If generally useful,
>> perhaps it should be created in pg_compression by default. If only
>> for tests, maybe it should be implemented in an extension in contrib
>> (thus also serving as example how to implement new methods).
>>
>> I haven't thought about the JSONB use case very much, but I suppose
>> that could be done using the configure/drop methods. I mean,
>> allocating the dictionary somewhere (e.g. in a table created by an
>> extension?). The configure method gets the Form_pg_attribute record,
>> so that should be enough I guess.
>>
>> But the patch is not testing those two methods at all, which seems
>> like something that needs to be addresses before commit. I don't
>> expect a full-fledged JSONB compression extension, but something
>> simple that actually exercises those methods in a meaningful way.
> 
> I will move tsvector_compression_handler to a separate extension in
> the next version. I added it more as an example, but it could also be
> used to achieve better compression for tsvectors. Tests on the mailing
> lists database ('archie' tables):
> 
> usual compression:
> 
> maillists=# select body_tsvector, subject_tsvector into t1 from
> messages; SELECT 1114213
> maillists=# select pg_size_pretty(pg_total_relation_size('t1'));
>

Re: [HACKERS] Custom compression methods

2017-11-24 Thread Tomas Vondra
Hi,

I ran into another issue - after inserting some data into a table with a
tsvector column (without any compression defined), I can no longer read
the data.

This is what I get in the console:

db=# select max(md5(body_tsvector::text)) from messages;
ERROR:  cache lookup failed for compression options 6432

and the stack trace looks like this:

Breakpoint 1, get_cached_compression_options (cmoptoid=6432) at
tuptoaster.c:2563
2563elog(ERROR, "cache lookup failed for compression 
options %u",
cmoptoid);
(gdb) bt
#0  get_cached_compression_options (cmoptoid=6432) at tuptoaster.c:2563
#1  0x004bf3da in toast_decompress_datum (attr=0x2b44148) at
tuptoaster.c:2390
#2  0x004c0c1e in heap_tuple_untoast_attr (attr=0x2b44148) at
tuptoaster.c:225
#3  0x0083f976 in pg_detoast_datum (datum=) at
fmgr.c:1829
#4  0x008072de in tsvectorout (fcinfo=0x2b41e00) at tsvector.c:315
#5  0x005fae00 in ExecInterpExpr (state=0x2b414b8,
econtext=0x2b25ab0, isnull=) at execExprInterp.c:1131
#6  0x0060bdf4 in ExecEvalExprSwitchContext
(isNull=0x7fe9bd37 "", econtext=0x2b25ab0, state=0x2b414b8) at
../../../src/include/executor/executor.h:299

It seems the VARATT_IS_CUSTOM_COMPRESSED incorrectly identifies the
value as custom-compressed for some reason.

Not sure why, but the tsvector column is populated by a trigger that
simply does

NEW.body_tsvector
:= to_tsvector('english', strip_replies(NEW.body_plain));

If needed, the complete tool is here:

https://bitbucket.org/tvondra/archie


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-23 Thread Tomas Vondra
Hi,

On 11/23/2017 10:38 AM, Ildus Kurbangaliev wrote:
> On Tue, 21 Nov 2017 18:47:49 +0100
> Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> 
>>>   
>>
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE:  drop cascades to 2 other objects
>> DETAIL:  drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
> 
> Should be fixed in the attached patch. I've changed your extension a
> little bit according to the changes in the new patch (also attached).
> 

Hmm, this seems to have fixed it, but only in one direction. Consider this:

create table t_pglz (v text);
create table t_lz4 (v text compressed lz4);

insert into t_pglz select repeat(md5(i::text),300)
from generate_series(1,10) s(i);

insert into t_lz4 select repeat(md5(i::text),300)
from generate_series(1,10) s(i);

\d+

 Schema |  Name  | Type  | Owner | Size  | Description
--------+--------+-------+-------+-------+-------------
 public | t_lz4  | table | user  | 12 MB |
 public | t_pglz | table | user  | 18 MB |
(2 rows)

truncate t_pglz;
insert into t_pglz select * from t_lz4;

\d+

 Schema |  Name  | Type  | Owner | Size  | Description
--------+--------+-------+-------+-------+-------------
 public | t_lz4  | table | user  | 12 MB |
 public | t_pglz | table | user  | 18 MB |
(2 rows)

which is fine. But in the other direction, this happens

truncate t_lz4;
insert into t_lz4 select * from t_pglz;

 \d+
   List of relations
 Schema |  Name  | Type  | Owner | Size  | Description
++---+---+---+-
 public | t_lz4  | table | user  | 18 MB |
 public | t_pglz | table | user  | 18 MB |
(2 rows)

which means the data is still pglz-compressed. That's rather strange, I
guess, and it should compress the data using the compression method set
for the target table instead.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra


On 12/01/2017 08:48 PM, Alvaro Herrera wrote:
> Ildus Kurbangaliev wrote:
> 
>> If the table is big, decompression could take an eternity. That's why I
>> decided to only disable it, and the data could still be decompressed
>> using the compression options.
>>
>> My idea was to keep the compression options forever, since there will
>> not be many of them in one database. Still, that requires that the
>> extension is not removed.
>>
>> I will try to find a way to recompress the data first in case it moves
>> to another table.
> 
> I think what you should do is add a dependency between a column that
> compresses using a method, and that method.  So the method cannot be
> dropped and leave compressed data behind.  Since the method is part of
> the extension, the extension cannot be dropped either.  If you ALTER
> the column so that it uses another compression method, then the table is
> rewritten and the dependency is removed; once you do that for all the
> columns that use the compression method, the compression method can be
> dropped.
> 

+1 to do the rewrite, just like for other similar ALTER TABLE commands

>
> Maybe our dependency code needs to be extended in order to support this.
> I think the current logic would drop the column if you were to do "DROP
> COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature.
> I'd rather have DROP COMPRESSION always fail instead until no columns
> use it.  Let's hear other's opinions on this bit though.
> 

Why should this behave differently compared to data types? Seems quite
against POLA, if you ask me ...

If you want to remove the compression, you can do the SET NOT COMPRESSED
(or whatever syntax we end up using), and then DROP COMPRESSION METHOD.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra
On 12/01/2017 08:20 PM, Robert Haas wrote:
> On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> It has very little impact on this patch, as it has nothing to do with
>> columnar storage. That is, each value is compressed independently.
> 
> I understand that this patch is not about columnar storage, but I 
> think the idea that we may want to operate on the compressed data 
> directly is not only applicable to that case.
> 

Yeah. To clarify, my point was that column stores benefit from
compressing many values at once, and then operating on this compressed
vector. That is not what this patch is doing (or can do), of course.

But I certainly do agree that if the compression can be integrated into
the data type, allowing processing on compressed representation, then
that will beat whatever this patch is doing, of course ...

>>
>> I agree with these thoughts in general, but I'm not quite sure
>> what is your conclusion regarding the patch.
> 
> I have not reached one. Sometimes I like to discuss problems before 
> deciding what I think. :-)
> 

That's lame! Let's make decisions without discussion ;-)

>
> It does seem to me that the patch may be aiming at a relatively narrow
> target in a fairly large problem space, but I don't know whether to
> label that as short-sightedness or prudent incrementalism.
> 

I don't know either. I don't think people will start switching their
text columns to lz4 just because they can, or because they get 4% space
reduction compared to pglz.

But the ability to build per-column dictionaries seems quite powerful, I
guess. And I don't think that can be easily built directly into JSONB,
because we don't have a way to provide information about the column
(i.e. how would you fetch the correct dictionary?).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-14 Thread Tomas Vondra
On 12/14/2017 04:21 PM, Robert Haas wrote:
> On Wed, Dec 13, 2017 at 5:10 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>>> 2. If several data types can benefit from a similar approach, it has
>>> to be separately implemented for each one.
>>
>> I don't think the current solution improves that, though. If you
>> want to exploit internal features of individual data types, it
>> pretty much requires code customized to every such data type.
>>
>> For example you can't take the tsvector compression and just slap
>> it on tsquery, because it relies on knowledge of internal tsvector
>> structure. So you need separate implementations anyway.
> 
> I don't think that's necessarily true. Certainly, it's true that
> *if* tsvector compression depends on knowledge of internal tsvector 
> structure, *then* that you can't use the implementation for anything 
> else (this, by the way, means that there needs to be some way for a 
> compression method to reject being applied to a column of a data
> type it doesn't like).

I believe such a dependency (on implementation details) is pretty much the
main benefit of datatype-aware compression methods. If you don't rely on
such an assumption, then I'd say it's a general-purpose compression method.

> However, it seems possible to imagine compression algorithms that can
> work for a variety of data types, too. There might be a compression
> algorithm that is theoretically a general-purpose algorithm but has
> features which are particularly well-suited to, say, JSON or XML
> data, because it looks for word boundaries to decide on what strings
> to insert into the compression dictionary.
> 

Can you give an example of such an algorithm? Because I haven't seen such
an example, and I find arguments based on hypothetical compression methods
somewhat suspicious.

FWIW I'm not against considering such compression methods, but OTOH it
may not be such a great primary use case to drive the overall design.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-18 Thread Tomas Vondra


On 12/17/2017 04:32 AM, Robert Haas wrote:
> On Thu, Dec 14, 2017 at 12:23 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Can you give an example of such algorithm? Because I haven't seen such
>> example, and I find arguments based on hypothetical compression methods
>> somewhat suspicious.
>>
>> FWIW I'm not against considering such compression methods, but OTOH it
>> may not be such a great primary use case to drive the overall design.
> 
> Well it isn't, really.  I am honestly not sure what we're arguing
> about at this point.  I think you've agreed that (1) opening avenues
> for extensibility is useful, (2) substitution a general-purpose
> compression algorithm could be useful, and (3) having datatype
> compression that is enabled through TOAST rather than built into the
> datatype might sometimes be desirable.  That's more than adequate
> justification for this proposal, whether half-general compression
> methods exist or not.  I am prepared to concede that there may be no
> useful examples of such a thing.
> 

I don't think we're arguing - we're discussing whether a proposed patch is
the right design for solving the relevant use cases.

I personally am not quite convinced about that, for the reason I tried
to explain in my previous messages. I see it as a poor alternative to
compression built into the data type. I do like the idea of compression
with external dictionary, however.

But don't forget that it's not me in this thread - it's my evil twin,
moonlighting as Mr. Devil's lawyer ;-)

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-13 Thread Tomas Vondra


On 12/13/2017 05:55 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
>> On 12/13/2017 01:54 AM, Robert Haas wrote:
> 
>>> 3. Compression is only applied to large-ish values.  If you are just
>>> making the data type representation more compact, you probably want to
>>> apply the new representation to all values.  If you are compressing in
>>> the sense that the original data gets smaller but harder to interpret,
>>> then you probably only want to apply the technique where the value is
>>> already pretty wide, and maybe respect the user's configured storage
>>> attributes.  TOAST knows about some of that kind of stuff.
>>
>> Good point. One such parameter that I really miss is compression level.
>> I can imagine tuning it through CREATE COMPRESSION METHOD, but it does
>> not seem quite possible with compression happening in a datatype.
> 
> Hmm, actually isn't that the sort of thing that you would tweak using a
> column-level option instead of a compression method?
>   ALTER TABLE ALTER COLUMN SET (compression_level=123)
> The only thing we need for this is to make tuptoaster.c aware of the
> need to check for a parameter.
> 

Wouldn't that require some universal compression level, shared by all
supported compression algorithms? I don't think there is such a thing.

Defining it should not be extremely difficult, although I'm sure there
will be some cumbersome cases. For example, what if an algorithm "a"
supports compression levels 0-10, and algorithm "b" only supports 0-3?

You may define 11 "universal" compression levels, and map the four
levels for "b" onto those (somehow). But then everyone has to understand
how that "universal" mapping is defined.

Another issue is that there are algorithms without a compression level
(e.g. pglz does not have one, AFAICS), or with a somewhat different
definition (lz4 does not have levels, and instead has "acceleration",
which may be an arbitrary positive integer, so not really compatible
with a "universal" compression level).

So to me the

ALTER TABLE ALTER COLUMN SET (compression_level=123)

seems more like an unnecessary hurdle ...

>>> I don't think TOAST needs to be entirely transparent for the
>>> datatypes.  We've already dipped our toe in the water by allowing some
>>> operations on "short" varlenas, and there's really nothing to prevent
>>> a given datatype from going further.  The OID problem you mentioned
>>> would presumably be solved by hard-coding the OIDs for any built-in,
>>> privileged compression methods.
>>
>> Stupid question, but what do you mean by "short" varlenas?
> 
> Those are varlenas with 1-byte header rather than the standard 4-byte
> header.
> 

OK, that's what I thought. But that is still pretty transparent to the
data types, no?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-19 Thread Tomas Vondra
Hi,

a couple of months ago there was a proposal / patch with similar goals,
from Andrey Borodin. See these two threads:

[1]
https://www.postgresql.org/message-id/flat/843D96CC-7C55-4296-ADE0-622A7ACD4978%40yesql.se#843d96cc-7c55-4296-ade0-622a7acd4...@yesql.se

[2]
https://www.postgresql.org/message-id/flat/449A7A9D-DB58-40F8-B80E-4C4EE7DB47FD%40yandex-team.ru#449a7a9d-db58-40f8-b80e-4c4ee7db4...@yandex-team.ru

I recall there was a long discussion regarding which of the approaches
is the *right* one (although that certainly depends on other factors).

On 12/18/2017 11:18 AM, Anastasia Lubennikova wrote:
> In this thread I would like to raise the issue of incremental backups.
> What I suggest in this thread, is to choose one direction, so we can
> concentrate our community efforts.
> There is already a number of tools, which provide incremental backup.
> And we can see five principle techniques they use:
> 
> 1. Use file modification time as a marker that the file has changed.
> 2. Compute file checksums and compare them.
> 3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN.
> 4. Scan all WAL files in the archive since the previous backup and
> collect information about changed pages.
> 5. Track page changes on the fly. (ptrack)
> 
> They can also be combined to achieve better performance.
> 
> My personal candidate is the last one, since it provides page-level
> granularity, while most of the others approaches can only do file-level
> incremental backups or require additional reads or calculations.
> 

I share the opinion that options 1 and 2 are not particularly
attractive, due to either unreliability, or not really saving that much
CPU and I/O.

I'm not quite sure about 3, because it doesn't really explain how it
would be done - it seems to assume we'd have to reread the files. I'll
get back to this.

Option 4 has some very interesting features. Firstly, it relies on WAL and
so should not require any new code (and it could, in theory, support
even older PostgreSQL releases, for example). Secondly, this can be
offloaded to a different machine. And it does even support additional
workflows - e.g. "given these two full backups and the WAL, generate an
incremental backup between them".
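
Just to sketch what (4) might look like: with an XLogReaderState set up
over the archived WAL, the per-record loop is basically the following
(reader-API signatures differ a bit across versions, and
remember_changed_block() is a hypothetical helper, so treat this as the
shape of the loop rather than working code):

static void
collect_changed_blocks(XLogReaderState *xlogreader, XLogRecPtr start_lsn)
{
    char       *errormsg;
    XLogRecord *record = XLogReadRecord(xlogreader, start_lsn, &errormsg);

    while (record != NULL)
    {
        for (int block_id = 0; block_id <= xlogreader->max_block_id; block_id++)
        {
            RelFileNode rnode;
            ForkNumber  forknum;
            BlockNumber blkno;

            /* each decoded record carries its block references */
            if (XLogRecGetBlockTag(xlogreader, block_id, &rnode, &forknum, &blkno))
                remember_changed_block(rnode, forknum, blkno);  /* hypothetical */
        }

        /* InvalidXLogRecPtr means "continue from the previous record" */
        record = XLogReadRecord(xlogreader, InvalidXLogRecPtr, &errormsg);
    }
}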

So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.


> In a nutshell, using ptrack patch, PostgreSQL can track page changes on
> the fly. Each time a relation page is updated, this page is marked in a
> special PTRACK bitmap fork for this relation. As one page requires just
> one bit in the PTRACK fork, such bitmaps are quite small. Tracking
> implies some minor overhead on the database server operation but speeds
> up incremental backups significantly.
> 

That makes sense, I guess, although I find the "ptrack" acronym somewhat
cryptic, and we should probably look for something more descriptive. But
the naming can wait, I guess.

My main question is whether a bitmap is the right data structure. It seems
to cause a lot of complexity later, because it needs to be reset once in a
while, you have to be careful about failed incremental backups, etc.

What if we tracked the LSN for each page instead? Sure, it'd require 64x
more space (1 bit -> 8 bytes per page), but it would not require resets,
you could take incremental backups from an arbitrary point in time, and
so on. That seems like a significant improvement to me, so perhaps the
space requirements are justified (still just 1MB per 1GB segment, with
the default 8kB pages).
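
To make that a bit more concrete, here is a purely illustrative sketch of
such a per-page LSN map and the backup decision (no such fork exists
today, the names are made up):

/*
 * One "last change" LSN per heap page, stored in a separate map fork.
 * An incremental backup then copies exactly the pages whose stored LSN
 * is newer than the start LSN of the reference backup - which can be
 * the previous backup or an arbitrarily old one, with no reset needed.
 */
typedef struct PageLsnMap
{
    uint32      npages;                             /* pages covered */
    XLogRecPtr  page_lsn[FLEXIBLE_ARRAY_MEMBER];    /* last-change LSN per page */
} PageLsnMap;

static bool
page_needs_backup(const PageLsnMap *map, BlockNumber blkno,
                  XLogRecPtr prev_backup_lsn)
{
    return map->page_lsn[blkno] > prev_backup_lsn;
}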

> Detailed overview of the implementation with all pros and cons,
> patches and links to the related threads you can find here:
> 
> https://wiki.postgresql.org/index.php?title=PTRACK_incremental_backups.
> 
> Patches for v 10.1 and v 9.6 are attached.
> Since ptrack is basically just an API for use in backup tools, it is
> impossible to test the patch independently.
> Now it is integrated with our backup utility, called pg_probackup. You can
> find it herehttps://github.com/postgrespro/pg_probackup
> Let me know if you find the documentation too long and complicated, I'll
> write a brief How-to for ptrack backups.
> 
> Spoiler: Please consider this patch and README as a proof of concept. It
> can be improved in some ways, but in its current state PTRACK is a
> stable prototype, reviewed and tested well enough to find many
> non-trivial corner cases and subtle problems. And any discussion of
> change track algorithm must be aware of them. Feel free to share your
> concerns and point out any shortcomings of the idea or the implementation.
> 

Thanks for the proposal and patch!

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

2017-12-19 Thread Tomas Vondra


On 12/19/2017 08:38 PM, Mark Dilger wrote:
> 
>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> 
>> wrote:
>>
>> Hi,
>>
>> Apparently there was some minor breakage due to duplicate OIDs, so here
>> is the patch series updated to current master.
>>
>> regards
>>
>> -- 
>> Tomas Vondra  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> <0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>
> 
> 
> After applying these four patches to my copy of master, the regression
> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.
> 

D'oh! There was an incorrect OID referenced in pg_opclass, which was
also used by the satisfies_hash_partition() function. Fixed patches
attached.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz
Description: application/gzip


0002-BRIN-bloom-indexes.patch.gz
Description: application/gzip


0003-BRIN-multi-range-minmax-indexes.patch.gz
Description: application/gzip


0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz
Description: application/gzip


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-12-19 Thread Tomas Vondra
Hi,

On 12/19/2017 08:17 PM, Mark Dilger wrote:
> 
> I tested your latest patches on my mac os x laptop and got one test
> failure due to the results of 'explain' coming up differently.  For the 
> record,
> I followed these steps:
> 
> cd postgresql/
> git pull
> # this got my directory up to 8526bcb2df76d5171b4f4d6dc7a97560a73a5eff with 
> no local changes
> patch -p 1 < ../0001-multivariate-MCV-lists.patch
> patch -p 1 < ../0002-multivariate-histograms.patch
> ./configure --prefix=/Users/mark/master/testinstall --enable-cassert 
> --enable-tap-tests --enable-depend && make -j4 && make check-world
> 

Yeah, those steps sound about right.

Apparently this got broken by ecc27d55f4, although I don't quite
understand why - it worked fine before. Can you check whether it works
fine on 9f4992e2a9 and fails with ecc27d55f4?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: CUBE seems a bit confused about ORDER BY

2017-12-12 Thread Tomas Vondra


On 12/12/2017 01:52 PM, Alexander Korotkov wrote:
> On Tue, Dec 12, 2017 at 3:49 PM, Teodor Sigaev <teo...@sigaev.ru> wrote:
> 
> Yes, the thing is that we change the behavior of the existing ~>
> operator.  In general, this is not a good idea because it could
> affect existing users who already use this operator.
> Typically in such situation, we could leave existing operator as
> is, and invent new operator with new behavior.  However, in this
> particular case, I think there are reasons to make an exception
> to the rules.  The reasons are following:
> 1) The ~> operator was designed especially for knn gist.
> 2) Knn gist support for current behavior is broken by design and
> can't be fixed.  Most we can do to fix existing ~> operator
> behavior as is to drop knn gist support.  But then ~> operator
> would be left useless.
> 3) It doesn't seems that ~> operator have many users yet,
> because an error wasn't reported during whole release cycle.
> 
> I hope these reasons justify altering behavior of existing
> operator as an exception to the rules.  Another question is
> whether we should backpatch it.  But I think we could leave this
> decision to committer.
> 
>     I think that this patch is ready for committer.
> 
> I agree with changing the behavior of the existing ~> operator, but are
> there any objections here? The current implementation is not fixable, and
> I hope that users who use this operator will understand why we change it.
> Fortunately, the fix doesn't require changes in the system catalog.
> 
> The single remaining question is about indexes over expressions with this
> operator; they will need a reindex, which should be noted in the
> release notes.
> 
> 
> Yes.  I bet only a few users have built indexes over the ~> operator, if
> any. Asking them to reindex in the release notes seems OK to me.
> 

Is there a good way to detect such cases? Either in pg_upgrade, so that
we can print warnings, or at least manually (which would be suitable for
release notes).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-12 Thread Tomas Vondra


On 12/12/2017 10:33 PM, Robert Haas wrote:
> On Mon, Dec 11, 2017 at 2:53 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> But let me play the devil's advocate for a while and question the
>> usefulness of this approach to compression. Some of the questions were
>> mentioned in the thread before, but I don't think they got the attention
>> they deserve.
> 
> Sure, thanks for chiming in.  I think it is good to make sure we are
> discussing this stuff.
> 
>> But perhaps we should simply make it an initdb option (in which case the
>> whole cluster would simply use e.g. lz4 instead of pglz)?
>>
>> That seems like a much simpler approach - it would only require some
>> ./configure options to add --with-lz4 (and other compression libraries),
>> an initdb option to pick compression algorithm, and probably noting the
>> choice in cluster controldata.
>>
>> No dependencies tracking, no ALTER TABLE issues, etc.
>>
>> Of course, it would not allow using different compression algorithms for
>> different columns (although it might perhaps allow different compression
>> level, to some extent).
>>
>> Conclusion: If we want to offer a simple cluster-wide pglz alternative,
>> perhaps this patch is not the right way to do that.
> 
> I actually disagree with your conclusion here.   I mean, if you do it
> that way, then it has the same problem as checksums: changing
> compression algorithms requires a full dump-and-reload of the
> database, which makes it more or less a non-starter for large
> databases.  On the other hand, with the infrastructure provided by
> this patch, we can have a default_compression_method GUC that will be
> set to 'pglz' initially.  If the user changes it to 'lz4', or we ship
> a new release where the new default is 'lz4', then new tables created
> will use that new setting, but the existing stuff keeps working.  If
> you want to upgrade your existing tables to use lz4 rather than pglz,
> you can change the compression option for those columns to COMPRESS
> lz4 PRESERVE pglz if you want to do it incrementally or just COMPRESS
> lz4 to force a rewrite of an individual table.  That's really
> powerful, and I think users will like it a lot.
> 
> In short, your approach, while perhaps a little simpler to code, seems
> like it is fraught with operational problems which this design avoids.
> 

I agree the checksum-like limitations are annoying and make it
impossible to change the compression algorithm after the cluster is
initialized (although I recall a discussion about addressing that).

So yeah, if such flexibility is considered valuable/important, then the
patch is a better solution.

>> Custom datatype-aware compression (e.g. the tsvector)
>> --
>>
>> Exploiting knowledge of the internal data type structure is a promising
>> way to improve compression ratio and/or performance.
>>
>> The obvious question of course is why shouldn't this be done by the data
>> type code directly, which would also allow additional benefits like
>> operating directly on the compressed values.
>>
>> Another thing is that if the datatype representation changes in some
>> way, the compression method has to change too. So it's tightly coupled
>> to the datatype anyway.
>>
>> This does not really require any new infrastructure, all the pieces are
>> already there.
>>
>> In some cases that may not be quite possible - the datatype may not be
>> flexible enough to support alternative (compressed) representation, e.g.
>> because there are no bits available for "compressed" flag, etc.
>>
>> Conclusion: IMHO if we want to exploit the knowledge of the data type
>> internal structure, perhaps doing that in the datatype code directly
>> would be a better choice.
> 
> I definitely think there's a place for compression built right into
> the data type.  I'm still happy about commit
> 145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
> needs to be done there.  But that type of improvement and what is
> proposed here are basically orthogonal.  Having either one is good;
> having both is better.
> 

Why orthogonal?

For example, why couldn't (or shouldn't) the tsvector compression be
done by tsvector code itself? Why should we be doing that at the varlena
level (so that the tsvector code does not even know about it)?

For example we could make the datatype EXTERNAL and do the compression
on our own, using a custom algorithm. Of course, that would require
datatype-specific implementation, but tsvector_compress does that too.

It seems to me the main reason is that tsvector

Re: spgist rangetypes compiler warning (gcc 7.2.0)

2017-11-18 Thread Tomas Vondra


On 11/18/2017 10:40 PM, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> while compiling on gcc 7.2.0 (on ARM), I got this warning:
> 
>> rangetypes_spgist.c: In function 'spg_range_quad_inner_consistent':
>> rangetypes_spgist.c:559:29: warning: comparison between pointer and
>> zero character constant [-Wpointer-compare]
>>   if (in->traversalValue != (Datum) 0)
>>  ^~
> 
> Huh.  I wonder why 7.2.1 on Fedora isn't producing that warning.
> 

Not sure. Maybe Ubuntu uses different flags (on ARM)? This is what I get
from 'gcc -v' on the machine:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/7/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
7.2.0-8ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-7
--program-prefix=arm-linux-gnueabihf- --enable-shared
--enable-linker-build-id --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --libdir=/usr/lib
--enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libitm --disable-libquadmath --enable-plugin
--enable-default-pie --with-system-zlib --with-target-system-zlib
--enable-objc-gc=auto --enable-multiarch --enable-multilib
--disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16
--with-float=hard --with-mode=thumb --disable-werror --enable-multilib
--enable-checking=release --build=arm-linux-gnueabihf
--host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 7.2.0 (Ubuntu/Linaro 7.2.0-8ubuntu3)

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2017-11-18 Thread Tomas Vondra
Hi,

On 11/02/2017 08:15 PM, Tom Lane wrote:
> Haribabu Kommi <kommi.harib...@gmail.com> writes:
>> The changes are fine and now it reports proper live tuples in both
>> pg_class and stats. The other issue of continuous vacuum operation
>> leading to decrease of number of live tuples is not related to this
>> patch and that can be handled separately.
> 
> I did not like this patch too much, because it did nothing to fix
> the underlying problem of confusion between "live tuples" and "total
> tuples"; in fact, it made that worse, because now the comment on
> LVRelStats.new_rel_tuples is a lie.  And there's at least one usage
> of that field value where I think we do want total tuples not live
> tuples: where we pass it to index AM cleanup functions.  Indexes
> don't really care whether heap entries are live or not, only that
> they're there.  Plus the VACUUM VERBOSE log message that uses the
> field is supposed to be reporting total tuples not live tuples.
> 
> I hacked up the patch trying to make that better, finding along the
> way that contrib/pgstattuple shared in the confusion about what
> it was trying to count.  Results attached.
> 

Thanks for looking into this. I agree your patch is more consistent and
generally cleaner.

> However, I'm not sure we're there yet, because there remains a fairly
> nasty discrepancy even once we've gotten everyone onto the same page
> about reltuples counting just live tuples: VACUUM and ANALYZE have
> different definitions of what's "live".  In particular they do not treat
> INSERT_IN_PROGRESS and DELETE_IN_PROGRESS tuples the same.  Should we
> try to do something about that?  If so, what?  It looks like ANALYZE's
> behavior is pretty tightly constrained, judging by the comments in
> acquire_sample_rows.
> 

Yeah :-(

For the record (and for people following this thread who are too lazy to
open analyze.c and search for the comments), the problem is that
acquire_sample_rows does not count HEAPTUPLE_INSERT_IN_PROGRESS as live
(and HEAPTUPLE_DELETE_IN_PROGRESS as dead), as it assumes the transaction
will send its stats at the end.
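
Roughly, the counting rule there boils down to this (a simplified sketch
that leaves out the "inserted/deleted by our own transaction" special
cases):

switch (HeapTupleSatisfiesVacuum(&targtuple, OldestXmin, targbuffer))
{
    case HEAPTUPLE_LIVE:
        liverows += 1;
        break;

    case HEAPTUPLE_DEAD:
    case HEAPTUPLE_RECENTLY_DEAD:
        deadrows += 1;
        break;

    case HEAPTUPLE_INSERT_IN_PROGRESS:
        /* not counted - the inserting transaction will report it */
        break;

    case HEAPTUPLE_DELETE_IN_PROGRESS:
        /* still counted as live - the deleting transaction will report it */
        liverows += 1;
        break;

    default:
        break;
}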

Which is a correct assumption, but it means that when there is a
long-running transaction that inserted/deleted many rows, the reltuples
value will oscillate during VACUUM / ANALYZE runs.

ISTM we need to unify those definitions, probably so that VACUUM adopts
what acquire_sample_rows does. I mean, if ANALYZE assumes that the stats
will be updated at the end of transaction, why shouldn't VACUUM do the
same thing?

The one reason I can think of is that VACUUM is generally expected to
run longer than ANALYZE, so the assumption that large transactions
complete later is not quite reliable.

Or is there some other reason why VACUUM can't treat the in-progress
tuples the same way?

> Another problem is that it looks like CREATE INDEX will set reltuples
> to the total number of heap entries it chose to index, because that
> is what IndexBuildHeapScan counts.  Maybe we should adjust that?
> 

You mean by only counting live tuples in IndexBuildHeapRangeScan,
following whatever definition we end up using in VACUUM/ANALYZE? Seems
like a good idea, I guess, although CREATE INDEX is not as frequent as
the other commands.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: percentile value check can be slow

2017-11-18 Thread Tomas Vondra
Hi,

On 11/18/2017 10:30 PM, David Fetter wrote:
> On Sat, Nov 18, 2017 at 10:44:36AM -0500, Tom Lane wrote:
>> jotpe <jo...@posteo.de> writes:
>>> I tried to enter invalid percentile fractions, and was astonished
>>> that it seems to be checked after many work is done?
>>
>> IIRC, only the aggregate's final-function is concerned with direct
>> arguments, so it's not all that astonishing.
> 
> It may not be surprising from the point of view of a systems
> programmer, but it's pretty surprising that this check is deferred to
> many seconds in when the system has all the information it need in
> order to establish this before execution begins.
> 
> I'm not sure I see an easy way to do this check early, but it's worth
> trying on grounds of POLA violation.  I have a couple of ideas on how
> to do this, one less invasive but hinky, the other a lot more invasive
> but better overall.
> 
> Ugly Hack With Ugly Performance Consequences:
> Inject a subtransaction at the start of execution that supplies an
> empty input to the final function with the supplied direct
> arguments.
> 

I'm pretty sure you realize this is quite unlikely to get accepted.

> Bigger Lift:
> Require a separate recognizer function direct arguments and fire
> it during post-parse analysis.  Perhaps this could be called
> recognizer along with the corresponding mrecognizer.  It's not
> clear that it's sane to have separate ones, but I thought I'd
> bring it up for completeness.
> 

Is 'recognizer' an established term I should know? Is it the same as
'validator', or is it something new/different?

> Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
> Allow optional CHECK constraints in CREATE AGGREGATE for direct
> arguments.
> 

How will any of the approaches deal with something like

select percentile_cont((select array_agg(v) from p))
   within group (order by a) from t;

In this case the values are unknown after parse analysis, so I guess it
does not really address that.


FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting failures like this often enough for the "delayed
error reporting" to be an issue, perhaps there's something wrong in your
stack and you should address that instead?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] new function for tsquery creartion

2017-11-18 Thread Tomas Vondra
Hi,

On 09/13/2017 10:57 AM, Victor Drobny wrote:
> On 2017-09-09 06:03, Thomas Munro wrote:
>> Please send a rebased version of the patch for people to review and
>> test as that one has bit-rotted.
> Hello,
> Thank you for your interest. In the attachment you can find a rebased
> version (based on commit 69835bc8988812c960f4ed5aeee86b62ac73602a).
> 

I did a quick review of the patch today. The patch unfortunately no
longer applies, so I had to use an older commit from September. Please
rebase to current master.

I've only looked on the diff at this point, will do more testing once
the rebase happens.

Some comments:

1) This seems to mix multiple improvements into one patch. There's the
support for alternative query syntax, and then there are the new
operators (AROUND and <m,n>). I propose to split the patch into two or
more parts, each addressing one of those bits.

I guess there will be two or three parts - first adding the syntax,
second adding <m,n> and third adding the AROUND(n). Seems reasonable?

2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...

3) In the SGML docs, please use <literal> instead of just
quoting the values. So it should be <literal>|</literal> instead of '|'
etc. Just like in the parts describing plainto_tsquery, for example.

4) Also, I recommend adding a brief explanation what the examples do.
Right now there's just a bunch of queryto_tsquery, and the reader is
expected to understand the output. I suggest adding a sentence or two,
explaining what's happening (just like for plainto_tsquery examples).

5) I'm not sure about negative values in the <n,m> operator. I don't
find it particularly confusing - once you understand that (a <n,m> b)
means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then
negative values seem like a fairly straightforward extension.

But I guess the main question is "Is there really a demand for the new
<n,m> operator, or have we just invented it because we can?"

6) There seem to be some new constants defined, with names not following
the naming convention. I mean this

#define WAITOPERAND         1
#define WAITOPERATOR        2
#define WAITFIRSTOPERAND    3
#define WAITSINGLEOPERAND   4
#define INSIDEQUOTES        5   <-- the new one

and

#define TSPO_L_ONLY     0x01
#define TSPO_R_ONLY     0x02
#define TSPO_BOTH       0x04
#define TS_NOT_EXAC     0x08    <-- the new one

Perhaps that's OK, but it seems a bit inconsistent.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] pg_basebackup --progress output for batch execution

2017-11-18 Thread Tomas Vondra


On 11/10/2017 02:32 PM, Martín Marqués wrote:
> Hi,
> 
> Thanks for having a look at this patch.
> 
> 2017-11-09 20:55 GMT-03:00 Jeff Janes <jeff.ja...@gmail.com>:
>> On Fri, Sep 29, 2017 at 4:00 PM, Martin Marques
>> <martin.marq...@2ndquadrant.com> wrote:
>>>
>>> Hi,
>>>
>>> Some time ago I had to work on a system where I was cloning a standby
>>> using pg_basebackup, that didn't have screen or tmux. For that reason I
>>> redirected the output to a file and ran it with nohup.
>>>
>>> I normally (always actually ;) ) run pg_basebackup with --progress and
>>> --verbose so I can follow how much has been done. When done on a tty you
>>> get a nice progress bar with the percentage that has been cloned.
>>>
>>> The problem came with the execution and redirection of the output, as
>>> the --progress option will write a *very* long line!
>>>
>>> Back then I thought of writing a patch (actually someone suggested I do
>>> so) to add a --batch-mode option which would change the behavior
>>> pg_basebackup has when printing the output messages.
>>
>>
>>
>> While separate lines in the output file is better than one very long line,
>> it still doesn't seem so useful.  If you aren't watching it in real time,
>> then you really need to have a timestamp on each line so that you can
>> interpret the output.  The lines are about one second apart, but I don't
>> know how robust that timing is under all conditions.
> 
> I kind of disagree with your view here.
> 
> If the cloning process takes many hours to complete (in my case, it
> was around 12 hours IIRC) you might want to peek at the log every now
> and then with tail.
> 
> I do agree on adding a timestamp prefix to each line, as it's not
> clear from the code how often progress_report() is called.
> 
>> I think I agree with Arthur that I'd rather have the decision made by
>> inspecting whether output is going to a tty, rather than by adding another
>> command line option.  But maybe that is not detected robustly enough across
>> all platforms and circumstances?
> 
> In this case, Arthur's idea is good, but would make the patch less 
> generic/flexible for the end user.
> 

I'm not quite sure about that. We're not getting the flexibility for
free - we pay for it with additional complexity (another command-line
option). And what valuable capability would we (or the users) lose,
really, by relying on the isatty call?

So what use case is possible with --batch-mode but not with isatty (e.g.
when combined with tee)?

>
> That's why I tried to reproduce what top does when executed with -b 
> (Batch mode operation). There, it's the end user who decides how the 
> output is formatted (well, saying it decides on formatting is a bit of
> an overstatement, but you get it ;) )
> 

In 'top', batch mode also disables input and runs for a certain number
of iterations (as determined by the '-n' option). That's not the case in
pg_basebackup, where it only adds the extra newline.

>
> An example where using isatty() might fail is if you run 
> pg_basebackup from a tty but redirect the output to a file, I
> believe that in that case isatty() will return true, but it's very
> likely that the user might want batch mode output.
> 

IMHO that should work correctly, as already explained by Arthur.

>
> But maybe we should also add Arthur's idea anyway (when not in batch 
> mode), as it seems pretty lame to output progress in one line if you 
> are not in a tty.
> 

I'd say to just use isatty, unless we can come up with a compelling use
case that is not satisfied by it.
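
To illustrate what I mean, here's a minimal sketch of the isatty approach
(assuming the decision happens wherever the progress line gets printed;
the function name below is made up, it's not the actual pg_basebackup
code):

    #include <stdio.h>
    #include <unistd.h>

    static void
    print_progress_line(const char *msg)
    {
        /*
         * Terminate with \r only when writing to an interactive terminal,
         * otherwise use a regular newline (file, pipe, tee, ...).
         */
        char    eol = isatty(fileno(stderr)) ? '\r' : '\n';

        fprintf(stderr, "%s%c", msg, eol);
    }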

And if we end up using --batch-mode, perhaps we should only allow it
when --progress is used. It's the only thing it affects, and it makes no
difference when used without it.

Furthermore, I propose to reword this:

  
   Runs pg_basebackup in batch mode. This is useful
   if the output is to be piped so the other end of the pipe reads each
   line.
  
  
   Using this option with --progress will result in
   printing each progress output with a newline at the end, instead of a
   carriage return.
  

like this:

  
   Runs pg_basebackup in batch mode, in which
   --progress terminates the lines with a regular
   newline instead of a carriage return.
  
  
   This is useful if the output is redirected to a file or a pipe,
   instead of a plain console.
  

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

2017-11-18 Thread Tomas Vondra
Hi,

Apparently there was some minor breakage due to duplicate OIDs, so here
is the patch series updated to current master.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz
Description: application/gzip


0002-BRIN-bloom-indexes.patch.gz
Description: application/gzip


0003-BRIN-multi-range-minmax-indexes.patch.gz
Description: application/gzip


0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz
Description: application/gzip


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-18 Thread Tomas Vondra
Hi,

Attached is an updated version of the patch, adopting the psql describe
changes introduced by 471d55859c11b.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-multivariate-MCV-lists.patch.gz
Description: application/gzip


0002-multivariate-histograms.patch.gz
Description: application/gzip


Re: [HACKERS] Custom compression methods

2017-11-19 Thread Tomas Vondra
Hi,

On 11/14/2017 02:23 PM, Ildus Kurbangaliev wrote:
>
> ...
>
> Attached version 4 of the patch. Fixed pg_upgrade and few other bugs.
> 

I did a review of this today, and I think there are some things that
need improvement / fixing.

Firstly, some basic comments from just eye-balling the diff, then some
bugs I discovered after writing an extension adding lz4.

1) formatRelOptions/freeRelOptions are no longer needed (I see Ildar
already pointed that out)

2) There's unnecessary whitespace (extra newlines) in a couple of
places, which is needlessly increasing the size of the patch. Small
difference, but annoying.

3) tuptoaster.c

Why do you change 'info' from int32 to uint32? Seems unnecessary.

Adding a new 'att' variable in toast_insert_or_update is confusing, as
there already is 'att' in the very next loop. Technically it's correct,
but I'd bet it'll lead to some WTF?! moments later. I propose to just
use TupleDescAttr(tupleDesc, i) in the two places where it matters,
around line 808.

There are no comments for init_compression_options_htab and
get_compression_options_info, so that needs to be fixed. Moreover, the
names are confusing because what we really get is not just 'options' but
the compression routines too.

4) gen_db_file_maps probably shouldn't do the fprintf calls, right?

5) not sure why you modify src/tools/pgindent/exclude_file_patterns

6) I'm rather confused by AttributeCompression vs. ColumnCompression. I
mean, attribute==column, right? Of course, one is for data from the parser,
the other one is for internal info. But can we make the naming clearer?

7) The docs in general are somewhat unsatisfactory, TBH. For example the
ColumnCompression has no comments, unlike everything else in parsenodes.
Similarly for the SGML docs - I suggest expanding them to resemble the FDW
docs (https://www.postgresql.org/docs/10/static/fdwhandler.html), which
also follow the handler/routines pattern.

8) One of the unclear things is why we even need a 'drop' routine. It
seems that if it's defined, DropAttributeCompression does something. But
what should it do? I suppose dropping the options should be done using
dependencies (just like we drop columns in this case).

BTW why does DropAttributeCompression mess with att->attisdropped in
this way? That seems a bit odd.

9) configure routines that only check if (options != NIL) and then error
out (like tsvector_configure) seem a bit unnecessary. Just allow it to
be NULL in CompressionMethodRoutine, and throw an error if options is
not NIL for such a compression method.

10) toast_compress_datum still does this:

    if (!ac && (valsize < PGLZ_strategy_default->min_input_size ||
                valsize > PGLZ_strategy_default->max_input_size))

which seems rather pglz-specific (the naming is a hint). Why shouldn't
this be specific to the compression method, exposed either as min/max
constants, or wrapped in another routine - size_is_valid() or something
like that?

11) The comments in toast_compress_datum probably need updating, as they
still reference pglz specifically. I guess the new compression methods
matter there too.

12) get_compression_options_info organizes the compression info into a
hash table by OID. The hash table implementation assumes the hash key is
at the beginning of the entry, but AttributeCompression is defined like
this:

typedef struct
{
    CompressionMethodRoutine *routine;
    List       *options;
    Oid         cmoptoid;
} AttributeCompression;

Which means get_compression_options_info is busted, will never look up
anything, and the hash table will grow by adding more and more entries
into the same bucket. Of course, this has an extremely negative impact
on performance (pretty much arbitrarily bad, depending on how many
entries you've already added to the hash table).

Moving the OID to the beginning of the struct fixes the issue.
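
For illustration, I mean reordering it like this (just a sketch of the
same struct with the hash key first, nothing else changed):

    typedef struct
    {
        Oid         cmoptoid;       /* hash key - must be the first field */
        CompressionMethodRoutine *routine;
        List       *options;
    } AttributeCompression;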

13) When writing the experimental extension, I was extremely confused
about the regular varlena headers, custom compression headers, etc. In
the end I stole the code from tsvector.c and whacked it a bit until it
worked, but I wouldn't dare to claim I understand how it works.

This needs to be documented somewhere. For example postgres.h has a
bunch of paragraphs about varlena headers, so perhaps it should be
there? I see the patch tweaks some of the constants, but does not update
the comment at all.

Perhaps it would be useful to provide some additional macros making
access to custom-compressed varlena values easier. Or perhaps the
VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already support that? This
part is not very clear to me.



regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pg_lz4.tgz
Description: application/compressed-tar


Re: percentile value check can be slow

2017-11-19 Thread Tomas Vondra
Hi,

On 11/19/2017 03:10 AM, David Fetter wrote:
> On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> Is 'recognizer' an established definition I should know? Is it the same
>> as 'validator' or is it something new/different?
> 
> I borrowed it from http://langsec.org/
> 
> I'm not entirely sure what you mean by a validator, but a recognizer
> is something that gives a quick and sure read as to whether the input
> is well-formed. In general, it's along the lines of a tokenizer, a
> parser, and something that does very light post-parse analysis for
> correctness of form.
> 
> For the case that started the thread, a recognizer would check
> something along the lines of 
> 
> CHECK('[0,1]' @> ALL(input_array))
> 

OK, thanks. From what I understand, a recognizer is more about recognizing
whether a string is valid within a given formal language (essentially, if
it's a well-formed program). That may not be the right term for checks
on parameter values.

OTOH we already have "validators" on a number of places - functions
checking various parameters, e.g. reloptions for FDWs, etc.

But I guess the naming can be solved later ...

>>> Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
>>> Allow optional CHECK constraints in CREATE AGGREGATE for direct
>>> arguments.
>>>
>>
>> How will any of the approaches deal with something like
>>
>> select percentile_cont((select array_agg(v) from p))
>>within group (order by a) from t;
>>
>> In this case the values are not known during parse analysis, so I
>> guess it does not really address that.
> 
> It doesn't.  Does it make sense to do a one-shot execution for cases
> like that?  It might well be worth it to do the aggregate once in
> advance as a throw-away if the query execution time is already going
> to take awhile.  Of course, you can break that one by making p a JOIN
> to yet another thing...
> 
>> FWIW while I won't stand in the way of improving this, I wonder if this
>> is really worth the additional complexity. If you get errors like this
>> with a static list of values, you will fix the list and you're done. If
>> the list is dynamic (computed in the query itself), you'll still get the
>> error much later during query execution.
>>
>> So if you're hitting failures like this often enough for the "delayed
>> error reporting" to be an issue, perhaps there's something wrong in your
>> stack and you should address that instead?
> 
> I'd like to think that getting something to fail quickly and cheaply
> when it can will give our end users a better experience.  Here,
> "cheaply" refers to their computing resources and time.

The trouble is, this increases execution time for everyone, including
people who carefully construct the parameter values. That seems rather
undesirable.

>
> Clearly, not having this happen in this case bothered Johannes
> enough to wade in here.
> 

No. He was surprised the error is reported after a significant amount of
time, but he does not seem to claim failing faster would be valuable to
him. That is your assumption, and I have my doubts about it.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] CUBE seems a bit confused about ORDER BY

2017-11-19 Thread Tomas Vondra


On 10/30/2017 03:04 PM, Alexander Korotkov wrote:
> On Sun, Oct 29, 2017 at 1:30 AM, Alexander Korotkov
> <a.korot...@postgrespro.ru <mailto:a.korot...@postgrespro.ru>> wrote:
...
> 
> Thus, I heard no objection to my approach.  Attached patch changes ~>
> operator in the proposed way and fixes KNN-GiST search for that.  I'm
> going to register this patch to the nearest commitfest.
> 

Seems fine to me, although perhaps it should be split into two parts:
one with the cube_coord_llur fixes, and the other with the g_cube_distance
changes adding support for negative coordinates.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] GnuTLS support

2017-11-19 Thread Tomas Vondra
Hi,

On 11/02/2017 11:33 PM, Andreas Karlsson wrote:
> On 09/18/2017 07:04 PM, Jeff Janes wrote:> You fixed the first issue,
> but I still get the second one:
>>
>> be-secure-gnutls.c: In function 'get_peer_certificate':
>> be-secure-gnutls.c:667: error: 'GNUTLS_X509_CRT_LIST_SORT' undeclared
>> (first use in this function)
>> be-secure-gnutls.c:667: error: (Each undeclared identifier is reported
>> only once
>> be-secure-gnutls.c:667: error: for each function it appears in.)
> 
> Thanks again for testing the code. I have now rebased the patch and
> fixed the second issue. I tested that it works on CentOS 6.
> 
> Work which remains:
> 
> - sslinfo
> - pgcrypto
> - Documentation
> - Decide if what I did with the config is a good idea
> 

I don't want to be the annoying guy, but this patch no longer applies
due to 642bafa0c5f9f08d106a14f31429e0e0c718b603 touching the tests :-(

BTW regarding the config, I believe you've kept it static (no hiding of
sections based on configure parameters), and you've separated the
openssl and gnutls options, right? Seems fine to me. The one thing is
that it was proposed to rename the openssl-specific options to start
with openssl_ instead of ssl_.

One question though. What happens when you do

  ./configure --with-openssl --with-gnutls

If I get it right, we ignore gnutls and use openssl (as it's checked first
in the #ifdefs). Shouldn't we enforce in configure that only one TLS
implementation is enabled? Either by some elaborate check, or by
switching to something like

 --with-ssl=(openssl|gnutls)


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-19 Thread Tomas Vondra


On 11/15/2017 02:13 PM, Robert Haas wrote:
> On Wed, Nov 15, 2017 at 4:09 AM, Ildus Kurbangaliev
> <i.kurbangal...@postgrespro.ru> wrote:
>> So in the next version of the patch I can just unlink the options from
>> compression methods and dropping compression method will not affect
>> already compressed tuples. They still could be decompressed.
> 
> I guess I don't understand how that can work.  I mean, if somebody
> removes a compression method - i.e. uninstalls the library - and you
> don't have a way to make sure there are no tuples that can only be
> uncompressed by that library - then you've broken the database.
> Ideally, there should be a way to add a new compression method via an
> extension ... and then get rid of it and all dependencies thereupon.
> 

I share your confusion. Once you do DROP COMPRESSION METHOD, there must
be no remaining data compressed with it. But that's what the patch is
doing already - it enforces this using dependencies, as usual.

Ildus, can you explain what you meant? How could the data still be
decompressed after DROP COMPRESSION METHOD, and possibly after removing
the .so library?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-21 Thread Tomas Vondra

On 11/21/2017 09:28 PM, Ildus K wrote:
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE:  drop cascades to 2 other objects
>> DETAIL:  drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
> 
> I will check that. Is your extension published somewhere?
> 

No, it was just an experiment, so I've only attached it to the initial
review. Attached is an updated version, with a fix or two.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pg_lz4.tgz
Description: application/compressed-tar


Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra


On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
>> On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
> 
>>> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
>>> tsvector_compression_handler;
>>
>> Understood. Good to know you've considered it, and I agree it doesn't
>> need to be there from the start (which makes the patch simpler).
> 
> Just passing by, but wouldn't this fit in the ACCESS METHOD group of
> commands?  So this could be simplified down to
> CREATE ACCESS METHOD ts1 TYPE COMPRESSION
> we have that for indexes and there are patches flying for heap storage,
> sequences, etc.  I think that's simpler than trying to invent all new
> commands here.  Then (in a future patch) you can use ALTER TYPE to
> define compression for that type, or even add a column-level option to
> reference a specific compression method.
> 

I think that would conflate two very different concepts. In my mind,
access methods define how rows are stored. Compression methods are an
orthogonal concept, e.g. you can compress a value (using a custom
compression algorithm) and store it in an index (using whatever access
method it's using). So not only do access methods operate on rows (while
compression operates on varlena values), but you can also combine those two
things together. I don't see how you could do that if both are defined
as "access methods" ...

Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to
restrict the compression algorithm to a particular data type (so, if it
relies on tsvector, you can't apply it to text columns). Which is very
different from "TYPE COMPRESSION" in CREATE ACCESS METHOD.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-02 Thread Tomas Vondra
On 12/01/2017 10:52 PM, Andres Freund wrote:
> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>> Honestly, if we can give everybody a 4% space reduction by
>> switching to lz4, I think that's totally worth doing -- but let's
>> not make people choose it, let's make it the default going forward,
>> and keep pglz support around so we don't break pg_upgrade
>> compatibility (and so people can continue to choose it if for some
>> reason it works better in their use case). That kind of improvement
>> is nothing special in a specific workload, but TOAST is a pretty
>> general-purpose mechanism. I have become, through a few bitter
>> experiences, a strong believer in the value of trying to reduce our
>> on-disk footprint, and knocking 4% off the size of every TOAST
>> table in the world does not sound worthless to me -- even though
>> context-aware compression can doubtless do a lot better.
> 
> +1. It's also a lot faster, and I've seen way way to many workloads
> with 50%+ time spent in pglz.
> 

TBH the 4% figure is something I mostly made up (I'm fake news!). On the
mailing list archive (which I believe is pretty compressible) I observed
something like 2.5% size reduction with lz4 compared to pglz, at least
with the compression levels I've used ...

Other algorithms (e.g. zstd) got significantly better compression (25%)
compared to pglz, but in exchange for longer compression times. I'm sure we
could lower the compression level to make it faster, but that will of course
hurt the compression ratio.

I don't think switching to a different compression algorithm is a way
forward - it was proposed and explored repeatedly in the past, and every
time it failed for a number of reasons, most of which are still valid.


Firstly, it's going to be quite hard (or perhaps impossible) to find an
algorithm that is "universally better" than pglz. Some algorithms do
work better for text documents, some for binary blobs, etc. I don't
think there's a win-win option.

Sure, there are workloads where pglz performs poorly (I've seen such
cases too), but IMHO that's more an argument for the custom compression
method approach. pglz gives you good default compression in most cases,
and you can change it for columns where it matters, and where a
different space/time trade-off makes sense.


Secondly, all the previous attempts ran into some legal issues, i.e.
licensing and/or patents. Maybe the situation changed since then (no
idea, haven't looked into that), but in the past the "pluggable"
approach was proposed as a way to address this.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra
On 12/01/2017 03:23 PM, Robert Haas wrote:
> On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> OK. I think it's a nice use case (and nice gains on the compression
>> ratio), demonstrating the datatype-aware compression. The question is
>> why shouldn't this be built into the datatypes directly?
> 
> Tomas, thanks for running benchmarks of this.  I was surprised to see
> how little improvement there was from other modern compression
> methods, although lz4 did appear to be a modest win on both size and
> speed.  But I share your intuition that a lot of the interesting work
> is in datatype-specific compression algorithms.  I have noticed in a
> number of papers that I've read that teaching other parts of the
> system to operate directly on the compressed data, especially for
> column stores, is a critical performance optimization; of course, that
> only makes sense if the compression is datatype-specific.  I don't
> know exactly what that means for the design of this patch, though.
> 

It has very little impact on this patch, as it has nothing to do with
columnar storage. That is, each value is compressed independently.

Column stores exploit the fact that they get a vector of values,
compressed in some data-aware way. E.g. some form of RLE or dictionary
compression, which allows them to evaluate expressions on the compressed
vector. But that's irrelevant here, we only get row-by-row execution.

Note: The idea to build a dictionary for the whole jsonb column (which
this patch should allow) does not make it "columnar compression" in the
"column store" way. The executor will still get the decompressed value.

> As a general point, no matter which way you go, you have to somehow
> deal with on-disk compatibility.  If you want to build in compression
> to the datatype itself, you need to find at least one bit someplace to
> mark the fact that you applied built-in compression.  If you want to
> build it in as a separate facility, you need to denote the compression
> used someplace else.  I haven't looked at how this patch does it, but
> the proposal in the past has been to add a value to vartag_external.

AFAICS the patch does that by setting a bit in the varlena header, and
then adding the OID of the compression method after the varlena header.
So you get (varlena header + OID + data).
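
Roughly something like this, I believe (just my reading of the layout, not
an actual struct from the patch - the names below are made up):

    typedef struct
    {
        int32       vl_len_;    /* varlena header, with a bit marking
                                 * custom-compressed values */
        Oid         cmoptoid;   /* OID of the compression options */
        char        data[FLEXIBLE_ARRAY_MEMBER];   /* compressed payload */
    } CustomCompressedVarlena;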

This has good and bad consequences.

Good: It's transparent for the datatype, so it does not have to worry
about the custom compression at all (and it may change arbitrarily).

Bad: It's transparent for the datatype, so it can't operate directly on
the compressed representation.

I don't think this is an argument against the patch, though. If the
datatype can support intelligent compression (and execution without
decompression), it has to be done in the datatype anyway.

> One nice thing about the latter method is that it can be used for any
> data type generically, regardless of how much bit-space is available
> in the data type representation itself.  It's realistically hard to
> think of a data-type that has no bit space available anywhere but is
> still subject to data-type specific compression; bytea definitionally
> has no bit space but is also can't benefit from special-purpose
> compression, whereas even something like text could be handled by
> starting the varlena with a NUL byte to indicate compressed data
> following.  However, you'd have to come up with a different trick for
> each data type.  Piggybacking on the TOAST machinery avoids that.  It
> also implies that we only try to compress values that are "big", which
> is probably be desirable if we're talking about a kind of compression
> that makes comprehending the value slower. Not all types of
> compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
> and for those that don't it probably makes more sense to just build it
> into the data type.
> 
> All of that is a somewhat separate question from whether we should
> have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
> the ACCESS METHOD stuff instead).  Even if we agree that piggybacking
> on TOAST is a good way to implement pluggable compression methods, it
> doesn't follow that the compression method is something that should be
> attached to the datatype from the outside; it could be built into it
> in a deep way.  For example, "packed" varlenas (1-byte header) are a
> form of compression, and the default functions for detoasting always
> produced unpacked values, but the operators for the text data type
> know how to operate on the packed representation.  That's sort of a
> trivial example, but it might well be that there are other cases where
> we can do something similar.  Maybe jsonb, for example, can compress
> data in such a

Re: [HACKERS] Custom compression methods

2017-12-02 Thread Tomas Vondra

On 12/02/2017 09:24 PM, konstantin knizhnik wrote:
> 
> On Dec 2, 2017, at 6:04 PM, Tomas Vondra wrote:
> 
>> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> ...
>>
>> Other algorithms (e.g. zstd) got significantly better compression (25%)
>> compared to pglz, but in exchange for longer compression. I'm sure we
>> could lower compression level to make it faster, but that will of course
>> hurt the compression ratio.
>>
>> I don't think switching to a different compression algorithm is a way
>> forward - it was proposed and explored repeatedly in the past, and every
>> time it failed for a number of reasons, most of which are still valid.
>>
>>
>> Firstly, it's going to be quite hard (or perhaps impossible) to
>> find an algorithm that is "universally better" than pglz. Some
>> algorithms do work better for text documents, some for binary
>> blobs, etc. I don't think there's a win-win option.
>>
>> Sure, there are workloads where pglz performs poorly (I've seen
>> such cases too), but IMHO that's more an argument for the custom
>> compression method approach. pglz gives you good default
>> compression in most cases, and you can change it for columns where
>> it matters, and where a different space/time trade-off makes
>> sense.
>>
>>
>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
>>
>>
> 
> May be it will be interesting for you to see the following results
> of applying page-level compression (CFS in PgPro-EE) to pgbench
> data:
> 

I don't follow. If I understand what CFS does correctly (and I'm mostly
guessing here, because I haven't seen the code published anywhere, and I
assume it's proprietary), it essentially compresses whole 8kB blocks.

I don't know whether it first reorganizes the data into a columnar format
in some way (to make it more "columnar" and thus more compressible), which
would make it somewhat similar to page-level compression in Oracle.

But it's clearly a very different approach from what the patch aims to
improve (compressing individual varlena values).

> 
> All algorithms (except zlib) were used with best-speed option: using 
> better compression level usually has not so large impact on
> compression ratio (<30%), but can significantly increase time
> (several times). Certainly pgbench is not the best candidate for
> testing compression algorithms: it generates a lot of artificial and
> redundant data. But we measured it also on real customer data and
> still zstd seems to be the best compression method: it provides good
> compression with the smallest CPU overhead.
> 

I think this really depends on the dataset, and drawing conclusions
based on a single test is somewhat crazy. Especially when it's synthetic
pgbench data with lots of inherent redundancy - sequential IDs, ...

My takeaway from the results is rather that page-level compression may
be very beneficial in some cases, although I wonder how much of that can
be gained by simply using a compressed filesystem (thus making it
transparent to PostgreSQL).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-02 Thread Tomas Vondra
On 12/02/2017 09:38 PM, Andres Freund wrote:
> Hi,
> 
> On 2017-12-02 16:04:52 +0100, Tomas Vondra wrote:
>> Firstly, it's going to be quite hard (or perhaps impossible) to find an
>> algorithm that is "universally better" than pglz. Some algorithms do
>> work better for text documents, some for binary blobs, etc. I don't
>> think there's a win-win option.
> 
> lz4 is pretty much there.
> 

That's a matter of opinion, I guess. It's a solid compression algorithm,
that's for sure ...

>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
> 
> Those were pretty bogus.

IANAL, so I don't dare to judge the bogusness of such claims. I assume
that if we made it optional (e.g. a configure/initdb option), it'd be much
less of an issue. Of course, that has disadvantages too (because when you
compile/init with one algorithm, and then find something else would work
better for your data, you have to start from scratch).

>
> I think we're not doing our users a favor if they've to download
> some external projects, then fiddle with things, just to not choose
> a compression algorithm that's been known bad for at least 5+ years.
> If we've a decent algorithm in-core *and* then allow extensibility, 
> that's one thing, but keeping the bad and tell forks "please take
> our users with this code we give you" is ...
> 

I don't understand what exactly is your issue with external projects,
TBH. I think extensibility is one of the great strengths of Postgres.
It's not all rainbows and unicorns, of course, and it has costs too.

FWIW I don't think pglz is a "known bad" algorithm. Perhaps there are
cases where other algorithms (e.g. lz4) are running circles around it,
particularly when it comes to decompression speed, but I wouldn't say
it's "known bad".

Not sure which forks you're talking about ...

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rethinking MemoryContext creation

2017-12-11 Thread Tomas Vondra


On 12/11/2017 05:27 PM, Tom Lane wrote:
> I wrote:
>> While fooling around with a different problem, I got annoyed at how slow
>> MemoryContext creation and deletion is.
> 
> Looking at this some more, there's another micro-optimization we could
> make, which is to try to get rid of the strlen+strcpy operations needed
> for the context name.  (And yes, I'm seeing those show up in profiles,
> to the tune of a couple percent of total runtime in some examples.)
> 
> For a *very* large majority of the callers of AllocSetContextCreate,
> the context name is a simple C string constant, so we could just store
> the pointer to it and save the space and cycles required to copy it.
> This would require providing a separate API for the few callers that are
> actually passing transient strings, but that's easy enough.  I envision
> AllocSetContextCreate becoming a wrapper macro for
> "AllocSetContextCreateExtended", which'd take a flag argument specifying
> whether the context name needs to be copied.
> 
> However, unless we want to run around and touch all the ~ 150 calls
> with constant arguments, we'd have to set things up so that the default
> behavior for AllocSetContextCreate is to not copy.  This risks breaking
> callers in extensions.  Not entirely sure if it's worth that --- any
> thoughts?
> 

I don't think silently breaking extensions is a particularly attractive
option, so I guess we'll have to run around and tweak the ~150 calls.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rethinking MemoryContext creation

2017-12-11 Thread Tomas Vondra
On 12/11/2017 06:22 PM, Robert Haas wrote:
> On Mon, Dec 11, 2017 at 11:59 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>>> On 12/11/2017 05:27 PM, Tom Lane wrote:
>>>> However, unless we want to run around and touch all the ~ 150 calls
>>>> with constant arguments, we'd have to set things up so that the default
>>>> behavior for AllocSetContextCreate is to not copy.  This risks breaking
>>>> callers in extensions.  Not entirely sure if it's worth that --- any
>>>> thoughts?
>>
>>> I don't think silently breaking extensions is particularly attractive
>>> option, so I guess we'll have to run around and tweak the ~150 calls.
>>
>> Meh.  I suppose that of the ~150 call sites, there are probably only
>> a dozen or two where it would actually make a performance difference,
>> so maybe this needn't be quite as invasive as I first thought.
> 
> I think changing only a subset of the call sites is unappealing
> because, even though it may not make a measurable performance
> difference in other cases, it may get cargo-culted into some place
> where it does make a difference.
> 

Not sure. On the one hand, it might certainly be somewhat confusing
when we use two different methods to create memory contexts with C
string constants. On the other hand, I'm sure we have other similar
"local" optimizations.

I'd say "let's just tweak all the calls to use the new function" but I'm
not the person doing the leg work ...

> I also don't think silently breaking extensions is a terribly big deal
> in this case.  It seems likely that most extensions use static names
> just as most of our internal stuff does.  I'm going to guess that the
> number of extensions that will actually break as a result of a change
> in this area is probably very small - conceivably zero, and likely
> less than five.  I don't think we should be willing to uglify the core
> code too much for that level of breakage.
> 

I don't know how many extensions you maintain, but in my experience this
type of silent breakage is extremely annoying. I definitely prefer a
clean compile-time API breakage, for example.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-11 Thread Tomas Vondra
Hi,

I see there's an ongoing discussion about the syntax and ALTER TABLE
behavior when changing a compression method for a column. So the patch
seems to be on the way to be ready in the January CF, I guess.

But let me play the devil's advocate for a while and question the
usefulness of this approach to compression. Some of the questions were
mentioned in the thread before, but I don't think they got the attention
they deserve.

FWIW I don't know the answers, but I think it's important to ask them.
Also, apologies if this post looks to be against the patch - that's part
of the "devil's advocate" thing.


The main question I'm asking myself is what use cases the patch
addresses, and whether there is a better way to do that. I see about
three main use-cases:

1) Replacing the algorithm used to compress all varlena types (in a way
that makes it transparent for the data type code).

2) Custom datatype-aware compression (e.g. the tsvector).

3) Custom datatype-aware compression with additional column-specific
metadata (e.g. the jsonb with external dictionary).

Now, let's discuss those use cases one by one, and see if there are
simpler (or better in some way) solutions ...


Replacing the algorithm used to compress all varlena values (in a way
that makes it transparent for the data type code).
--

While pglz served us well over time, it was repeatedly mentioned that in
some cases it becomes the bottleneck. So supporting other state of the
art compression algorithms seems like a good idea, and this patch is one
way to do that.

But perhaps we should simply make it an initdb option (in which case the
whole cluster would simply use e.g. lz4 instead of pglz)?

That seems like a much simpler approach - it would only require some
./configure options to add --with-lz4 (and other compression libraries),
an initdb option to pick the compression algorithm, and probably noting
the choice in the cluster controldata.

No dependencies tracking, no ALTER TABLE issues, etc.

Of course, it would not allow using different compression algorithms for
different columns (although it might perhaps allow different compression
levels, to some extent).

Conclusion: If we want to offer a simple cluster-wide pglz alternative,
perhaps this patch is not the right way to do that.


Custom datatype-aware compression (e.g. the tsvector)
--

Exploiting knowledge of the internal data type structure is a promising
way to improve compression ratio and/or performance.

The obvious question of course is why shouldn't this be done by the data
type code directly, which would also allow additional benefits like
operating directly on the compressed values.

Another thing is that if the datatype representation changes in some
way, the compression method has to change too. So it's tightly coupled
to the datatype anyway.

This does not really require any new infrastructure, all the pieces are
already there.

In some cases that may not be quite possible - the datatype may not be
flexible enough to support an alternative (compressed) representation,
e.g. because there are no bits available for a "compressed" flag, etc.

Conclusion: IMHO if we want to exploit the knowledge of the data type
internal structure, perhaps doing that in the datatype code directly
would be a better choice.


Custom datatype-aware compression with additional column-specific
metadata (e.g. the jsonb with external dictionary).
--

Exploiting redundancy in multiple values in the same column (instead of
compressing them independently) is another attractive way to help the
compression. It is inherently datatype-aware, but currently can't be
implemented directly in datatype code as there's no concept of
column-specific storage (e.g. to store a dictionary shared by all values
in a particular column).

I believe any patch addressing this use case would have to introduce
such column-specific storage, and any solution doing that would probably
need to introduce the same catalogs, etc.

The obvious disadvantage of course is that we need to decompress the
varlena value before doing pretty much anything with it, because the
datatype is not aware of the compression.

So I wonder if the patch should instead provide infrastructure for doing
that in the datatype code directly.

The other question is whether the patch should introduce some infrastructure
for handling the column context (e.g. column dictionary). Right now,
whoever implements the compression has to implement this bit too.



regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra

On 12/01/2017 08:38 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
>> On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
> 
>>> Just passing by, but wouldn't this fit in the ACCESS METHOD group of
>>> commands?  So this could be simplified down to
>>> CREATE ACCESS METHOD ts1 TYPE COMPRESSION
>>> we have that for indexes and there are patches flying for heap storage,
>>> sequences, etc.
>>
>> I think that would conflate two very different concepts. In my mind,
>> access methods define how rows are stored.
> 
> In mine, they define how things are accessed (i.e. more general than
> what you're thinking).  We *currently* use them to store rows [in
> indexes], but there is no reason why we couldn't expand that.
> 

Not sure I follow. My argument was not as much about whether the rows
are stored as rows or in some other (columnar) format, but that access
methods deal with "tuples" (i.e. rows in the "logical" sense). I assume
that even if we end up implementing other access method types, they will
still be tuple-based.

OTOH compression methods (at least as introduced by this patch) operate
on individual values, and have very little to do with access to the
value (in a sense it's a transparent thing).

>
> So we group access methods in "types"; the current type we have is for
> indexes, and methods in that type define how are indexes accessed.  This
> new type would indicate how would values be compressed.  I disagree that
> there is no parallel there.
> 
> I'm trying to avoid pointless proliferation of narrowly defined DDL
> commands.
> 

Of course, the opposite case is using the same DDL for very different
concepts (although I understand you don't see it that way).

But in fairness, I don't really care if we call this COMPRESSION METHOD
or ACCESS METHOD or DARTH VADER ...

>> Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to
>> restrict the compression algorithm to a particular data type (so, if it
>> relies on tsvector, you can't apply it to text columns).
> 
> Yes, of course.  I'm saying that the "datatype" property of a
> compression access method would be declared somewhere else, not in the
> TYPE clause of the CREATE ACCESS METHOD command.  Perhaps it makes sense
> to declare that a certain compression access method is good only for a
> certain data type, and then you can put that in the options clause,
> "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)".
> But many compression access methods would be general in nature and so
> could be used for many datatypes (say, snappy).
> 
> To me it makes sense to say "let's create this method which is for data
> compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by
> either "let's use this new compression method for the type tsvector"
> (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new
> compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET
> COMPRESSION hyperz).
> 

The WITH syntax does not seem particularly pretty to me, TBH. I'd be
much happier with "TYPE tsvector" and leaving WITH for the options
specific to each compression method.

FWIW I think syntax is the least critical part of this patch. It's ~1%
of the patch, and the gram.y additions are rather trivial.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: GROUP BY optimization

2018-06-06 Thread Tomas Vondra
ay be:
1) first column - the column with bigger estimated number of groups
2) second column - the pair of (first, second) with bigger estimated 
number of groups

3) third column - the triple of (first, second, third) with bigger ...

But it seems even with that algorithm, ISTM, it could be implemented in a
cheaper manner.




Maybe. I do have some ideas, although I haven't tried implementing it.

If there's no extended statistic on the columns, you can do the current 
thing (assuming independence etc.). There's not much we can do here.


If there's an extended statistic, you can do either a greedy search (get 
the next column with the highest ndistinct coefficient) or exhaustive 
search (computing the estimated number of comparisons).
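
For illustration, the greedy variant might look something like this (a toy
sketch only - the greedy_reorder/ndistinct_of_prefix names are hypothetical,
with the callback standing in for a lookup into the extended statistics):

    typedef double (*ndistinct_fn) (const int *cols, int ncols);

    static void
    swap_cols(int *cols, int a, int b)
    {
        int     tmp = cols[a];

        cols[a] = cols[b];
        cols[b] = tmp;
    }

    /*
     * Reorder cols[] in place, greedily picking the next column that
     * maximizes the ndistinct estimate of the prefix built so far.
     */
    static void
    greedy_reorder(int *cols, int ncols, ndistinct_fn ndistinct_of_prefix)
    {
        for (int prefix = 0; prefix < ncols; prefix++)
        {
            int     best = prefix;
            double  best_nd = -1.0;

            for (int i = prefix; i < ncols; i++)
            {
                double  nd;

                swap_cols(cols, prefix, i);     /* try column i next */
                nd = ndistinct_of_prefix(cols, prefix + 1);
                swap_cols(cols, prefix, i);     /* undo */

                if (nd > best_nd)
                {
                    best_nd = nd;
                    best = i;
                }
            }

            swap_cols(cols, prefix, best);
        }
    }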


Another challenge is that using only the ndistinct coefficient assumes 
uniform distribution of the values. But you can have a column with 1M 
distinct values, where a single value represents 99% of the rows. And 
another column with 100k distinct values, with actual uniform 
distribution. I'm pretty sure it'd be more efficient to place the 100k 
column first.




The real issue however is that this decision can't be made entirely 
locally. Consider for example this:


 explain select a,b,c, count(*) from t group by a,b,c order by c,b,a;

Which is clearly cheaper (at least according to the optimizer) than 
doing two separate sorts. So the optimization actually does the wrong 
thing here, and it needs to somehow consider the other ordering 
requirements (in this case ORDER BY) - either by generating multiple 
paths with different orderings or some heuristics.
Hm, thank you. I consider it a bug of my implementation - the basic idea
was that we try to match an already existing or needed order, and only if
we fail or have an unmatched tail of the pathkey list will we try to find
the cheapest column order.
Fixed in v7 (0002-opt_group_by_index_and_order-v7.patch), though maybe in
a naive way: if we don't have a path pathkey, first try to reorder columns
according to the ORDER BY clause. A test for your case is also added.




OK. I'll give it a try.



I'm also wondering how freely we can change the GROUP BY result
ordering. Assume you disable hash aggregate and parallel query -
currently we are guaranteed to use a group aggregate that produces
exactly the ordering specified in GROUP BY, but this patch removes
that "guarantee" and we can produce an arbitrary permutation of the
ordering. But as I argued in other threads, such implicit guarantees
are really fragile, can silently break for arbitrary reasons (say,
parallel query will do just that) and can be easily fixed by adding a
proper ORDER BY. So I don't think this is an issue.
Agree. SQL by design doesn't guarantee any particular order without an
explicit ORDER BY clause.


The other random thought is how will/could this interact with the 
incremental sorts patch. I know Alexander wanted to significantly 
limit where we actually consider the incremental sort, but I'm not 
sure if this was one of those places or not (it sure looks like a
place where we would greatly benefit from it).
Seems they should not directly interact. The patch tries to find the
cheapest column order, Alexander's patch tries to make the sort cheaper -
they are different tasks. But I will try.




Thanks.

So to summarize this initial review - I do suggest splitting the patch 
into two parts. One that does the index magic, and one for this 
reordering optimization. The first part (additional indexes) seems 
fairly safe, and likely to be committable soon. The other part 
(ndistinct reordering) IMHO requires more thought regarding costing 
and interaction with other query parts.

Thank you for the review!



;-)


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Tomas Vondra




On 06/05/2018 07:39 PM, David Fetter wrote:

On Tue, Jun 05, 2018 at 01:27:01PM -0400, Tom Lane wrote:

David Fetter  writes:

On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:

True. Although not all built in aggregates have those defined.



Just out of curiosity, which ones don't? As of
3f85c62d9e825eedd1315d249ef1ad793ca78ed4, pg_aggregate has both of
those as NOT NULL.


NOT NULL isn't too relevant; that's just protecting the fixed-width
nature of the catalog rows.  What's important is which ones are zero.


Thanks for helping me understand this better.


# select aggfnoid::regprocedure, aggkind from pg_aggregate where (aggserialfn=0 
or aggdeserialfn=0) and aggtranstype = 'internal'::regtype;
aggfnoid   | aggkind
--+-
[snip]
(19 rows)

Probably the ordered-set/hypothetical ones aren't relevant for this
issue.

Whether or not we feel like fixing the above "normal" aggs for this,
the patch would have to not fail on extension aggregates that don't
support serialization.


Could there be some kind of default serialization with reasonable
properties?



Not really, because the aggregates often use "internal" i.e. a pointer 
referencing whothehellknowswhat, and how do you serialize/deserialize 
that? The other issue is that serialize/deserialize is only part of the
problem - you also need to know how to do "combine", and not all
aggregates can do that ... (certainly not in a universal way).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: GROUP BY optimization

2018-06-09 Thread Tomas Vondra
On 06/09/2018 08:09 PM, Tomas Vondra wrote:
> 
> /snip/
> 
> 4) when adding Sort for grouping, try producing the right output order
>(if the ORDER BY was specified)
> 

BTW I've just realized we already do something similar in master. If you
run a query like this:

  SELECT a, b, count(*) FROM t GROUP BY b, a ORDER BY a;

we will actually plan it like this:

  QUERY PLAN
  ---
   GroupAggregate
 Group Key: a, b
 ->  Sort
   Sort Key: a, b
   ->  Seq Scan on t
  (5 rows)

I.e. we already do reorder the group clauses to match ORDER BY, to only
require a single sort. This happens in preprocess_groupclause(), which
also explains the reasoning behind that.

I wonder if some of the new code reordering group pathkeys could/should
be moved here (not sure, maybe it's too early for those decisions). In
any case, it might be appropriate to update some of the comments before
preprocess_groupclause() which claim we don't do certain things added by
the proposed patches.

This probably also somewhat refutes my claim that the order of grouping
keys is currently fully determined by users (and so they may pick the
most efficient order), while the reorder-by-ndistinct patch would make
that impossible. Apparently when there's ORDER BY, we already mess with
the order of group clauses - there are ways to get around it (subquery
with OFFSET 0) but it's much less clear.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: GROUP BY optimization

2018-06-09 Thread Tomas Vondra
). I mean, we first call

  n_preordered = group_keys_reorder_by_pathkeys(path->pathkeys, ...);

so the reordering tries to maintain ordering of the input path. Then if
(n_preordered == 0) we do this:

  group_keys_reorder_by_pathkeys(root->sort_pathkeys, ...);

Doesn't that mean that if we happen to have an input path with partially
matching ordering (so still requiring an explicit Sort before grouping) we
may end up ignoring the ORDER BY entirely? Instead of using a Sort that
would match the ORDER BY? I might have misunderstood this, though.

I'm not sure why the first group_keys_reorder_by_pathkeys() call does
not simply return 0 when the input path ordering is not sufficient for
the grouping? Is there some reasoning behind that (e.g. that sorting
partially sorted data is faster)?

Overall I think this patch introduces four different optimizations that
are indeed very interesting:

1) consider additional sorted paths for grouping

2) reorder GROUP BY clauses per ndistinct to minimize comparisons

3) when adding Sort for grouping, maintain partial input ordering

4) when adding Sort for grouping, try producing the right output order
   (if the ORDER BY was specified)

But at this point it's kinda mashed together in an ad-hoc manner, using
very simple heuristics / assumptions. We'll need to figure out how to
combine those optimizations.

BTW I get compiler warnings that n_preordered_groups may be used
uninitialized in add_paths_to_grouping_rel, because it's only set in an
if/else branch, and then used further down.

> Next, I had a look at cost_incremental_sort() provided by the
> incremental sort patch and, it's a pity, it doesn't solve our problem
> with the impact of the cost of per-column comparison function and
> number of its calls.

It currently doesn't, but that might be more of an issue in the incremental
sort patch and we may need to fix that. Clearly the costing in general
in that patch needs more work, and I do recall Tom pointing out that the
current heuristics (estimating the number of sort groups using ndistincts)
seem rather weak.

It may not be exactly the same problem, but it seems kinda similar to
what this patch does. I have a hunch that those patches will end up
inventing something fairly similar.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: GROUP BY optimization

2018-06-07 Thread Tomas Vondra

On 06/07/2018 03:41 PM, Teodor Sigaev wrote:
>>

... snip ...

>>

Prioritization of the user-provided order can be as simple as giving
that comparison_cost a small handicap.


I see no point in doing that, and I don't recall a single place in the
planner where we do that. If the user specified ORDER BY, we'll slap an
explicit Sort on top when needed (which acts as the handicap, but in a
clear way). Otherwise we don't do such things - it'd be just plain
confusing (consider "ORDER BY a,b" vs. "ORDER BY b,c" with same data
types, ndistinct etc. but unexpectedly different costs). Also, what
would be a good value for the handicap?


Again, agree. If we have a fixed order of columns (ORDER BY) then we should
not try to reorder it. The current patch follows that, if I didn't make a mistake.




This part seems to be more a misunderstanding between me and Claudio. I 
believe Claudio was referring to the column order in a GROUP BY, not 
ORDER BY. In which case we don't add any Sort, of course.


I'm still opposed to adding arbitrary handicap to prioritize the order 
specified by user, for the reasons I explained before. We should aim to 
make the heuristics/costing reliable enough to make this unnecessary.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-16 Thread Tomas Vondra




On 06/16/2018 12:06 PM, Thomas Munro wrote:

On Sat, Jun 16, 2018 at 9:38 PM, Tomas Vondra
 wrote:

On 06/15/2018 08:01 PM, Andres Freund wrote:

On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:

On 14.06.2018 09:52, Thomas Munro wrote:

Why stop at the page cache...  what about shared buffers?


It is good question. I thought a lot about prefetching directly to shared
buffers.


I think that's definitely how this should work.  I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.


Could you elaborate why prefetching into s_b is so much better (I'm sure it has 
advantages, but I suppose prefetching into page cache would be much easier to 
implement).


posix_fadvise(POSIX_FADV_WILLNEED) might already get most of the
speed-up available here in the short term for this immediate
application, but in the long term a shared buffers prefetch system is
one of the components we'll need to support direct IO.



Sure. Assuming the switch to direct I/O will happen (it probably will, 
sooner or later), my question is whether this patch should be required 
to introduce the prefetching into s_b. Or should we use posix_fadvise 
for now, get most of the benefit, and leave the prefetch into s_b as an 
improvement for later?


The thing is - we're already doing posix_fadvise prefetching in bitmap 
heap scans, it would not be putting additional burden on the direct I/O 
patch (hypothetical, so far).
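
For reference, the posix_fadvise approach boils down to something like
this - a minimal standalone sketch, nothing from the actual patch (the
file name is made up, and BLCKSZ is assumed to be the default 8kB):

  #define _POSIX_C_SOURCE 200112L

  #include <fcntl.h>
  #include <unistd.h>

  #define BLCKSZ 8192

  /* ask the kernel to read one block ahead of time; it's only a hint */
  static void
  prefetch_block(int fd, off_t blkno)
  {
      (void) posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ,
                           POSIX_FADV_WILLNEED);
  }

  int
  main(void)
  {
      /* stand-in for a relation segment file */
      int fd = open("/tmp/relation.segment", O_RDONLY);

      if (fd < 0)
          return 1;

      for (off_t blkno = 0; blkno < 16; blkno++)
          prefetch_block(fd, blkno);

      close(fd);
      return 0;
  }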


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-16 Thread Tomas Vondra




On 06/15/2018 08:01 PM, Andres Freund wrote:

On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:



On 14.06.2018 09:52, Thomas Munro wrote:

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
 wrote:

pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache...  what about shared buffers?



It is good question. I thought a lot about prefetching directly to shared
buffers.


I think that's definitely how this should work.  I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.



Could you elaborate why prefetching into s_b is so much better (I'm sure 
it has advantages, but I suppose prefetching into page cache would be 
much easier to implement).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-16 Thread Tomas Vondra
On 06/16/2018 09:02 PM, Andres Freund wrote:
> On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
>>
>>
>> On 06/15/2018 08:01 PM, Andres Freund wrote:
>>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>>>
>>>>
>>>> On 14.06.2018 09:52, Thomas Munro wrote:
>>>>> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>>>>>  wrote:
>>>>>> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
>>>>>> references in WAL records
>>>>>> using posix_fadvise(WILLNEED) system call.
>>>>> Hi Konstantin,
>>>>>
>>>>> Why stop at the page cache...  what about shared buffers?
>>>>>
>>>>
>>>> It is good question. I thought a lot about prefetching directly to shared
>>>> buffers.
>>>
>>> I think that's definitely how this should work.  I'm pretty strongly
>>> opposed to a prefetching implementation that doesn't read into s_b.
>>>
>>
>> Could you elaborate why prefetching into s_b is so much better (I'm sure it
>> has advantages, but I suppose prefetching into page cache would be much
>> easier to implement).
> 
> I think there's a number of issues with just issuing prefetch requests
> via fadvise etc:
> 
> - it leads to guaranteed double buffering, in a way that's just about
>   guaranteed to *never* be useful. Because we'd only prefetch whenever
>   there's an upcoming write, there's simply no benefit in the page
>   staying in the page cache - we'll write out the whole page back to the
>   OS.

How does reading directly into shared buffers substantially change the
behavior? The only difference is that we end up with the double
buffering after performing the write, which is expected to happen pretty
quickly after the read request.

> - reading from the page cache is far from free - so you add costs to the
>   replay process that it doesn't need to do.
> - you don't have any sort of completion notification, so you basically
>   just have to guess how far ahead you want to read. If you read a bit
>   too much you suddenly get into synchronous blocking land.
> - The OS page cache is actually not particularly scalable to large amounts
>   of data either. Nor are the decisions about what to keep cached likely
>   to be particularly useful.

The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).

> - We imo need to add support for direct IO before long, and adding more
>   and more work to reach feature parity strikes me as a bad move.
> 

IMHO it's unlikely to happen in PG12, but I might be over-estimating the
invasiveness and complexity of the direct I/O change, while this patch
seems pretty doable and the improvements are pretty significant.

My point was that I don't think this actually adds a significant amount
of work to the direct IO patch, as we already do prefetch for bitmap
heap scans. So this needs to be written anyway, and I'd expect those two
places to share most of the code. So where's the additional work?

I don't think we should reject patches just because they might add a bit
of work to some not-yet-written future patch ... (which I don't think is
the case here anyway).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Tomas Vondra




On 06/18/2018 05:06 PM, Joe Conway wrote:

On 06/18/2018 10:52 AM, Tom Lane wrote:

Robert Haas  writes:

On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway  wrote:

Not necessarily. Our pages probably have enough predictable bytes to aid
cryptanalysis, compared to user data in a column which might not be very
predictable.



Really?  I would guess that the amount of entropy in a page is WAY
higher than in an individual column value.


Depending on the specifics of the encryption scheme, having some
amount of known (or guessable) plaintext may allow breaking the
cipher, even if much of the plaintext is not known. This is
cryptology 101, really.


Exactly


At the same time, having to have a bunch of
independently-decipherable short field values is not real secure
either, especially if they're known to all be encrypted with the
same key. But what you know or can guess about the plaintext in
such cases would be target-specific, rather than an attack that
could be built once and used against any PG database.


Again, this is dependent on the specific solution for encryption. In some 
cases you might do something like generate a single use random key, 
encrypt the payload with that, encrypt the single use key with the 
"global" key, append the two results and store.




Yeah, I suppose we could even have per-page keys, for example.
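
For illustration, here is a minimal standalone sketch of the "single-use
key" scheme described above, using OpenSSL's EVP API with AES-256-CBC.
This is my own toy code, not from any patch or proposal - error checking
is omitted, and a real design would also have to deal with
authentication, per-page IVs, key rotation and so on:

  #include <stdio.h>
  #include <openssl/evp.h>
  #include <openssl/rand.h>

  /* encrypt inlen bytes with AES-256-CBC, return the ciphertext length */
  static int
  encrypt_cbc(const unsigned char *key, const unsigned char *iv,
              const unsigned char *in, int inlen, unsigned char *out)
  {
      EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
      int len, outlen;

      EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv);
      EVP_EncryptUpdate(ctx, out, &len, in, inlen);
      outlen = len;
      EVP_EncryptFinal_ex(ctx, out + len, &len);
      outlen += len;
      EVP_CIPHER_CTX_free(ctx);
      return outlen;
  }

  int
  main(void)
  {
      unsigned char kek[32];              /* "global" key, e.g. from a KMS */
      unsigned char dek[32];              /* single-use data encryption key */
      unsigned char iv1[16], iv2[16];
      const unsigned char payload[] = "some column or page contents";
      unsigned char enc_dek[32 + 16];
      unsigned char enc_payload[sizeof(payload) + 16];
      int dek_len, payload_len;

      RAND_bytes(kek, sizeof(kek));
      RAND_bytes(dek, sizeof(dek));
      RAND_bytes(iv1, sizeof(iv1));
      RAND_bytes(iv2, sizeof(iv2));

      /* encrypt the payload with the single-use key ... */
      payload_len = encrypt_cbc(dek, iv2, payload, sizeof(payload),
                                enc_payload);
      /* ... encrypt the single-use key with the "global" key ... */
      dek_len = encrypt_cbc(kek, iv1, dek, sizeof(dek), enc_dek);

      /* ... and store the two results (plus IVs) appended together */
      printf("stored size: %d bytes\n",
             (int) (sizeof(iv1) + sizeof(iv2)) + dek_len + payload_len);
      return 0;
  }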

One topic I haven't seen mentioned in this thread yet is indexes. That's 
a pretty significant side-channel, when built on encrypted columns. Even 
if the indexes are encrypted too, you can often deduce a lot of 
information from them.


So what's the plan here? Disallow indexes on encrypted columns? Index 
encrypted values directly? Something else?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-13 Thread Tomas Vondra

Hi,

On 05/25/2018 01:41 PM, Moon, Insung wrote:

Hello Hackers,

...

BTW, I want to support CBC mode encryption[3]. However, I'm not sure 
how to use the IV in CBC mode for this proposal. I'd like to hear

opinions from security engineers.



I'm not a cryptographer either, but this is exactly where you need a 
prior discussion about the threat models - there are a couple of 
chaining modes, each with different weaknesses.


FWIW it may also matter if data_checksums are enabled, because that may 
prevent malleability attacks affecting some of the modes. Assuming an active 
attacker (with the ability to modify the data files) is part of the 
threat model, of course.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-13 Thread Tomas Vondra

On 06/11/2018 11:22 AM, Masahiko Sawada wrote:

On Fri, May 25, 2018 at 8:41 PM, Moon, Insung
 wrote:

Hello Hackers,

This propose a way to develop "Table-level" Transparent Data 
Encryption (TDE) and Key Management Service (KMS) support in 
PostgreSQL.


...


As per discussion at PGCon unconference, I think that firstly we
need to discuss what threats we want to defend database data against.
If a user wants to defend against a threat where a malicious user who 
has logged in to the OS or the database steals important data from the 
database, this TDE design would not help, because such a user can steal 
the data by getting a memory dump or via SQL. That of course differs 
depending on system requirements or security compliance, but what 
threats do you want to defend database data against, and why?



I do agree with this - a description of the threat model needs to be 
part of the design discussion, otherwise it's not possible to compare it 
to alternative solutions (e.g. full-disk encryption using LUKS or using 
existing privilege controls and/or RLS).


TDE was proposed/discussed repeatedly in the past, and every time it 
died exactly because it was not very clear which issue it was attempting 
to solve.


Let me share some of the issues mentioned as possibly addressed by TDE 
(I'm not entirely sure TDE actually solves them, I'm just saying those 
were mentioned in previous discussions):


1) enterprise requirement - Companies want in-database encryption, for 
various reasons (because "enterprise solution" or something).


2) like FDE, but OS/filesystem independent - Same config on any OS and 
filesystem, which may make maintenance easier.


3) does not require special OS/filesystem setup - Does not require help 
from system administrators, setup of LUKS devices or whatever.


4) all filesystem access (basebackups/rsync) is encrypted anyway

5) solves key management (the main challenge with pgcrypto)

6) allows encrypting only some of the data (tables, columns) to minimize 
performance impact


IMHO it makes sense to have TDE even if it provides the same "security" 
as disk-level encryption, assuming it's more convenient to setup/use 
from the database.


Also, if I understand correctly, at unconference session there also 
were two suggestions about the design other than the suggestion by 
Alexander: implementing TDE at column level using POLICY, and 
implementing TDE at table-space level. The former was suggested by

Joe, but I'm not sure about the details of that suggestion; I'd love to
hear more about it. The latter was suggested by
Tsunakawa-san. Have you considered that?

You mentioned that encryption of temporary data for query processing 
and large objects are still under the consideration. But other than 
them you should consider the temporary data generated by other 
subsystems such as reorderbuffer and transition table as well.




The severity of those limitations is likely related to the threat model. 
I don't think encrypting temporary data would be a big problem, assuming 
you know which key to use.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: GROUP BY optimization

2018-06-16 Thread Tomas Vondra



On 06/13/2018 06:56 PM, Teodor Sigaev wrote:
>> I.e. we already do reorder the group clauses to match ORDER BY, to only
>> require a single sort. This happens in preprocess_groupclause(), which
>> also explains the reasoning behind that.
> Huh. I missed that. That means group_keys_reorder_by_pathkeys() call
> inside get_cheapest_group_keys_order() duplicates work of
> preprocess_groupclause().
>  And so it's possible to replace it with simply counting the number of
> matching pathkeys at the beginning of the lists. Or remove most of
> preprocess_groupclause() - but it seems to me we should use the first
> option; preprocess_groupclause() is simpler, it doesn't operate on
> pathkeys, only on SortGroupClause, which is simpler.
> 
> BTW, incremental sort path provides pathkeys_common(), exactly what we
> need.
> 
>> I wonder if some of the new code reordering group pathkeys could/should
>> be moved here (not sure, maybe it's too early for those decisions). In
>> any case, it might be appropriate to update some of the comments before
>> preprocess_groupclause() which claim we don't do certain things added by
>> the proposed patches.
> 
> preprocess_groupclause() is too early step to use something like
> group_keys_reorder_by_pathkeys() because pathkeys is not known here yet.
> 
> 
>> This probably also somewhat refutes my claim that the order of
>> grouping keys is currently fully determined by users (and so they
>> may pick the most efficient order), while the reorder-by-ndistinct
>> patch would make that impossible. Apparently when there's ORDER BY,
>> we already mess with the order of group clauses - there are ways to
>> get around it (subquery with OFFSET 0) but it's much less clear.
> 
> I like a moment when objections go away :)
> 

I'm afraid that's a misunderstanding - my objections did not really go
away. I'm just saying my claim that we're not messing with order of
grouping keys is not entirely accurate, because we do that in one
particular case.

I still think we need to be careful when introducing new optimizations
in this area - reordering the grouping keys by ndistinct, ORDER BY or
whatever. In particular I don't think we should commit these patches
that may quite easily cause regressions, and then hope some hypothetical
future patch fixes the costing. Those objections still stand.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra




On 06/11/2018 07:14 PM, Andres Freund wrote:

Hi,

On 2018-06-11 17:29:52 +0200, Tomas Vondra wrote:

It would be great to get something that performs better than just falling
back to sort (and I was advocating for that), but I'm worried we might be
moving the goalposts way too far.


I'm unclear on why that'd have that bad performance in relevant
cases. You're not going to hit the path unless the number of groups is
pretty large (or work_mem is ridiculously small, in which case we don't
care). With a large number of groups the sorting path isn't particularly
inefficient, because repeatedly storing the input values isn't such a
large fraction in comparison to the number of groups (and their
transition values).  Which scenarios are you concerned about?



Say you have a 1TB table, and keeping the groups in memory would require 
work_mem=2GB. After hitting the work_mem limit, there may be a pretty 
large amount of tuples you'd have to spill to disk and sort.


Say we hit the work_mem limit after processing 10% of the tuples; 
switching to sort would mean spilling and sorting 900GB of data. Or we 
might say - hmm, we're 10% through, so we expect to hit the limit 10x, so 
let's spill the hash table and then do a sort on that, writing and sorting 
only 10GB of data. (Or merging it in some hash-based way, per Robert's 
earlier message.)


I don't quite understand the argument that the number of groups needs to 
be pretty large for us to hit this. So what if the groups take 2x or 10x 
more than work_mem? It can still be cheaper than switching to the 
sort-based approach, no?



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra
On 06/11/2018 08:13 PM, Jeff Davis wrote:
> On Mon, 2018-06-11 at 19:33 +0200, Tomas Vondra wrote:
>> For example we hit the work_mem limit after processing 10% tuples,
>>  switching to sort would mean spill+sort of 900GB of data. Or we 
>> might say - hmm, we're 10% through, so we expect hitting the limit
>> 10x, so let's spill the hash table and then do sort on that,
>> writing and sorting only 10GB of data. (Or merging it in some
>> hash-based way, per Robert's earlier message.)
> 
> Your example depends on large groups and a high degree of group
> clustering. That's fine, but it's a special case,
> 

True, it's a special case and it won't work for other cases. It was
merely an example for Andres.

OTOH it's not entirely unrealistic, I think. Consider something like

  SELECT
    extract(year from ts) AS y,
    extract(month from ts) AS m,
    extract(day from ts) AS d,
    string_agg(x, ', '),
    array_agg(z)
  FROM fact_table
  GROUP BY y, m, d;

which is likely very correlated (assuming new data is appended to the
table), and the string_agg/array_agg aggregates are likely to produce
fairly large states (roughly proportional to the number of tuples in the
group).

Another example might be an HLL aggregate, although in that case the
transition state does not grow, so it may not be that bad (and the
default estimate of 1kB would work pretty nicely). But there certainly
are other aggregates with large transition state, where this might not
be the case, and we currently have no way to communicate that to the
planner - except for setting work_mem much lower :-/

However, I now realize I've ignored the fact that we typically don't
sort the whole table but only a very few columns, so the example was not
entirely fair - we would not sort the whole remaining 900GB but likely
much less.

> and complexity does have a cost, too.

Sure.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-19 Thread Tomas Vondra




On 06/19/2018 05:50 PM, Andres Freund wrote:

On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:

I do not think that prefetching into shared buffers requires much more effort
or makes the patch more invasive...
It even somewhat simplifies it, because there is no need to maintain our own
cache of prefetched pages...



But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...


These arguments seem bogus to me. Otherwise the startup process is going
to do that work.



Also there are two points which makes prefetching into shared buffers more
complex:
1. Need to spawn multiple workers to make prefetch in parallel and somehow
distribute work between them.


I'm not even convinced that's true. It doesn't seem insane to have a
queue of, say, 128 requests that are done with posix_fadvise WILLNEED,
where the oldest requests is read into shared buffers by the
prefetcher. And then discarded from the page cache with WONTNEED.  I
think we're going to want a queue that's sorted in the prefetch process
anyway, because there's a high likelihood that we'll otherwise issue
prefetch requests for the same pages over and over again.

That gets rid of most of the disadvantages: We have backpressure
(because the read into shared buffers will block if not yet ready),
we'll prevent double buffering, we'll prevent the startup process from
doing the victim buffer search.



I'm confused. I thought you wanted to prefetch directly into shared 
buffers, so that it also works with direct I/O in the future. But now 
you suggest using posix_fadvise() to work around the synchronous buffer 
read limitation. I don't follow ...
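
FWIW, this is how I parse the scheme - a toy standalone sketch (entirely
mine, not from any patch), with pread() into a local buffer standing in
for the read into shared buffers, and POSIX_FADV_DONTNEED standing in for
the "WONTNEED" above:

  #define _POSIX_C_SOURCE 200112L

  #include <fcntl.h>
  #include <unistd.h>

  #define BLCKSZ      8192
  #define QUEUE_DEPTH 128         /* the "say, 128 requests" */

  typedef struct
  {
      int    fd;
      off_t  offset;
  } PrefetchRequest;

  static PrefetchRequest queue[QUEUE_DEPTH];
  static int  queue_head = 0;     /* index of the oldest request */
  static int  queue_used = 0;
  static char buffer[BLCKSZ];     /* stand-in for a shared buffer */

  /* read the oldest request into the "shared buffer", drop the OS copy */
  static void
  consume_oldest(void)
  {
      PrefetchRequest *req = &queue[queue_head];

      (void) pread(req->fd, buffer, BLCKSZ, req->offset);
      (void) posix_fadvise(req->fd, req->offset, BLCKSZ,
                           POSIX_FADV_DONTNEED);

      queue_head = (queue_head + 1) % QUEUE_DEPTH;
      queue_used--;
  }

  /* issue the hint right away; a full queue provides the back-pressure */
  static void
  enqueue_prefetch(int fd, off_t offset)
  {
      if (queue_used == QUEUE_DEPTH)
          consume_oldest();

      (void) posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);

      queue[(queue_head + queue_used) % QUEUE_DEPTH] =
          (PrefetchRequest) { fd, offset };
      queue_used++;
  }

  int
  main(void)
  {
      /* made-up path, stands in for a WAL segment with block references */
      int fd = open("/tmp/walsegment", O_RDONLY);

      if (fd < 0)
          return 1;

      for (off_t blkno = 0; blkno < 1000; blkno++)
          enqueue_prefetch(fd, blkno * BLCKSZ);

      while (queue_used > 0)
          consume_oldest();

      close(fd);
      return 0;
  }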


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-19 Thread Tomas Vondra
ber of installations,
that set the timeout high, just so they can avoid FPWs.
Maybe, but I am not so sure. This is why I will try to investigate it 
more.




I'd say checkpoints already act as such a timeout (not only that, but people 
are setting the timeout high to get rid of FPIs).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra




On 06/11/2018 05:16 PM, Robert Haas wrote:

On Wed, Jun 6, 2018 at 8:16 PM, Tomas Vondra
 wrote:

... and this is pretty much what Jeff Davis suggested, I think. The
trouble is, each of those cases behaves nicely/terribly in different
corner cases.


That's a really good point.  If the number of groups is pretty small
compared to the number of input tuples, then you really only ever want
to dump out transition values.  By doing so, you minimize the amount
of data you have to write.  But if the number of groups is, say, half
the number of input tuples, then computing transition values before
you have all the values that belong to that group is probably a waste
of time.  I wonder if we could decide what to do by comparing the
number of input tuples consumed to the number of groups created at the
time we run out of memory.  If we've got less than, I dunno, five
tuples per group, then assume most groups are small.  Pick a strategy
that (mainly) spools input tuples.  Otherwise, pick a strategy that
spools transition tuples.

In either case, it seems like we can pick a pure hashing strategy or
switch to using both hashing and sorting.  For example, IIUC, Andres's
proposal involves spooling mostly input tuples, but can also spool
transition tuples if the transition values consume more memory as they
absorb more tuples.  However, you could do a similar kind of thing
without needing sort support.  When you see a value that doesn't
fit in your in-memory hash table, use the hash code to route it to 1
of N batch files.  Have a second set of batch files for transition
tuples in case you need to kick things out of the in-memory hash
table.  Once you finish reading the input, emit all the values that
remain in the in-memory hash table and then process each batch file
separately.

Similarly, David's strategy involves spooling only transition tuples
and then sorting on the group key, but it's again possible to do
something similar without relying on sorting.  Instead of flushing the
in-memory hash table to a tuple store, split the transition tuples it
contains among N batch files based on the hash code.  Once you've read
all of the input, go back and reprocess each batch file, combining
transition values whenever the same group keys appear in more than one
transition tuple.



Yeah, essentially something like the batching in hash joins.


To me, the pure-hashing strategies look a little simpler, but maybe
there's enough performance benefit from combining hashing and sorting
that it's worth the complexity, or maybe we should just accept
whichever variant somebody's willing to code.  But I think we almost
have to have separate handling for many-row-per-group and
few-rows-per-group, because those seem fundamentally different.



I think the underlying question is whether we want to treat this as an 
emergency safety against OOM (which should be a rare thing in most 
cases) or something allowing us to pick hash aggregate more often.


It would be great to get something that performs better than just 
falling back to sort (and I was advocating for that), but I'm worried we 
might be moving the goalposts way too far.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra

On 06/11/2018 04:25 PM, Robert Haas wrote:



> ...


Maybe that's not exactly what Tomas (or you) meant by eviction
strategy, though.  Not sure.  But it does seem to me that we need to
pick the algorithm we're going to use, more or less, in order to
decide what infrastructure we need, and at least to some extent, we
ought to let our choice of algorithm be informed by the desire not to
need too much infrastructure.



I was using eviction strategy in a somewhat narrower sense - pretty much 
just "Which groups to evict from the hash table?" but you're right the 
question is more general and depends on which scheme we end up using 
(just hashagg, hash+sort, something else ...)


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: GROUP BY optimization

2018-06-03 Thread Tomas Vondra
 - either by generating multiple 
paths with different orderings or some heuristics.


I'm also wondering how freely we can change the group by result 
ordering. Assume you disable hash aggregate and parallel query - 
currently we are guaranteed to use a group aggregate that produces exactly 
the ordering specified in GROUP BY, but this patch removes that 
"guarantee" and we can produce an arbitrary permutation of the ordering. 
But as I argued in other threads, such implicit guarantees are really 
fragile, can silently break for arbitrary reasons (say, parallel query 
will do just that) and can be easily fixed by adding a proper ORDER BY. 
So I don't think this is an issue.


The other random thought is how will/could this interact with the 
incremental sorts patch. I know Alexander wanted to significantly limit 
where we actually consider the incremental sort, but I'm not sure if 
this was one of those places or not (it sure looks like a place where we 
would greatly benefit from it).


So to summarize this initial review - I do suggest splitting the patch 
into two parts. One that does the index magic, and one for this 
reordering optimization. The first part (additional indexes) seems fairly 
safe and likely to become committable soon. The other part (ndistinct 
reordering) IMHO requires more thought regarding costing and interaction 
with other query parts.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Improve geometric types

2018-06-03 Thread Tomas Vondra

On 06/03/2018 11:50 PM, Tom Lane wrote:

Tomas Vondra  writes:

The main remaining question I have is what to do with back-branches.
Shall we back-patch this or not?


Given the behavioral changes involved, I'd say "no way".  That's
reinforced by the lack of field complaints; if there were lots of
complaints, maybe we'd be willing to break backwards compatibility,
but ...



Fair enough, I tend to over-estimate the importance of bugfixes and 
under-estimate the breakage due to behavior changes. But if we don't want 
to back-patch this, I'm fine with that. I was a bit worried about making 
future backpatches more painful, but this code received only ~20 commits 
over the past few years, half of that due to pgindent, so that seems to be 
a non-issue.


But now I'm wondering what this means for existing indexes. Doesn't 
this effectively mean those are unlikely to give meaningful responses 
(in either the old or the new semantics)?



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Improve geometric types

2018-06-03 Thread Tomas Vondra

Hi Emre,

Thanks for the rebased patch. I remember reviewing it in the last 
CF, and it seems to be in pretty good shape. I plan to look at it again in 
the next commitfest, but it has been reviewed by other 
experienced people, so I'm not worried about this part.


The main remaining question I have is what to do with back-branches. 
Shall we back-patch this or not?


The trouble is that while the patch is essentially a bugfix, it 
refactors quite a significant amount of code to make the fixes practical. 
If it was possible to back-patch just the fixes without the refactoring, 
that would be ideal, but unfortunately that's not the case. Based on 
discussion with Emre in Ottawa that would be rather impractical due to 
the nature of the bugs and low code reuse.


I do believe we should back-patch - after all, it fixes real bugs. It's 
true the bugs were there for years and no one noticed/reported them, but 
it's still buggy and that's not fun.


Opinions?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra



On 06/05/2018 09:22 AM, David Rowley wrote:

On 5 June 2018 at 17:04, Tomas Vondra  wrote:

On 06/05/2018 04:56 AM, David Rowley wrote:

Isn't there still a problem determining when the memory exhaustion
actually happens though?   As far as I know, we've still little
knowledge how much memory each aggregate state occupies.

Jeff tried to solve this in [1], but from what I remember, there was
too much concern about the overhead of the additional accounting code.

[1] 
https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com



I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested
a couple of people who were originally worried about the overhead seem
to be accepting it now.


Is there any great need to make everything pay the small price for
this? Couldn't we just create a new MemoryContextMethod that
duplicates aset.c, but has the required additional accounting built in
at the implementation level, then make execGrouping.c use that
allocator for its hash table? The code would not really need to be
duplicated, we could just do things the same way as like.c does with
like_match.c and include a .c file. We'd need another interface
function in MemoryContextMethods to support getting the total memory
allocated in the context, but that does not seem like a problem.



There probably is not a need, but there was not a great way to enable it 
explicitly only for some contexts. IIRC what was originally considered 
back in 2015 was some sort of a flag in the memory context, and the 
overhead was about the same as with the int64 arithmetic alone.


But I don't think we've considered copying the whole AllocSet. Maybe 
that would work, not sure. I wonder if an aggregate might use a custom 
context internally (I don't recall anything like that). The accounting 
capability seems potentially useful for other places, and those might 
not use AllocSet (or at least not directly).


All of this of course assumes the overhead is still there. I sure will 
do some new measurements.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-04 Thread Tomas Vondra
On 06/05/2018 04:56 AM, David Rowley wrote:
> On 5 June 2018 at 06:52, Andres Freund  wrote:
>> That part has gotten a bit easier since, because we have serialize /
>> deserialize operations for aggregates these days.
> 
> True. Although not all built in aggregates have those defined.

Not sure what to do about those, unfortunately. We could stop adding
more groups and hope the aggregate state does not grow further, but
that's about it.

> 
>> I wonder whether, at least for aggregates, the better fix wouldn't be to
>> switch to feeding the tuples into tuplesort upon memory exhaustion and
>> doing a sort based aggregate.  We have most of the infrastructure to do
>> that due to grouping sets. It's just the pre-existing in-memory tuples
>> that'd be problematic, in that the current transition values would need
>> to serialized as well.  But with a stable sort that'd not be
>> particularly problematic, and that could easily be achieved.

That's one of the possible solutions, yes. It's hard to say if it's
better or worse than the other approaches, because that depends on the
number of tuples vs. number of groups etc.

Evicting some (or all) of the in-memory groups can be easily better or
worse, depending on the data set. And I'm not sure the code complexity
would be significantly different.

I expect the eviction strategy to be the primary design challenge of
this patch. The other bits will be mostly determined by this one piece.

While the primary goal of the patch is addressing the OOM risks in hash
aggregate, I think it'd be a mistake to see it just that way. I see it
could allow us to perform hash aggregate more often, even if we know the
groups won't fit into work_mem. If we could estimate how much of the
aggregate state we'll have to spill to disk, we could still prefer
hashagg over groupagg. We would pay the price for eviction, but on large
data sets that can be massively cheaper than having to do sort.

I admit this is a bit hand-wavy, and the costing depends on us being
able to estimate the amount of evicted data. I certainly don't expect to
get this right away (that would move the goal posts for the patch very
far away), but it would be unfortunate to make this improvement
impossible/more difficult in the future.

> 
> Isn't there still a problem determining when the memory exhaustion
> actually happens though?   As far as I know, we've still little
> knowledge how much memory each aggregate state occupies.
> 
> Jeff tried to solve this in [1], but from what I remember, there was
> too much concern about the overhead of the additional accounting code.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com
> 

I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested
a couple of people who were originally worried about the overhead seem
to be accepting it now.

IMHO we do want a memory-bound hash aggregate, and doing some sort of
memory accounting is a must-have part of that. I don't see a way around
this, really. We need to minimize the overhead, of course.

I do not recall the exact approach we ended up with in 2015, though, or
what the measurements with that version were. I see 1-3% mentioned early
in the thread, and there were some additional improvements in subsequent
patch versions, I think.

I don't think we can realistically improve on this (accounting at the block
level), and there was a discussion about whether this is actually overhead
or merely due to a different binary layout.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra

On 06/05/2018 07:46 AM, Jeff Davis wrote:

On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:

I expect the eviction strategy to be the primary design challenge of
this patch. The other bits will be mostly determined by this one
piece.


Not sure I agree that this is the primary challenge.

The cases that benefit from eviction are probably a minority. I see two
categories that would benefit:

   * Natural clustering in the heap. This sounds fairly common, but a
lot of the cases that come to mind are too low-cardinality to be
compelling; e.g. timestamps grouped by hour/day/month. If someone has
run into a high-cardinality natural grouping case, let me know.
   * ARRAY_AGG (or similar): individual state values can be large enough
that we need to evict to avoid exceeding work_mem even if not adding
any new groups.

In either case, it seems like a fairly simple eviction strategy would
work. For instance, we could just evict the entire hash table if
work_mem is exceeded or if the hit rate on the hash table falls below a
certain threshold. If there was really something important that should
have stayed in the hash table, it will go back in soon anyway.

So why should eviction be a major driver for the entire design? I agree
it should be an area of improvement for the future, so let me know if
you see a major problem, but I haven't been as focused on eviction.



My concern is more about what happens when the input tuple ordering is 
inherently incompatible with the eviction strategy, greatly increasing 
the amount of data written to disk during evictions.


Say for example that we can fit 1000 groups into work_mem, and that 
there are 2000 groups in the input data set. If the input is correlated 
with the groups, everything is peachy because we'll evict the first 
batch, and then group the remaining 1000 groups (or something like 
that). But if the input data is random (which can easily happen, e.g. 
for IP addresses, UUIDs and such) we'll hit the limit repeatedly and 
will evict much sooner.


I know you suggested simply dumping the whole hash table and starting 
from scratch while we talked about this at pgcon, but ISTM it has 
exactly this issue.


But I don't know if there actually is a better option - maybe we simply 
have to accept this problem. After all, running slowly is still better 
than OOM (which may or may not happen now).


I wonder if we can somehow detect this at run-time and maybe fall back 
to groupagg. E.g. we could compare the number of groups vs. the number of 
input tuples when we first hit the limit. It's a rough heuristic, but 
maybe sufficient.
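
Something as simple as this is what I have in mind - a toy sketch, with a
completely made-up 0.5 threshold:

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Decide, at the moment the hash table first exceeds work_mem, whether
   * to keep hashing (and evict groups) or to fall back to a sort-based
   * aggregate. If almost every input tuple started a new group, evicting
   * transition values buys us very little.
   */
  static bool
  prefer_sort_fallback(double ngroups, double ntuples_consumed)
  {
      return (ngroups / ntuples_consumed) > 0.5;
  }

  int
  main(void)
  {
      /* limit hit after 1M tuples: random-ish input vs. clustered input */
      printf("%d %d\n",
             prefer_sort_fallback(900000.0, 1000000.0),   /* 1 = fall back */
             prefer_sort_fallback(50000.0, 1000000.0));   /* 0 = keep hashing */
      return 0;
  }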



While the primary goal of the patch is addressing the OOM risks in
hash
aggregate, I think it'd be a mistake to see it just that way. I see
it
could allow us to perform hash aggregate more often, even if we know
the
groups won't fit into work_mem. If we could estimate how much of the
aggregate state we'll have to spill to disk, we could still prefer
hashagg over groupagg. We would pay the price for eviction, but on
large
data sets that can be massively cheaper than having to do sort.


Agreed. I ran some tests of my patch in the last round, and they
strongly supported choosing HashAgg a lot more often. A lot of sort
improvements have been made though, so I really need to run some new
numbers.



Yeah, Peter made the sort stuff a lot faster. But I think there are still 
a lot of cases where the hashagg would greatly reduce the amount of data 
that needs to be written to disk, and that's not something the sort 
improvements could improve. If you need to aggregate a 1TB table, it's 
still going to be ~1TB of data you need to write to disk during on-disk 
sort. But if there is only a reasonable number of groups, that greatly 
simplifies things.



far away), but it would be unfortunate to make this improvement
impossible/more difficult in the future.


If you see anything that would make this difficult in the future, let
me know.



Sure. Sorry for being too hand-wavy in this thread, and possibly moving 
the goal posts further away :-/ I got excited by you planning to work on 
this again and perhaps a bit carried away ;-)



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra

On 06/05/2018 02:49 PM, Andres Freund wrote:

Hi,

On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote:

My concern is more about what happens when the input tuple ordering is
inherently incompatible with the eviction strategy, greatly increasing the
amount of data written to disk during evictions.

Say for example that we can fit 1000 groups into work_mem, and that there
are 2000 groups in the input data set. If the input is correlated with the
groups, everything is peachy because we'll evict the first batch, and then
group the remaining 1000 groups (or something like that). But if the input
data is random (which can easily happen, e.g. for IP addresses, UUIDs and
such) we'll hit the limit repeatedly and will evict much sooner.



I know you suggested simply dumping the whole hash table and starting from
scratch while we talked about this at pgcon, but ISTM it has exactly this
issue.


Yea, that's the case I was thinking of where going to sorting will very
likely have better performance.

I think it'd even be sensible to have a "skew tuple" like
optimization. When detecting getting closer to memory exhaustion, move
new groups to the tuplesort, but already hashed tuples stay in the
hashtable.  That'd still need tuples being moved to the sort in the
cases where the transition values get too big (say array_agg), but that
should be comparatively rare.  I'm sure we could do better in selecting
the hash-tabled values than just taking the first incoming ones, but
that shouldn't be too bad.



Not sure. I'd imagine the "compression" due to aggregating many tuples 
into a much smaller amount of memory would be a clear win, so I don't find 
the "let's just dump all input tuples into a file" approach very attractive.


I think evicting a fraction of the aggregate state (say, ~25%) would 
work better - choosing the "oldest" items seems OK, as those likely 
represent many input tuples (thus having a high "compaction" ratio). Or 
we could evict items representing rare groups, to make space for the 
common ones. Perhaps a combined eviction strategy (10% of each type) 
would work well. I'm conveniently ignoring the fact that we don't have 
information to determine this, at the moment, of course.


I'm sure it's possible to make better decisions based on some cost 
estimates, both at plan time and then during execution.


That being said, I don't want to over-think / over-engineer this. 
Getting something that reduces the risk of OOM in the first step is a 
good enough improvement. If we can do something better next, great.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra

On 06/05/2018 03:41 PM, se...@rielau.com wrote:
In our code base we have added a WithStats-Flavor for creating memory 
contexts.
This api accepts a pointer to metric for accounting and it is inherited 
by all subcontexts unless overridden.
So we only needed to change the context creation API where we wanted (such 
as TopTransactionContext, MessageContext, ...)

That's quite trivial, actually.


I think that's pretty much what we tried when this patch was first 
discussed in 2015, but it added measurable overhead (1-3%) even when the 
accounting was disabled. Which is not quite desirable, of course.
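
For illustration, this is how I understand the scheme - a standalone toy,
nothing from the actual Salesforce code or from the 2015 patch: the
context carries a pointer to an accounting counter, children inherit it
unless they override it, and the allocation path bumps it.

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct ToyContext
  {
      struct ToyContext *parent;
      int64_t           *accounting;  /* NULL = no accounting requested */
  } ToyContext;

  static ToyContext *
  toy_context_create(ToyContext *parent, int64_t *accounting)
  {
      ToyContext *cxt = malloc(sizeof(ToyContext));

      cxt->parent = parent;
      /* inherit the parent's metric unless explicitly overridden */
      cxt->accounting = accounting ? accounting :
          (parent ? parent->accounting : NULL);
      return cxt;
  }

  static void *
  toy_alloc(ToyContext *cxt, size_t size)
  {
      if (cxt->accounting)
          *cxt->accounting += size;   /* the extra int64 arithmetic */
      return malloc(size);
  }

  int
  main(void)
  {
      int64_t     agg_mem = 0;
      ToyContext *agg = toy_context_create(NULL, &agg_mem);
      ToyContext *per_group = toy_context_create(agg, NULL);

      toy_alloc(agg, 100);
      toy_alloc(per_group, 50);       /* subcontext updates the same metric */

      printf("accounted: %lld bytes\n", (long long) agg_mem);  /* 150 */
      return 0;
  }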


Also we have fixed all those missing hash spills - albeit based on the 
9.6 hash table design I think.


I don't think this part of the code changed all that much.


If there is interest by the community we are very willing to share.


Absolutely! I think it'd be great to share both the code and the 
reasoning behind the design.



Cheers
Serge
Salesforce


cheers

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra

On 06/05/2018 03:17 PM, David Rowley wrote:

On 6 June 2018 at 01:09, Andres Freund  wrote:

On 2018-06-06 01:06:39 +1200, David Rowley wrote:

My concern is that only accounting memory for the group and not the
state is only solving half the problem. It might be fine for
aggregates that don't stray far from their aggtransspace, but for the
other ones, we could still see OOM.



If solving the problem completely is too hard, then a half fix (maybe
3/4) is better than nothing, but if we can get a design for a full fix
before too much work is done, then isn't that better?


I don't think we actually disagree.  I was really primarily talking
about the case where we can't really do better because we don't have
serialization support.  I mean we could just rescan from scratch, using
a groupagg, but that obviously sucks.


I don't think we do. To take yours to the 100% solution might just
take adding the memory accounting to palloc that Jeff proposed a few
years ago and use that accounting to decide when we should switch
method.

However, I don't quite fully recall how the patch accounted for
memory consumed by sub-contexts and if getting the entire
consumption required recursively looking at subcontexts. If that's
the case then checking the consumption would likely cost too much if
it was done after each tuple was aggregated.


Yeah, a simple wrapper would not really fix the issue, because 
allocating memory in a subcontext would not update the accounting 
information. So we'd be oblivious to possibly large amounts of memory. I 
don't quite see how this is related to (not) having serialization 
support, as it's more about not knowing we already hit the memory limit.



That being said, I don't think array_agg actually has the issue, at 
least since b419865a814abbca12bdd6eef6a3d5ed67f432e1 (it does behave 
differently in aggregate and non-aggregate contexts, IIRC).


I don't know how common this issue is, though. I don't think any 
built-in aggregates create additional contexts, but I may be wrong. But 
we have this damn extensibility thing, where users can write their own 
aggregates ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-07 Thread Tomas Vondra

On 06/07/2018 02:18 AM, Andres Freund wrote:

On 2018-06-06 17:17:52 -0700, Andres Freund wrote:

On 2018-06-07 12:11:37 +1200, David Rowley wrote:

On 7 June 2018 at 08:11, Tomas Vondra  wrote:

On 06/06/2018 04:11 PM, Andres Freund wrote:

Consider e.g. a scheme where we'd switch from hashed aggregation to
sorted aggregation due to memory limits, but already have a number of
transition values in the hash table. Whenever the size of the transition
values in the hashtable exceeds memory size, we write one of them to the
tuplesort (with serialized transition value). From then on further input
rows for that group would only be written to the tuplesort, as the group
isn't present in the hashtable anymore.



Ah, so you're suggesting that during the second pass we'd deserialize
the transition value and then add the tuples to it, instead of building
a new transition value. Got it.


Having to deserialize every time we add a new tuple sounds terrible
from a performance point of view.


I didn't mean that we do that, and I don't think David understood it as
that either. I was talking about the approach where the second pass is a
sort rather than hash based aggregation.  Then we would *not* need to
deserialize more than exactly once.


s/David/Tomas/, obviously. Sorry, it's been a long day.



Solution is simple: drink more coffee. ;-)

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Tomas Vondra
On 06/06/2018 04:11 PM, Andres Freund wrote:
> On 2018-06-06 16:06:18 +0200, Tomas Vondra wrote:
>> On 06/06/2018 04:01 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote:
>>>> The other issue is that serialize/deserialize is only a part of a problem -
>>>> you also need to know how to do "combine", and not all aggregates can do
>>>> that ... (certainly not in universal way).
>>>
>>> There are several schemes where only serialize/deserialize are needed,
>>> no?  There are a number of fairly sensible schemes where there won't be
>>> multiple transition values for the same group, no?
>>>
>>
>> Possibly, not sure what schemes you have in mind exactly ...
>>
>> But if you know there's only a single transition value, why would you need
>> serialize/deserialize at all. Why couldn't you just finalize the value and
>> serialize that?
> 
> Because you don't necessarily have all the necessary input rows
> yet.
> 
> Consider e.g. a scheme where we'd switch from hashed aggregation to
> sorted aggregation due to memory limits, but already have a number of
> transition values in the hash table. Whenever the size of the transition
> values in the hashtable exceeds memory size, we write one of them to the
> tuplesort (with serialized transition value). From then on further input
> rows for that group would only be written to the tuplesort, as the group
> isn't present in the hashtable anymore.
> 

Ah, so you're suggesting that during the second pass we'd deserialize
the transition value and then add the tuples to it, instead of building
a new transition value. Got it.

That being said, I'm not sure if such generic serialize/deserialize can
work, but I'd guess no, otherwise we'd probably use it when implementing
the parallel aggregate.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: GROUP BY optimization

2018-06-06 Thread Tomas Vondra
oice.

Isn't "estimation of cost of comparing function/number of unique values
in column could be not very accurate and so planner could make a wrong
choice" is more an argument against relying on it when doing these
optimizations?

FWIW it's one of the arguments Tom made in the incremental sort patch,
which relies on it too when computing cost of the incremental sort. I'm
sure it's going to be an obstacle there too.

> I saw 2 times difference in real-world application. Again, improving
> sort cost estimation is a separate task.
Sure. But we also need to ask the other question, i.e. how many people
would be negatively affected by the optimization. And I admit I don't
know the answer to that - the next example is entirely made up.

>>
>> Imagine you have a custom data type that is expensive for comparisons.
>> You know that, so you place it at the end of GROUP BY clauses, to
>> reduce the number of comparisons on that field. And then we come along
>> and just reorder the columns, placing it first, because it happens to
>> have a high ndistinct statistic. And there's no way to get the
>> original behavior :-(
> Hm. that it could be, but I don't know how to improve here.  Current
> cost_sort() will return the same cost for any columns order.
> 
> Okay, here we know an estimation of nrow, we could somehow find a
> comparing function oid and find a its procost field. And then? we also
> need to know width of tuple here and mostly repeat the cost_sort.
> 
> Another option is an introducing enable_groupby_reorder GUC variable
> although personally I don't like an idea to add new GUC variable.
> 

I don't like the idea of yet another GUC either, as it's a rather blunt
instrument. Which is why I'm suggesting to split the patch into two
parts. Then we can apply the index stuff (which seems relatively
straightforward) and keep working on this second part.

FWIW I think it would be useful to have a "development GUC" that would
allow us to enable/disable these options during development, because
that makes experiments much easier. But then remove them before commit.

>>> Agree, but I don't know how to use it here. Except, may be:
>>> 1) first column - the column with bigger estimated number of groups
>>> 2) second column - the pair of (first, second) with bigger estimated
>>> number of groups
>>> 3) third column - the triple of (first, second, third) with bigger ...
>>>
>>> But seems even with that algorithm, ISTM, it could be implemented in
>>> cheaper manner.
>>>
>>
>> Maybe. I do have some ideas, although I haven't tried implementing it.
> Implemented, pls, have a look.
> 
>>
>> If there's no extended statistic on the columns, you can do the
>> current thing (assuming independence etc.). There's not much we can do
>> here.
> Agree
> 
>>
>> If there's an extended statistic, you can do either a greedy search
>> (get the next column with the highest ndistinct coefficient) or
>> exhaustive search (computing the estimated number of comparisons).
>>
>> Another challenge is that using only the ndistinct coefficient assumes
>> uniform distribution of the values. But you can have a column with 1M
>> distinct values, where a single value represents 99% of the rows. And
>> another column with 100k distinct values, with actual uniform
>> distribution. I'm pretty sure it'd be more efficient to place the 100k
>> column first.
> 
> Interesting.  Will think, thank you
> 

You're welcome.
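
For the record, the simple greedy variant (no extended statistics, and
ignoring the skew problem described above) is something like this - a toy
sketch with made-up structs:

  #include <stdio.h>
  #include <stdlib.h>

  typedef struct
  {
      const char *name;
      double      ndistinct;
  } GroupByCol;

  /* sort by ndistinct, descending */
  static int
  cmp_ndistinct_desc(const void *a, const void *b)
  {
      double da = ((const GroupByCol *) a)->ndistinct;
      double db = ((const GroupByCol *) b)->ndistinct;

      return (da < db) - (da > db);
  }

  int
  main(void)
  {
      GroupByCol cols[] = {
          { "b", 100.0 },
          { "a", 100000.0 },
          { "c", 10.0 },
      };

      /* put the column expected to resolve most comparisons first */
      qsort(cols, 3, sizeof(GroupByCol), cmp_ndistinct_desc);

      for (int i = 0; i < 3; i++)
          printf("%s ", cols[i].name);    /* prints: a b c */
      printf("\n");
      return 0;
  }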

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: GROUP BY optimization

2018-06-06 Thread Tomas Vondra
On 06/06/2018 11:22 PM, Claudio Freire wrote:
> On Wed, Jun 6, 2018 at 5:43 PM Tomas Vondra
>  wrote:
>>
>>>>>> For example, it seems to disregard that different data types have
>>>>>> different comparison costs. For example comparing bytea will be far
>>>>>> more expensive compared to int4, so it may be much more efficient to
>>>>>> compare int4 columns first, even if there are far fewer distinct
>>>>>> values in them.
>>>>> as I can see cost_sort() doesn't pay attention to this details. And
>>>>> it should be a separated patch to improve that.
>>>>>
>>>>
>>>> But sort also does not reorder columns.
>>> Yes. But estimation of cost of comparing function/number of unique
>>> values in column could be not very accurate and so planner could make
>>> a wrong choice.
> 
> ...
>>
>>>> Imagine you have a custom data type that is expensive for comparisons.
>>>> You know that, so you place it at the end of GROUP BY clauses, to
>>>> reduce the number of comparisons on that field. And then we come along
>>>> and just reorder the columns, placing it first, because it happens to
>>>> have a high ndistinct statistic. And there's no way to get the
>>>> original behavior :-(
>>> Hm. that it could be, but I don't know how to improve here.  Current
>>> cost_sort() will return the same cost for any columns order.
>>>
>>> Okay, here we know an estimation of nrow, we could somehow find a
>>> comparing function oid and find a its procost field. And then? we also
>>> need to know width of tuple here and mostly repeat the cost_sort.
>>>
>>> Another option is an introducing enable_groupby_reorder GUC variable
>>> although personally I don't like an idea to add new GUC variable.
>>>
>>
>> I don't like the idea of yet another GUC either, as it's rather blunt
>> instrument. Which is why I'm suggesting to split the patch into two
>> parts. Then we can apply the index stuff (which seems relatively
>> straightforward) and keep working on this second part.
>>
>> FWIW I think it would be useful to have "development GUC" that would
>> allow us to enable/disable these options during development, because
>> that makes experiments much easier. But then remove them before commit.
> 
> Correct me if I'm wrong, but doesn't this patch concern itself with 
> precisely sort performance?
> 

Yes, the second part (reordering pathkeys by ndistinct) does.

> As such, estimating sort performance correctly in the various plan
> variants being considered seems to be a very central aspect of it.
> 
> This means it's pretty much time to consider the effect of operator
> cost *and* distinct values in the cost computation.
> 

Yes, until now that was not really needed because the optimizer does not
consider reordering of the pathkeys - it simply grabs either GROUP BY or
ORDER BY pathkeys and runs with that.

So the costing was fairly trivial, we simply do something like

comparison_cost = 2.0 * cpu_operator_cost;

sort_cost = comparison_cost * tuples * LOG2(tuples);

which essentially ignores that there might be multiple columns, or that
the columns may have sort operators with different costs.

The question is how reliable the heuristics can be. The current patch
uses just plain ndistinct, which seems rather unreliable, and I don't
have a clear idea how to improve that - we may have MCVs for the columns
and perhaps some extended statistics, but I'm not sure how far we can
run with that.

Essentially what we need is to estimate the number of comparisons for each
column, to compute a better comparison_cost.
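
Very roughly, I'm thinking about something like this back-of-the-envelope
formula (my own sketch, not the patch): column k only gets compared when
all the preceding columns tie, which for roughly uniform data happens for
about 1/ndistinct of the comparisons.

  #include <math.h>
  #include <stdio.h>

  typedef struct
  {
      double procost;     /* cost of one call of the comparison function */
      double ndistinct;
  } SortColumn;

  static double
  estimate_sort_cost(double tuples, const SortColumn *cols, int ncols,
                     double cpu_operator_cost)
  {
      double ncomparisons = tuples * log2(tuples);
      double tie_fraction = 1.0;      /* fraction reaching this column */
      double per_comparison = 0.0;

      for (int i = 0; i < ncols; i++)
      {
          per_comparison += tie_fraction * cols[i].procost * cpu_operator_cost;
          tie_fraction /= cols[i].ndistinct;
      }

      return ncomparisons * per_comparison;
  }

  int
  main(void)
  {
      /* cheap, high-ndistinct column vs. expensive, low-ndistinct column */
      SortColumn ab[] = { { 1.0, 100000.0 }, { 10.0, 10.0 } };
      SortColumn ba[] = { { 10.0, 10.0 }, { 1.0, 100000.0 } };

      printf("a,b: %f\n", estimate_sort_cost(1e6, ab, 2, 0.0025));
      printf("b,a: %f\n", estimate_sort_cost(1e6, ba, 2, 0.0025));
      return 0;
  }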

>
> Assuming cost_sort does its thing, it's just a matter of building the
> desired variants and picking the one with the smallest cost_sort. One
> can use heuristics to build a subset of all possible column orderings
> with a guiding principle that tries to maximize the likelihood of
> finding a better order than the one in the query (like sorting by
> ndistinct), but I'd suggest:
> 
> - Always considering the sort order verbatim as given in the SQL (ie:
> what the user requests)
> - Relying on cost_sort to distinguish among them, and pick a winner, and
> - When in doubt (if cost_sort doesn't produce a clear winner), use the
> order given by the user
> 

I don't see why not to generate all possible orderings (keeping only the
cheapest path for each pathkey variant) and let the optimizer sort it
out. If the user specified an explicit ORDER BY, we'll slap an extra
Sort node on top, increasing the cost.

> Comparison cost can be approximated proba

Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Tomas Vondra



On 06/07/2018 02:11 AM, David Rowley wrote:
> On 7 June 2018 at 08:11, Tomas Vondra  wrote:
>> On 06/06/2018 04:11 PM, Andres Freund wrote:
>>> Consider e.g. a scheme where we'd switch from hashed aggregation to
>>> sorted aggregation due to memory limits, but already have a number of
>>> transition values in the hash table. Whenever the size of the transition
>>> values in the hashtable exceeds memory size, we write one of them to the
>>> tuplesort (with serialized transition value). From then on further input
>>> rows for that group would only be written to the tuplesort, as the group
>>> isn't present in the hashtable anymore.
>>>
>>
>> Ah, so you're suggesting that during the second pass we'd deserialize
>> the transition value and then add the tuples to it, instead of building
>> a new transition value. Got it.
> 
> Having to deserialize every time we add a new tuple sounds terrible
> from a performance point of view.
> 

I don't think Andres was suggesting that. I think the scheme was to sort
the tuples and read all the tuples for a group at once (so we would
deserialize just once).

> Can't we just:
> 
> 1. HashAgg until the hash table reaches work_mem.
> 2. Spill the entire table to disk.
> 3. Destroy the table and create a new one.
> 4. If more tuples: goto 1
> 5. Merge sort and combine each dumped set of tuples.
> 

... and this is pretty much what Jeff Davis suggested, I think. The
trouble is, each of those approaches behaves nicely in some corner cases
and terribly in others.
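
For illustration, here is a toy, self-contained version of the scheme quoted
above - a linear-scan array stands in for the hash table, keys are integers
and the aggregate is COUNT(*). This is obviously not the executor code, just
the control flow:

#include <stdio.h>
#include <stdlib.h>

#define TABLE_LIMIT 4           /* stand-in for work_mem */

typedef struct { int key; long count; } Entry;

static Entry spill[1024];       /* stand-in for the tuplesort / disk */
static int  nspilled = 0;

static int
cmp_entry(const void *a, const void *b)
{
    /* fine for the small keys used here */
    return ((const Entry *) a)->key - ((const Entry *) b)->key;
}

int
main(void)
{
    int     input[] = {1, 7, 3, 1, 9, 3, 7, 7, 2, 1, 9, 4};
    int     ninput = sizeof(input) / sizeof(input[0]);
    Entry   table[TABLE_LIMIT];
    int     nused = 0;

    for (int i = 0; i < ninput; i++)
    {
        int     j;

        /* step 1: aggregate into the in-memory table while it fits */
        for (j = 0; j < nused; j++)
            if (table[j].key == input[i])
                break;

        if (j == nused)
        {
            if (nused == TABLE_LIMIT)
            {
                /* steps 2+3: spill the whole table, start with an empty one */
                for (int k = 0; k < nused; k++)
                    spill[nspilled++] = table[k];
                nused = 0;
                j = 0;
            }
            table[nused].key = input[i];
            table[nused].count = 0;
            nused++;
        }
        table[j].count++;
    }

    /* end of input: dump the remaining partial groups */
    for (int k = 0; k < nused; k++)
        spill[nspilled++] = table[k];

    /* step 5: sort the partial groups and combine equal keys */
    qsort(spill, nspilled, sizeof(Entry), cmp_entry);
    for (int i = 0; i < nspilled; i++)
    {
        long    total = spill[i].count;

        while (i + 1 < nspilled && spill[i + 1].key == spill[i].key)
            total += spill[++i].count;
        printf("key %d -> count %ld\n", spill[i].key, total);
    }
    return 0;
}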

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: GROUP BY optimization

2018-06-06 Thread Tomas Vondra


On 06/07/2018 12:18 AM, Claudio Freire wrote:
> On Wed, Jun 6, 2018 at 6:57 PM Tomas Vondra
>  wrote:
>>
>> On 06/06/2018 11:22 PM, Claudio Freire wrote:
>>> On Wed, Jun 6, 2018 at 5:43 PM Tomas Vondra
>>> As such, estimating sort performance correctly in the various plan
>>> variants being considered seems to be a very central aspect of it.
>>>
>>> This means it's pretty much time to consider the effect of operator
>>> cost *and* distinct values in the cost computation.
>>>
>>
>> Yes, until now that was not really needed because the optimizer does not
>> consider reordering of the pathkeys - it simply grabs either GROUP BY or
>> ORDER BY pathkeys and runs with that.
>>
>> So the costing was fairly trivial, we simply do something like
>>
>> comparison_cost = 2.0 * cpu_operator_cost;
>>
>> sort_cost = comparison_cost * tuples * LOG2(tuples);
>>
>> which essentially ignores that there might be multiple columns, or that
>> the columns may have sort operators with different costs.
>>
>> The question is how reliable the heuristics can be. The current patch
>> uses just plain ndistinct, which seems rather unreliable, and I don't
>> have a clear idea how to improve that - we may have MCV lists for the
>> columns and perhaps some extended statistics, but I'm not sure how far
>> we can run with that.
>>
>> Essentially what we need is to estimate the number of comparisons for
>> each column, to compute a better comparison_cost.
> 
> This could be approached by adjusting statistics by any relevant 
> restrictions applicable to the columns being grouped, as is done for 
> selectivity estimations.
> 

I don't follow - how is this related to restrictions? Can you elaborate?

> I'm not sure how far that would get us, though. It would be rather
> easy to lose sight of those restrictions if there are complex
> operations involved.
> 
>>> Assuming cost_sort does its thing, it's just a matter of building the
>>> desired variants and picking the one with the smallest cost_sort. One
>>> can use heuristics to build a subset of all possible column orderings
>>> with a guiding principle that tries to maximize the likelihook of
>>> finding a better order than the one in the query (like sorting by
>>> ndistinct), but I'd suggest:
>>>
>>> - Always considering the sort order verbatim as given in the SQL (ie:
>>> what the user requests)
>>> - Relying on cost_sort to distinguish among them, and pick a winner, and
>>> - When in doubt (if cost_sort doesn't produce a clear winner), use the
>>> order given by the user
>>>
>>
>> I don't see why not to generate all possible orderings (keeping only the
>> cheapest path for each pathkey variant) and let the optimizer sort it
>> out.
> 
> I'm assuming costing the full N! possible orderings would be
> prohibitively expensive.
> 

That's possible, yes - particularly for large values of N. Perhaps there's
a simpler algorithm to compute a sufficient approximation, e.g. in a greedy
way, starting from some likely-good combination of columns. I haven't
thought about this part very much ...

>> If the user specified an explicit ORDER BY, we'll slap an extra
>> Sort node on top, increasing the cost.
> 
> I think you misunderstood me. I'm saying the exact opposite.
> 
> When I mention handicap, I mean to give the explicitly requested group
> by order a *reduced* cost, to give it an advantage over the
> heuristics.
> 

I did understand you. I'm saying that if the ordering does not match the
one requested by the user (in ORDER BY), we end up adding an explicit
Sort node on top of it, which increases the cost of all the other paths,
acting as a disadvantage.

But now I realize this only works for when there is an ORDER BY clause,
not for GROUP BY (in which case we don't add any Sort node).

> This is to try to attack the problem you mentioned where users already
> accounting for operator costs in their queries would get overridden by
> the planner, perhaps in detriment of overall execution time.
> 
> In essence:
> 
> - If the user requested that order, we assume it "somewhat
> subjectively better" (by giving it a slightly reduced cost).
> - If there is an explicit order by clause that differs from a
> considered path, the required sort node will already penalize it
> appropriately, nothing needs to be done in relation to sort costs.
> - All other things being equal, cost_sort will make the decision. If a
> plan beats the user-provided order in spite of the handicap, it will
> be used. So it's still optimizing clear cases.
> 

OK. I 

Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Tomas Vondra

On 06/06/2018 04:01 PM, Andres Freund wrote:

Hi,

On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote:

The other issue is that serialize/deserialize is only a part of the problem -
you also need to know how to do "combine", and not all aggregates can do
that ... (certainly not in a universal way).


There are several schemes where only serialize/deserialize are needed,
no?  There are a number of fairly sensible schemes where there won't be
multiple transition values for the same group, no?



Possibly, not sure what schemes you have in mind exactly ...

But if you know there's only a single transition value, why would you 
need serialize/deserialize at all? Why couldn't you just finalize the 
value and serialize that?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: effect of JIT tuple deform?

2018-06-27 Thread Tomas Vondra

On 06/26/2018 09:25 PM, Pavel Stehule wrote:

Hi

...

So I am able to see the effect of jit_tuple_deforming, and very well, but 
only if optimization is active. When optimization is not active, 
jit_tuple_deforming causes a slowdown.


So maybe the use of jit_tuple_deforming could be conditioned on 
jit_optimization?




Can you share the test case and some detail about the hardware and 
PostgreSQL configuration?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-27 Thread Tomas Vondra

On 06/27/2018 11:44 AM, Konstantin Knizhnik wrote:


...

I have improved my WAL prefetch patch. The main reason for the slowdown of 
recovery speed with prefetch enabled was that it didn't take into account 
initialized pages (XLOG_HEAP_INIT_PAGE)

and didn't remember (cache) full page writes.
The main differences of new version of the patch:

1. Use effective_cache_size as size of cache of prefetched blocks
2. Do not prefetch blocks sent in shared buffers
3. Do not prefetch blocks  for RM_HEAP_ID with XLOG_HEAP_INIT_PAGE bit set
4. Remember new/fpw pages in the prefetch cache, to avoid prefetching them 
for subsequent WAL records.
5. Add min/max prefetch lead parameters to make it possible to 
synchronize speed of prefetch with speed of replay.
6. Increase size of open file cache to avoid redundant open/close 
operations.




Thanks. I plan to look at it and do some testing, but I won't have time 
until the end of next week (probably).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-06-24 Thread Tomas Vondra
Hi all,

Attached is a rebased version of this patch series, mostly just fixing
the breakage caused by reworked format of initial catalog data.

Aside from that, the MCV building now adopts the logic introduced by
commit b5db1d93d2 for single-column MCV lists. The new algorithm seems
pretty good and I don't see why multi-column MCV lists should use
something special.

I'm sure there are plenty of open questions to discuss, particularly
stuff related to combining the various types of statistics into the final
estimate (a lot of that was already improved based on Dean's reviews).

One thing that occurred to me while comparing the single-column logic (as
implemented in selfuncs.c) and the new multi-column stuff is dealing
with partially-matching histogram buckets.

In the single-column case, we pretty much assume uniform distribution in
each bucket, and linearly interpolate the selectivity. So for a bucket
with boundaries [0, 10] and condition "x <= 5" we return 0.5, for "x <
7" we return 0.7 and so on. This is what convert_to_scalar() does.

In the multi-column case, we simply count each matching bucket as 0.5,
without any attempts to linearly interpolate. It would not be difficult
to call "convert_to_scalar" for each condition (essentially repeating
the linear interpolation for each column), but then what? We could
simply compute a product of those results, of course, but that only
works assuming independence. And that's exactly the wrong thing to
assume here, considering the extended statistics are meant for cases
where the columns are not independent.

So I'd argue the 0.5 estimate for partially-matching buckets is the
right thing to do here, as it's minimizing the average error.
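
For reference, a tiny sketch of the two approaches (illustrative only, not
the actual patch code):

/*
 * Single-column case: linear interpolation within a bucket, the way the
 * convert_to_scalar() based estimation works - e.g. bucket [0,10] and
 * "x < 7" gives 0.7.
 */
static double
bucket_fraction(double lo, double hi, double value)
{
    if (value <= lo)
        return 0.0;
    if (value >= hi)
        return 1.0;
    return (value - lo) / (hi - lo);
}

/*
 * Hypothetical multi-column variant: multiplying the per-column fractions
 * implicitly assumes the columns are independent, which is the very
 * assumption extended statistics are supposed to avoid - hence the flat
 * 0.5 per partially-matching bucket instead.
 */
static double
bucket_fraction_multi(int ncols, const double *lo, const double *hi,
                      const double *value)
{
    double      product = 1.0;

    for (int i = 0; i < ncols; i++)
        product *= bucket_fraction(lo[i], hi[i], value[i]);

    return product;
}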


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-multivariate-MCV-lists.patch.gz
Description: application/gzip


0002-multivariate-histograms.patch.gz
Description: application/gzip


Re: WIP: BRIN multi-range indexes

2018-06-24 Thread Tomas Vondra
On 06/24/2018 11:39 PM, Thomas Munro wrote:
> On Sun, Jun 24, 2018 at 2:01 PM, Tomas Vondra
>  wrote:
>> Attached is rebased version of this BRIN patch series, fixing mostly the
>> breakage due to 372728b0 (aka initial-catalog-data format changes). As
>> 2018-07 CF is meant for almost-ready patches, this is more a 2018-09
>> material. But perhaps someone would like to take a look - and I'd have
>> to fix it anyway ...
> 
> Hi Tomas,
> 
> FYI Windows doesn't like this:
> 
>   src/backend/access/brin/brin_bloom.c(146): warning C4013: 'round'
> undefined; assuming extern returning int
> [C:\projects\postgresql\postgres.vcxproj]
> 
>   brin_bloom.obj : error LNK2019: unresolved external symbol round
> referenced in function bloom_init
> [C:\projects\postgresql\postgres.vcxproj]
> 

Thanks, I've noticed the failure before, but was not sure what's the
exact cause. It seems there's still no 'round' on Windows, so I'll
probably fix that by using rint() instead, or something like that.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

2018-06-24 Thread Tomas Vondra


On 06/25/2018 12:31 AM, Tomas Vondra wrote:
> On 06/24/2018 11:39 PM, Thomas Munro wrote:
>> On Sun, Jun 24, 2018 at 2:01 PM, Tomas Vondra
>>  wrote:
>>> Attached is rebased version of this BRIN patch series, fixing mostly the
>>> breakage due to 372728b0 (aka initial-catalog-data format changes). As
>>> 2018-07 CF is meant for almost-ready patches, this is more a 2018-09
>>> material. But perhaps someone would like to take a look - and I'd have
>>> to fix it anyway ...
>>
>> Hi Tomas,
>>
>> FYI Windows doesn't like this:
>>
>>   src/backend/access/brin/brin_bloom.c(146): warning C4013: 'round'
>> undefined; assuming extern returning int
>> [C:\projects\postgresql\postgres.vcxproj]
>>
>>   brin_bloom.obj : error LNK2019: unresolved external symbol round
>> referenced in function bloom_init
>> [C:\projects\postgresql\postgres.vcxproj]
>>
> 
> Thanks, I've noticed the failure before, but was not sure what's the
> exact cause. It seems there's still no 'round' on Windows, so I'll
> probably fix that by using rint() instead, or something like that.
> 

OK, here is a version tweaked to use floor()/ceil() instead of round().
Let's see if the Windows machine likes that more.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Pass-all-keys-to-BRIN-consistent-function-a-20180625.patch.gz
Description: application/gzip


0002-Move-IS-NOT-NULL-checks-to-bringetbitmap-20180625.patch.gz
Description: application/gzip


0003-BRIN-bloom-indexes-20180625.patch.gz
Description: application/gzip


0004-BRIN-multi-range-minmax-indexes-20180625.patch.gz
Description: application/gzip


Re: WIP: BRIN multi-range indexes

2018-06-23 Thread Tomas Vondra
general).

Of course, how to size the bloom filter? It's worth noting the false
positive rate of the filter is essentially the fraction of the table that
will be scanned every time.

Similarly to the minmax-multi, parameters for computing optimal filter
size are set as reloptions (false_positive_rate, n_distinct_per_range)
with some reasonable defaults (1% false positive rate and distinct
values 10% of maximum heap tuples in a page range).
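
For reference, the sizing itself is just the textbook Bloom filter math; a
standalone sketch with illustrative input values (not the actual reloption
handling):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      false_positive_rate = 0.01; /* the 1% default */
    double      ndistinct_per_range = 3000; /* illustrative value */

    /* optimal number of bits: m = -n * ln(p) / (ln 2)^2 */
    double      nbits = -ndistinct_per_range * log(false_positive_rate)
                        / pow(log(2.0), 2.0);

    /* optimal number of hash functions: k = (m / n) * ln 2 */
    double      nhashes = (nbits / ndistinct_per_range) * log(2.0);

    /* the bitmap has to fit on a single 8kB page (see below) */
    printf("bitmap: %.0f bits (~%.0f bytes), %.1f hash functions\n",
           nbits, ceil(nbits / 8.0), nhashes);
    return 0;
}

For 3000 distinct values and a 1% false positive rate this comes out at
roughly 3.6kB of bitmap and 6 or 7 hash functions, which still fits
comfortably on an 8kB page.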

Note: When building the filter, we don't compute the hashes from the
original values, but we first use the type-specific hash function (the
same we'd use for hash indexes or hash joins) and then use the hash as
an input for the bloom filter. This generally works fine, but if "our"
hash function generates a lot of collisions, it increases the false
positive rate of the whole filter. I'm not aware of a case where this would be
an issue, though.

What further complicates sizing of the bloom filter is available space -
the whole bloom filter needs to fit onto an 8kB page, and "full" bloom
filters with about 1/2 the bits set are pretty non-compressible. So
there's maybe ~8000 bytes for the bitmap. So for columns with many
distinct values, it may be necessary to make the page range smaller, to
reduce the number of distinct values in it.

And of course it requires good ndistinct estimates, not just for the
column as a whole, but for a single page range (because that's what
matters for sizing the bloom filter). Which is not a particularly
reliable estimate, I'm afraid.

So reloptions seem like a sufficient solution, at least for now.


open questions
==

* I suspect the definitions of the cross-type opclasses (int2 vs. int8) are
not entirely correct. That probably needs another look.

* The bloom filter now works in two modes - sorted (where in the sorted
mode it stores the hashes directly) and hashed (the usual bloom filter
behavior). The idea is that for ranges with significantly fewer distinct
values, we only store those to save space (instead of allocating the
whole bloom filter with mostly 0 bits).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz
Description: application/gzip


0002-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz
Description: application/gzip


0003-BRIN-bloom-indexes.patch.gz
Description: application/gzip


0004-BRIN-multi-range-minmax-indexes.patch.gz
Description: application/gzip


Re: Push down Aggregates below joins

2018-06-20 Thread Tomas Vondra
On 06/20/2018 10:12 PM, Heikki Linnakangas wrote:
> Currently, the planner always first decides the scan/join order, and
> adds Group/Agg nodes on top of the joins. Sometimes it would be legal,
> and beneficial, to perform the aggregation below a join. I've been
> hacking on a patch to allow that.
> 

There was a patch [1] from Antonin Houska aiming to achieve something
similar. IIRC it aimed to push the aggregate down in more cases,
leveraging the partial aggregation stuff. I suppose your patch only aims
to do the pushdown when the two-phase aggregation is not needed?

[1] https://www.postgresql.org/message-id/flat/9666.1491295317@localhost

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



memory leak when serializing TRUNCATE in reorderbuffer

2018-06-20 Thread Tomas Vondra

Hi,

While rebasing the logical replication patches on top of PG11, I've 
noticed that ReorderBufferSerializeChange claims this:


case REORDER_BUFFER_CHANGE_TRUNCATE:
...
/* ReorderBufferChange contains everything important */

That is not quite correct, though - the OIDs of truncated relations are 
stored in a separately palloc-ed array. So we only serialize the pointer 
to that array (which is part of ReorderBufferChange) and then read it 
back when restoring the change from disk.


Now, this can't cause crashes, because the 'relids' array won't go away 
(more on this later), and so the pointer remains valid. But it's a 
memory leak - a quite small and not very common one, because people 
don't do TRUNCATE very often, particularly not with many tables.


So I think we should fix that and serialize/restore the OID array, just like 
we do for tuples, snapshots etc. See the attached fix.


Another thing we should probably reconsider is where the relids array is 
allocated - the pointer remains valid because we happen to allocate it 
in TopMemoryContext. It's not that bad because we don't free the other 
reorderbuffer contexts until the walsender exits anyway, but still.


So I propose to allocate it in rb->context just like the other bits of 
data (snapshots, ...). Replacing the palloc() in DecodeTruncate() with 
something like:


    MemoryContextAlloc(ctx->reorder->context,
                       xlrec->nrelids * sizeof(Oid));

should do the trick. The other places in decode.c don't allocate memory 
directly but call ReorderBufferGetTupleBuf() instead - perhaps we should 
introduce a similar wrapper here too.
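
Something like this, perhaps (just a sketch, the name is of course up for
discussion):

/*
 * Hypothetical wrapper, mirroring ReorderBufferGetTupleBuf(): allocate the
 * relid array in the reorderbuffer's own memory context rather than in
 * TopMemoryContext.
 */
Oid *
ReorderBufferGetRelids(ReorderBuffer *rb, int nrelids)
{
	return (Oid *) MemoryContextAlloc(rb->context, nrelids * sizeof(Oid));
}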



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e2f59bf580..582bedf1de 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -412,10 +412,14 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 			/* no data in addition to the struct itself */
+		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			if (change->data.truncate.relids != NULL)
+				pfree(change->data.truncate.relids);
+			change->data.truncate.relids = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-		case REORDER_BUFFER_CHANGE_TRUNCATE:
 			break;
 	}
 
@@ -2289,6 +2293,26 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 break;
 			}
 		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				Size		size;
+				char	   *data;
+
+				/* account for the OIDs of truncated relations */
+				size = sizeof(Oid) * change->data.truncate.nrelids;
+				sz += size;
+
+				/* make sure we have enough space */
+				ReorderBufferSerializeReserve(rb, sz);
+
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+
+				memcpy(data, change->data.truncate.relids, size);
+				data += size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
@@ -2569,6 +2593,16 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 			/* the base struct contains all the data, easy peasy */
 		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				Size		size;
+
+				size = change->data.truncate.nrelids * sizeof(Oid);
+
+				change->data.truncate.relids = MemoryContextAllocZero(rb->context, size);
+
+				memcpy(change->data.truncate.relids, data, size);
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:


Re: WAL prefetch

2018-06-19 Thread Tomas Vondra

On 06/19/2018 02:33 PM, Konstantin Knizhnik wrote:


On 19.06.2018 14:03, Tomas Vondra wrote:


On 06/19/2018 11:08 AM, Konstantin Knizhnik wrote:


...

>>>
Also there are two points which makes prefetching into shared buffers 
more complex:
1. Need to spawn multiple workers to make prefetch in parallel and 
somehow distribute work between them.
2. Synchronize the work of the recovery process with prefetch, to prevent 
prefetch from going too far and doing useless work.
The same problem exists for prefetch into the OS cache, but there the risk of 
false prefetch is less critical.




I think the main challenge here is that all buffer reads are currently 
synchronous (correct me if I'm wrong), while the posix_fadvise() 
allows us to prefetch the buffers asynchronously.


Yes, this is why we have to spawn several concurrent background workers 
to perform prefetch.


Right. My point is that while spawning bgworkers probably helps, I don't 
expect it to be enough to fill the I/O queues on modern storage systems. 
Even if you start say 16 prefetch bgworkers, that's not going to be 
enough for large arrays or SSDs. Those typically need way more than 16 
requests in the queue.


Consider for example [1] from 2014 where Merlin reported how S3500 
(Intel SATA SSD) behaves with different effective_io_concurrency values:


[1] 
https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com


Clearly, you need to prefetch 32/64 blocks or so. Consider you may have 
multiple such devices in a single RAID array, and that this device is 
from 2014 (and newer flash devices likely need even deeper queues).


ISTM a small number of bgworkers is not going to be sufficient. It might 
be enough for WAL prefetching (where we may easily run into the 
redo-is-single-threaded bottleneck), but it's hardly a solution for 
bitmap heap scans, for example. We'll need to invent something else for 
that.


OTOH my guess is that whatever solution we'll end up implementing for 
bitmap heap scans, it will be applicable for WAL prefetching too. Which 
is why I'm suggesting simply using posix_fadvise is not going to make 
the direct I/O patch significantly more complicated.
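
For completeness, the posix_fadvise prefetching itself is pretty simple -
roughly something like this sketch (not the actual patch code; the fd and
block number handling is omitted):

#include <fcntl.h>

#define BLCKSZ 8192

/*
 * Hint the kernel to read the given relation blocks ahead of time. The
 * calls are non-blocking; the actual reads happen asynchronously, which is
 * what makes it possible to keep a deep I/O queue busy.
 */
static void
prefetch_blocks(int fd, const long *blocks, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
        (void) posix_fadvise(fd, (off_t) blocks[i] * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
}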



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-19 Thread Tomas Vondra




On 06/19/2018 04:50 PM, Konstantin Knizhnik wrote:



On 19.06.2018 16:57, Ants Aasma wrote:
On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:


Right. My point is that while spawning bgworkers probably helps, I
don't
expect it to be enough to fill the I/O queues on modern storage
systems.
Even if you start say 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more
than 16
requests in the queue.

Consider for example [1] from 2014 where Merlin reported how S3500
(Intel SATA SSD) behaves with different effective_io_concurrency
values:

[1]

https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com

Clearly, you need to prefetch 32/64 blocks or so. Consider you may
have
multiple such devices in a single RAID array, and that this device is
from 2014 (and newer flash devices likely need even deeper queues).


For reference, a typical datacenter SSD needs a queue depth of 128 to 
saturate a single device. [1] Multiply that appropriately for RAID 
arrays. So


How it is related with results for S3500  where this is almost now 
performance improvement for effective_io_concurrency >8?
Starting 128 or more workers for performing prefetch is definitely not 
acceptable...




I'm not sure what you mean by "almost now performance improvement", but 
I guess you meant "almost no performance improvement" instead?


If that's the case, it's not quite true - increasing the queue depth 
above 8 further improved the throughput by about ~10-20% (both by 
duration and peak throughput measured by iotop).


But more importantly, this is just a single device - you typically have 
multiple of them in a larger arrays, to get better capacity, performance 
and/or reliability. So if you have 16 such drives, and you want to send 
at least 8 requests to each, suddenly you need at least 128 requests.


And as pointed out before, the S3500 is about a 5-year-old device (it was 
introduced in Q2/2013). On newer devices the difference is usually way 
more significant / the required queue depth is much higher.


Obviously, this is a somewhat simplified view, ignoring various details 
(e.g. that there may be multiple concurrent queries, each sending I/O 
requests - what matters is the combined number of requests, of course). 
But I don't think this makes a huge difference.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2017-12-23 Thread Tomas Vondra


On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
> On 2017-12-23 05:57, Tomas Vondra wrote:
>> Hi all,
>>
>> Attached is a patch series that implements two features for logical
>> replication - ability to define a memory limit for the reorderbuffer
>> (responsible for building the decoded transactions), and ability to
>> stream large in-progress transactions (exceeding the memory limit).
>>
> 
> logical replication of 2 instances is OK but 3 and up fail with:
> 
> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
> "reorderbuffer.c", Line: 1773)
> 
> I can cobble up a script but I hope you have enough from the assertion
> to see what's going wrong...

The assertion says that the iterator produces changes in an order that does
not correlate with LSN. But I have a hard time understanding how that
could happen, particularly because according to the line number this
happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.

So instructions to reproduce the issue would be very helpful.

Attached is v2 of the patch series, fixing two bugs I discovered today.
I don't think any of these is related to your issue, though.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch.gz
Description: application/gzip


0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch.gz
Description: application/gzip


0003-Issue-individual-invalidations-with-wal_level-log-v2.patch.gz
Description: application/gzip


0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch.gz
Description: application/gzip


0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch.gz
Description: application/gzip


0006-Add-support-for-streaming-to-built-in-replication-v2.patch.gz
Description: application/gzip


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2017-12-23 Thread Tomas Vondra


On 12/23/2017 11:23 PM, Erik Rijkers wrote:
> On 2017-12-23 21:06, Tomas Vondra wrote:
>> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>>> Hi all,
>>>>
>>>> Attached is a patch series that implements two features for logical
>>>> replication - ability to define a memory limit for the reorderbuffer
>>>> (responsible for building the decoded transactions), and ability to
>>>> stream large in-progress transactions (exceeding the memory limit).
>>>>
>>>
>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>
>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>> "reorderbuffer.c", Line: 1773)
>>>
>>> I can cobble up a script but I hope you have enough from the assertion
>>> to see what's going wrong...
>>
>> The assertion says that the iterator produces changes in order that does
>> not correlate with LSN. But I have a hard time understanding how that
>> could happen, particularly because according to the line number this
>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>
>> So instructions to reproduce the issue would be very helpful.
> 
> Using:
> 
> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
> 
> As you expected the problem is the same with these new patches.
> 
> I have now tested more, and seen that it does not always fail.  I guess
> that here it fails 3 times out of 4.  But the laptop I'm using at the moment
> is old and slow -- it may well be a factor as we've seen before [1].
> 
> Attached is the bash that I put together.  I tested with
> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
> often.  This same program run with HEAD never seems to fail (I tried a
> few dozen times).
> 

Thanks. Unfortunately I still can't reproduce the issue. I even tried
running it in valgrind, to see if there are some memory access issues
(which should also slow it down significantly).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2017-12-24 Thread Tomas Vondra


On 12/24/2017 05:51 AM, Craig Ringer wrote:
> On 23 December 2017 at 12:57, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> 
> Hi all,
> 
> Attached is a patch series that implements two features for logical
> replication - ability to define a memory limit for the reorderbuffer
> (responsible for building the decoded transactions), and ability to
> stream large in-progress transactions (exceeding the memory limit).
> 
> I'm submitting those two changes together, because one builds on the
> other, and it's beneficial to discuss them together.
> 
> 
> PART 1: adding logical_work_mem memory limit (0001)
> ---
> 
> Currently, limiting the amount of memory consumed by logical decoding is
> tricky (or you might say impossible) for several reasons:
> 
> * The value is hard-coded, so it's not quite possible to customize it.
> 
> * The amount of decoded changes to keep in memory is restricted by
> number of changes. It's not very clear how this relates to memory
> consumption, as the change size depends on table structure, etc.
> 
> * The number is "per (sub)transaction", so a transaction with many
> subtransactions may easily consume significant amount of memory without
> actually hitting the limit.
> 
> 
> Also, even without subtransactions, we assemble a ReorderBufferTXN
> per transaction. Since transactions usually occur concurrently,
> systems with many concurrent txns can face lots of memory use.
> 

I don't see how that could be a problem, considering the number of
toplevel transactions is rather limited (to max_connections or so).

> We can't exclude tables that won't actually be replicated at the reorder
> buffering phase either. So txns use memory whether or not they do
> anything interesting as far as a given logical decoding session is
> concerned. Even if we'll throw all the data away we must buffer and
> assemble it first so we can make that decision.

Yep.

> Because logical decoding considers snapshots and cid increments even
> from other DBs (at least when the txn makes catalog changes) the memory
> use can get BIG too. I was recently working with a system that had
> accumulated 2GB of snapshots ... on each slot. With 7 slots, one for
> each DB.
> 
> So there's lots of room for difficulty with unpredictable memory use.
> 

Yep.

> So the patch does two things. Firstly, it introduces logical_work_mem, a
> GUC restricting memory consumed by all transactions currently kept in
>     the reorder buffer
> 
> 
> Does this consider the (currently high, IIRC) overhead of tracking
> serialized changes?
>  

Consider in what sense?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2017-12-25 Thread Tomas Vondra


On 12/24/2017 10:00 AM, Erik Rijkers wrote:
>>>>>
>>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>>
>>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>>> "reorderbuffer.c", Line: 1773)
>>>>>
>>>>> I can cobble up a script but I hope you have enough from the assertion
>>>>> to see what's going wrong...
>>>>
>>>> The assertion says that the iterator produces changes in order that
>>>> does
>>>> not correlate with LSN. But I have a hard time understanding how that
>>>> could happen, particularly because according to the line number this
>>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming)
>>>> case.
>>>>
>>>> So instructions to reproduce the issue would be very helpful.
>>>
>>> Using:
>>>
>>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>>
>>> As you expected the problem is the same with these new patches.
>>>
>>> I have now tested more, and seen that it does not always fail.  I guess
>>> that here it fails 3 times out of 4.  But the laptop I'm using at the moment
>>> is old and slow -- it may well be a factor as we've seen before [1].
>>>
>>> Attached is the bash that I put together.  I tested with
>>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
>>> often.  This same program run with HEAD never seems to fail (I tried a
>>> few dozen times).
>>>
>>
>> Thanks. Unfortunately I still can't reproduce the issue. I even tried
>> running it in valgrind, to see if there are some memory access issues
>> (which should also slow it down significantly).
> 
> One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)
> 

Well, I've done tests on various machines, including some really slow
ones, and I still haven't managed to reproduce the failures using your
script. So I don't think that would really help. But I have reproduced
it by using a custom stress test script.

Turns out the asserts are overly strict - instead of

  Assert(prev_lsn < current_lsn);

it should have been

  Assert(prev_lsn <= current_lsn);

because some XLOG records may contain multiple rows (e.g. MULTI_INSERT).

The attached v3 fixes this issue, and also a couple of other thinkos:

1) The AssertChangeLsnOrder assert check was somewhat broken.

2) We've been sending aborts for all subtransactions, even those not yet
streamed. So downstream got confused and fell over because of an assert.

3) The streamed transactions were written to /tmp, using filenames using
subscription OID and XID of the toplevel transaction. That's fine, as
long as there's just a single replica running - if there are more, the
filenames will clash, causing really strange failures. So move the files
to base/pgsql_tmp where regular temporary files are written. I'm not
claiming this is perfect, perhaps we need to invent another location.

FWIW I believe the relation sync cache is somewhat broken by the
streaming. I thought resetting it would be good enough, but it's more
complicated (and trickier) than that. I'm aware of it, and I'll look
into that next - but probably not before 2018.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v3.patch.gz
Description: application/gzip


0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v3.patch.gz
Description: application/gzip


0003-Issue-individual-invalidations-with-wal_level-log-v3.patch.gz
Description: application/gzip


0004-Extend-the-output-plugin-API-with-stream-methods-v3.patch.gz
Description: application/gzip


0005-Implement-streaming-mode-in-ReorderBuffer-v3.patch.gz
Description: application/gzip


0006-Add-support-for-streaming-to-built-in-replication-v3.patch.gz
Description: application/gzip


Re: WIP: BRIN multi-range indexes

2018-01-07 Thread Tomas Vondra


On 12/20/2017 03:37 AM, Mark Dilger wrote:
> 
>> On Dec 19, 2017, at 5:16 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> 
>> wrote:
>>
>>
>>
>> On 12/19/2017 08:38 PM, Mark Dilger wrote:
>>>
>>>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> 
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Apparently there was some minor breakage due to duplicate OIDs, so here
>>>> is the patch series updated to current master.
>>>>
>>>> regards
>>>>
>>>> -- 
>>>> Tomas Vondra  http://www.2ndQuadrant.com
>>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>>> <0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>
>>>
>>>
>>> After applying these four patches to my copy of master, the regression
>>> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.
>>>
>>
>> D'oh! There was an incorrect OID referenced in pg_opclass, which was
>> also used by the satisfies_hash_partition() function. Fixed patches
>> attached.
> 
> Your use of type ScanKey in src/backend/access/brin/brin.c is a bit 
> confusing.  A
> ScanKey is defined elsewhere as a pointer to ScanKeyData.  When you define
> an array of ScanKeys, you use pointer-to-pointer style:
> 
> +   ScanKey   **keys,
> +  **nullkeys;
> 
> But when you allocate space for the array, you don't treat it that way:
> 
> +   keys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts);
> +   nullkeys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts);
> 
> But then again when you use nullkeys, you treat it as a two-dimensional array:
> 
> +   nullkeys[keyattno - 1][nnullkeys[keyattno - 1]] = key;
> 
> and likewise when you allocate space within keys:
> 
> +keys[keyattno - 1] = palloc0(sizeof(ScanKey) * scan->numberOfKeys);
> 
> Could you please clarify?  I have been awake a bit too long; hopefully, I am
> not merely missing the obvious.
> 

Yeah, that's wrong - it should be "sizeof(ScanKey *)" instead. It's
harmless, though, because ScanKey itself is a pointer, so the size is
the same.
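
In other words, the intended shape is something like this (a sketch with
simplified variable names):

ScanKey   **keys;       /* per-attribute arrays of ScanKey */

/* one (ScanKey *) per attribute ... */
keys = palloc0(sizeof(ScanKey *) * natts);

/* ... and for each attribute an array of ScanKey (= ScanKeyData *) */
keys[attno - 1] = palloc0(sizeof(ScanKey) * nkeys);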

Attached is an updated version of the patch series, significantly
reworking and improving the multi-minmax part (the rest of the patch is
mostly as it was before).

I've significantly refactored and cleaned up the multi-minmax part, and
I'm much happier with it - no doubt there's room for further improvement
but overall it's much better.

I've also added proper sgml docs for this part, and support for more
data types including variable-length ones (all integer types, numeric,
float-based types including timestamps, uuid, and a couple of others).

At the API level, I needed to add one extra support procedure that
measures distance between two values of the data type. This is needed so
because we only keep a limited number of intervals for each range, and
once in a while we need to decide which of them to "merge" (and we
simply merge the closest ones).

I've passed the indexes through significant testing and fixed a couple
of silly bugs / memory leaks. Let's see if there are more.

Performance-wise, the CREATE INDEX seems a bit slow - it's about an
order of magnitude slower than regular BRIN. Some of that is expected
(we simply need to do more stuff to maintain multiple ranges), but
perhaps there's room for additional improvements - that's something I'd
like to work on next.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz
Description: application/gzip


0002-BRIN-bloom-indexes.patch.gz
Description: application/gzip


0003-BRIN-multi-range-minmax-indexes.patch.gz
Description: application/gzip


0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz
Description: application/gzip


Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2018-01-08 Thread Tomas Vondra

On 01/05/2018 05:25 AM, Stephen Frost wrote:
> Tomas,
> 
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> On Sun, Nov 19, 2017 at 6:40 AM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>> Thanks for looking into this. I agree your patch is more consistent and
>>> generally cleaner.
>>
>> This is classified as a bug fix, and is marked as waiting on author. I
>> am moving it to next CF as work continues.
> 
> This is still in waiting-on-author state; it'd be great to get your 
> thoughts on where this patch is and what the next steps are to move
> it forward. If you need additional feedback or there needs to be
> more discussion (perhaps with Tom) then maybe this should be in
> needs-review instead, but it'd be great if you could review the
> discussion and patch again and at least provide an update on it (last
> update from you was in November).
> 

Hmmm, I'm not sure, TBH.

As I already mentioned, Tom's updated patch is better than what I posted
initially, and I agree with his approach to the remaining issues he
pointed out. But I somehow assumed that he's already looking into that.

Tom, do you plan to look into this patch soon, or should I?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2018-01-08 Thread Tomas Vondra


On 01/08/2018 08:39 PM, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> As I already mentioned, Tom's updated patch is better than what I
>> posted initially, and I agree with his approach to the remaining
>> issues he pointed out. But I somehow assumed that he's already
>> looking into that. Tom, do you plan to look into this patch soon,
>> or should I?
> 
> No, I thought you were going to run with those ideas. I have a lot
> of other stuff on my plate ...
> 

OK, will do.

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PROPOSAL] Shared Ispell dictionaries

2018-01-08 Thread Tomas Vondra
Hi Arthur,

Sorry for the delay, I somehow missed this thread ...

On 12/27/2017 10:20 AM, Arthur Zakirov wrote:
> On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote:
>>
>> Tomas had some workable patches related to this topic
>>
> 
> Tomas, have you planned to propose it?
> 

I believe Pavel was referring to this extension:

https://github.com/tvondra/shared_ispell

I wasn't going to submit that as an in-core solution, but I'm happy you're
making improvements in that direction. I'll take a look at your patch
shortly.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Built-in connection pooling

2018-01-18 Thread Tomas Vondra
Hi Konstantin,

On 01/18/2018 03:48 PM, Konstantin Knizhnik wrote:
> On 17.01.2018 19:09, Konstantin Knizhnik wrote:
>> Hi hackers,
>>
>> ...
> 

I haven't looked at the code yet, but after reading your message I have
a simple question - gow iss this going to work with SSL? If you're only
passing a file descriptor, that does not seem to be sufficient for the
backends to do crypto (that requires the SSL stuff from Port).

Maybe I'm missing something and it already works, though ...


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-01-19 Thread Tomas Vondra
Attached is v5, fixing a silly bug in part 0006, causing segfault when
creating a subscription.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v5.patch.gz
Description: application/gzip


0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v5.patch.gz
Description: application/gzip


0003-Issue-individual-invalidations-with-wal_level-log-v5.patch.gz
Description: application/gzip


0004-Extend-the-output-plugin-API-with-stream-methods-v5.patch.gz
Description: application/gzip


0005-Implement-streaming-mode-in-ReorderBuffer-v5.patch.gz
Description: application/gzip


0006-Add-support-for-streaming-to-built-in-replication-v5.patch.gz
Description: application/gzip


0007-Track-statistics-for-streaming-spilling-v5.patch.gz
Description: application/gzip


0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v5.patch.gz
Description: application/gzip


0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v5.patch.gz
Description: application/gzip


Re: Built-in connection pooling

2018-01-19 Thread Tomas Vondra


On 01/19/2018 05:53 PM, Konstantin Knizhnik wrote:
> 
> 
> On 19.01.2018 19:28, Pavel Stehule wrote:
>>
>>
>> When I've been thinking about adding a built-in connection
>> pool, my
>> rough plan was mostly "bgworker doing something like
>> pgbouncer" (that
>> is, listening on a separate port and proxying everything to
>> regular
>> backends). Obviously, that has pros and cons, and probably
>> would not
>> serve the threading use case well.
>>
>>
>> And we will get the same problem as with pgbouncer: one process
>> will not be able to handle all connections...
>> Certainly it is possible to start several such scheduling
>> bgworkers... But in any case it is more efficient to multiplex
>> sessions in the backends themselves.
>>
>>
>> pgbouncer hold all time client connect. When we implement the
>> listeners, then all work can be done by worker processes not by listeners.
>>
> 
> Sorry, I do not understand your point.
> In my case pgbench establish connection to the pgbouncer only  once at
> the beginning of the test.
> And pgbouncer spends all time in context switches (CPU usage is 100% and
> it is mostly in kernel space: top of profile are kernel functions).
> The same picture will be if instead of pgbouncer you will do such
> scheduling in one bgworker.
> For the modern systems are not able to perform more than several
> hundreds of connection switches per second.
> So with single multiplexing thread or process you can not get speed more
> than 100k, while at powerful NUMA system it is possible to achieve
> millions of TPS.
> It is illustrated by the results I have sent in the previous mail: by
> spawning 10 instances of pgbouncer I was able to receive 7 times bigger
> speed.
> 

AFAICS making pgbouncer multi-threaded would not be hugely complicated.
A simple solution would be a fixed number of worker threads, and client
connections randomly assigned to them.

But this generally is not a common bottleneck in practical workloads (of
course, YMMV).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Built-in connection pooling

2018-01-19 Thread Tomas Vondra


On 01/19/2018 10:52 AM, Konstantin Knizhnik wrote:
> 
> 
> On 18.01.2018 18:02, Tomas Vondra wrote:
>> Hi Konstantin,
>>
>> On 01/18/2018 03:48 PM, Konstantin Knizhnik wrote:
>>> On 17.01.2018 19:09, Konstantin Knizhnik wrote:
>>>> Hi hackers,
>>>>
>>>> ...
>> I haven't looked at the code yet, but after reading your message I have
>> a simple question - how is this going to work with SSL? If you're only
>> passing a file descriptor, that does not seem to be sufficient for the
>> backends to do crypto (that requires the SSL stuff from Port).
>>
>> Maybe I'm missing something and it already works, though ...
>>
>>
>> regards
>>
> Ooops, I missed this aspect with SSL. Thank you.
> New version of the patch which correctly maintain session context is
> attached.
> Now each session has its own allocator which should be  used instead
> of TopMemoryAllocator. SSL connections work now.
> 

OK. I've looked at the code, but I have a rather hard time understanding
it, because there are pretty much no comments explaining the intent of
the added code :-( I strongly suggest improving that, to help reviewers.

The questions I'm asking myself are mostly these:

1) When assigning a backend, we first try to get one from a pool, which
happens right at the beginning of BackendStartup. If we find a usable
backend, we send the info to the backend (pg_send_sock/pg_recv_sock).

But AFAICS this only happens at connection time, right? But in your
initial message you say "Rescheduling is done at transaction level,"
which in my understanding means "transaction pooling". So, how does that
part work?

2) How does this deal with backends for different databases? I don't see
any checks that the requested database matches the backend database (nor
any code switching the backend from one db to another - which would be
very tricky, I think).

3) Is there any sort of shrinking the pools? I mean, if the backend is
idle for certain period of time (or when we need backends for other
databases), does it get closed automatically?


Furthermore, I'm rather confused about the meaning of session_pool_size.
I mean, that GUC determines the number of backends in the pool, it has
nothing to do with sessions per se, right? Which would mean it's a bit
misleading to name it "session_..." (particularly if the pooling happens
at transaction level, not session level - which is question #1).


When I've been thinking about adding a built-in connection pool, my
rough plan was mostly "bgworker doing something like pgbouncer" (that
is, listening on a separate port and proxying everything to regular
backends). Obviously, that has pros and cons, and probably would not
serve the threading use case well.

But it would have some features that I find valuable - for example, it's
trivial to decide which connection requests may or may not be served
from a pool (by connecting to the main port or the pool port).

That is not to say the bgworker approach is better than what you're
proposing, but I wonder if that would be possible with your approach.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Built-in connection pooling

2018-01-19 Thread Tomas Vondra


On 01/19/2018 06:07 PM, Konstantin Knizhnik wrote:
> 
> ...
>
>>>> 3) Is there any sort of shrinking the pools? I mean, if the backend is
>>>> idle for certain period of time (or when we need backends for other
>>>> databases), does it get closed automatically?
>>>>
>>> When client is disconnected, client session is closed. But backend is not
>>> terminated even if there are no more sessions at this backend.
>>> It  was done intentionally, to avoid permanent spawning of new processes
>>> when there is one or few clients which frequently connect/disconnect to
>>> the database.
>>>
>> Sure, but it means a short peak will exhaust the backends indefinitely.
>> That's acceptable for a PoC, but I think it needs to be fixed eventually.
>>
> Sorry, I do not understand it.
> You specify the size of the backend pool which will serve client sessions.
> Size of this pool is chosen to provide the best performance at the
> particular system and workload.
> So number of backends will never exceed this optimal value even in case
> of "short peak".
> From my point of view terminating backends when there are no active
> sessions is wrong idea in any case, it was not temporary decision just
> for PoC.
> 

That is probably true when there is just a single pool (for one
database/user). But when there are multiple such pools, it forces you to
keep the sum(pool_size) below max_connections. Which seems strange.

I do think the ability to evict backends after some timeout, or when
there is pressure in other pools (different user/database) is rather useful.

>>
>> Well, I haven't said it has to be single-threaded like pgbouncer. I
>> don't see why the bgworker could not use multiple threads internally (of
>> course, it'd need to be not to mess the stuff that is not thread-safe).
>>
> 
> Certainly architecture with N multiple scheduling bgworkers and M
> executors (backends) may be more flexible
> than solution when scheduling is done in executor itself. But we will
> have to pay extra cost for redirection.
>
> I am not sure that finally it will allow to reach better performance.
> More flexible solution in many cases doesn't mean more efficient solution.
> 

Sure, I wasn't really suggesting it's a clear win. I was responding to
your argument that pgbouncer in some cases reaches 100% CPU utilization
- that can be mitigated to a large extent by adding threads. Of course,
the cost for extra level of indirection is not zero.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Built-in connection pooling

2018-01-19 Thread Tomas Vondra


On 01/19/2018 07:35 PM, Claudio Freire wrote:
> 
> 
> On Fri, Jan 19, 2018 at 2:22 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com <mailto:tomas.von...@2ndquadrant.com>> wrote:
> 
> 
> 
> On 01/19/2018 06:13 PM, Claudio Freire wrote:
> >
> >
> > On Fri, Jan 19, 2018 at 2:07 PM, Konstantin Knizhnik
> > <k.knizh...@postgrespro.ru> wrote:
> >
> >
> >
> >
> >         Well, I haven't said it has to be single-threaded like
> pgbouncer. I
> >         don't see why the bgworker could not use multiple threads
> >         internally (of
> >         course, it'd need to be not to mess the stuff that is not
> >         thread-safe).
> >
> >
> >     Certainly architecture with N multiple scheduling bgworkers and M
> >     executors (backends) may be more flexible
> >     than solution when scheduling is done in executor itself. But we
> >     will have to pay extra cost for redirection.
> >     I am not sure that finally it will allow to reach better
> performance.
> >     More flexible solution in many cases doesn't mean more efficient
> >     solution.
> >
> >
> > I think you can take the best of both worlds.
> >
> > You can take your approach of passing around fds, and build a "load
> > balancing protocol" in a bgworker.
> >
> > The postmaster sends the socket to the bgworker, the bgworker
> waits for
> > a command as pgbouncer does, but instead of proxying everything, when
> > commands arrive, it passes the socket to a backend to handle.
> >
> > That way, the bgworker can do what pgbouncer does, handle different
> > pooling modes, match backends to databases, etc, but it doesn't
> have to
> > proxy all data, it just delegates handling of a command to a backend,
> > and forgets about that socket.
> >
> > Sounds like it could work.
> >
> 
> How could it do all that without actually processing all the data? For
> example, how could it determine the statement/transaction boundaries?
> 
> 
> It only needs to determine statement/transaction start.
> 
> After that, it hands off the connection to a backend, and the
> backend determines when to give it back.
> 
> So instead of processing all the data, it only processes a tiny part of it.
> 

How exactly would the backend "give back" the connection? The only way
for the backend and pgbouncer to communicate is by embedding information
in the data stream. Which means pgbouncer still has to parse it.

Furthermore, those are not the only bits of information pgbouncer may
need. For example, if pgbouncer gets improved to handle prepared
statements (which is likely) it'd need to handle PARSE/BIND/EXECUTE. And
it already needs to handle SET parameters. And so on.

In any case, this discussion is somewhat off topic in this thread, so
let's not hijack it.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


