Re: [HACKERS] Custom compression methods
Hi,

On 11/27/2017 04:52 PM, Ildus Kurbangaliev wrote:
> ...
>
> Hi. This looks like a serious bug, but I couldn't reproduce it yet.
> Did you upgrade some old database or this bug happened after
> insertion of all data to new database? I tried using your 'archie'
> tool to download mailing lists and insert them to database, but
> couldn't catch any errors.

I can trigger it pretty reliably with these steps:

    git checkout f65d21b258085bdc8ef2cc282ab1ff12da9c595c
    patch -p1 < ~/custom_compression_methods_v6.patch
    ./configure --enable-debug --enable-cassert \
      CFLAGS="-fno-omit-frame-pointer -O0 -DRANDOMIZE_ALLOCATED_MEMORY" \
      --prefix=/home/postgres/pg-compress
    make -s clean && make -s -j4 install
    cd contrib/
    make -s clean && make -s -j4 install

    export PATH=/home/postgres/pg-compress/bin:$PATH
    pg_ctl -D /mnt/raid/pg-compress init
    pg_ctl -D /mnt/raid/pg-compress -l compress.log start
    createdb archie
    cd ~/archie/sql/
    psql archie < create.sql

    ~/archie/bin/load.py --workers 4 --db archie */* > load.log 2>&1

I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first tried
without it, and it seemed to work fine). If that's the case, I bet there
is a palloc that should have been palloc0, or something like that.

If you still can't reproduce it, I may give you access to this machine
so that you can debug it there.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
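[Editorial note: a minimal standalone sketch of why RANDOMIZE_ALLOCATED_MEMORY exposes missing zero-initialization. It uses malloc/calloc-style wrappers as stand-ins for palloc/palloc0 (names and the struct are illustrative, not from the patch); the real macro fills new allocations with a garbage byte pattern so any code relying on "fresh memory is zero" fails immediately instead of working by accident.]

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Stand-in for palloc with randomization enabled: freshly allocated
 * memory is filled with a garbage pattern instead of whatever happened
 * to be there (which is often zeroes on a fresh heap). */
static void *
palloc_randomized(size_t size)
{
    void *ptr = malloc(size);
    memset(ptr, 0x7F, size);    /* simulate randomized allocation */
    return ptr;
}

/* Stand-in for palloc0: explicitly zeroed, so it behaves identically
 * with or without randomization. */
static void *
palloc0_randomized(size_t size)
{
    void *ptr = malloc(size);
    memset(ptr, 0, size);
    return ptr;
}

/* A struct where surrounding code might silently rely on a zeroed
 * flag field - the kind of latent bug randomization surfaces. */
typedef struct
{
    int32_t value;
    int32_t is_compressed;      /* code assumes 0 unless explicitly set */
} DemoHeader;
```

With randomization, the un-zeroed flag reads as "set", which is exactly the class of palloc-vs-palloc0 bug suspected above.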
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Hi,

Attached is an updated version of the patch series, fixing the issues
reported by Mark Dilger:

1) Fix the fabs() issue in histogram.c.

2) Do not rely on extra_data being StdAnalyzeData, and instead look up
the LT operator explicitly. This also adds a simple regression test to
make sure ANALYZE on arrays works fine, but perhaps we should invent
some simple queries too.

3) I've removed / clarified some of the comments mentioned by Mark.

4) I haven't changed how the statistics kinds are defined in
relation.h, but I agree there should be a comment explaining how
STATS_EXT_INFO_* relate to StatisticExtInfo.kinds.

5) The most significant change happened in histograms. There used to be
two structures for histograms:

- MVHistogram - expanded (no deduplication etc.), result of the
  histogram build, and never used for estimation

- MVSerializedHistogram - deduplicated to save space, produced from
  MVHistogram before storing in pg_statistic_ext, and used for
  estimation

So there wasn't really any reason to expose the "non-serialized"
version outside histogram.c. It was just confusing and unnecessary, so
I've moved MVHistogram to histogram.c (and renamed it to
MVHistogramBuild), and renamed MVSerializedHistogram to MVHistogram.
And the same for the MVBucket stuff. So now we only deal with
MVHistogram everywhere, except in histogram.c.

6) I've also made MVHistogram include a varlena header directly (and be
packed as a bytea), which allows us to store it without having to call
any serialization functions.

I guess we should do (5) and (6) for the MCV lists too, as it seems
more convenient than the current approach. And perhaps even for the
statistics added in 9.6 (it does not change the storage format).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-multivariate-MCV-lists.patch.gz
Description: application/gzip

0002-multivariate-histograms.patch.gz
Description: application/gzip
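[Editorial note: a hypothetical sketch of the layout change in (6) - embedding the varlena length word directly in the statistics struct so the in-memory form can be stored as a bytea without a separate serialization pass. Field names here are illustrative, not the actual patch.]

```c
#include <stdint.h>

/* Sketch of a statistics struct that is itself a valid varlena value:
 * the first int32 is the varlena length header (set via SET_VARSIZE in
 * PostgreSQL proper), so the whole thing can be handed to the catalog
 * as a bytea with no intermediate serialize step. */
typedef struct MVHistogramSketch
{
    int32_t  vl_len_;       /* varlena header (never accessed directly) */
    uint32_t magic;         /* identifies the statistics kind */
    uint32_t type;          /* histogram encoding type */
    uint32_t nbuckets;      /* number of buckets */
    uint32_t ndimensions;   /* number of dimensions */
    /* deduplicated values and bucket data packed after the header */
} MVHistogramSketch;
```

The header is fixed-size with no padding, so the variable-length payload starts at a predictable offset right after it.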
simplehash: tb->sizemask = 0
Hi,

I'm a bit puzzled by this code in SH_COMPUTE_PARAMETERS:

    if (tb->size == SH_MAX_SIZE)
        tb->sizemask = 0;
    else
        tb->sizemask = tb->size - 1;

Doesn't that mean that with SH_MAX_SIZE we end up with sizemask being 0
(i.e. no bits set)? At least that's what I get from

    printf("%#x\n", (unsigned int)0);

That would mean SH_INITIAL_BUCKET/SH_NEXT/SH_PREV can only ever return
bucket 0, no?

I don't think we're building hash tables with 2^32 buckets, though.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
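[Editorial note: a minimal illustration of the concern raised above - not the simplehash code itself. Buckets are selected by masking the hash with sizemask, so a sizemask of 0 collapses every hash onto bucket 0, whereas size - 1 is a proper mask for power-of-two table sizes.]

```c
#include <stdint.h>

/* Bucket selection by mask, as simplehash-style tables do: for a
 * power-of-two size, (size - 1) has all low bits set, so the mask is
 * equivalent to hash % size. With sizemask == 0, every hash value
 * lands in bucket 0. */
static uint32_t
initial_bucket(uint32_t hash, uint32_t sizemask)
{
    return hash & sizemask;
}
```

Any two distinct hashes masked with 0 collide on bucket 0, which is what the question above is pointing at.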
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Hi,

On 11/25/2017 09:23 PM, Mark Dilger wrote:
>
>> On Nov 18, 2017, at 12:28 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Attached is an updated version of the patch, adopting the psql describe
>> changes introduced by 471d55859c11b.
>>
>> regards
>>
>> --
>> Tomas Vondra                  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> <0001-multivariate-MCV-lists.patch.gz><0002-multivariate-histograms.patch.gz>
>
> Hello Tomas,
>
> After applying both your patches, I get a warning:
>
> histogram.c:1284:10: warning: taking the absolute value of unsigned type
> 'uint32' (aka 'unsigned int') has no effect [-Wabsolute-value]
>         delta = fabs(data->numrows);
>                 ^
> histogram.c:1284:10: note: remove the call to 'fabs' since unsigned values
> cannot be negative
>         delta = fabs(data->numrows);
>                 ^~~~
> 1 warning generated.

Hmm, yeah. The fabs() call is unnecessary, and probably a remnant from
some previous version where the field was not uint32. I wonder why
you're getting the warning and I don't, though. What compiler are you
using?

> Looking closer at this section, there is some odd integer vs. floating
> point arithmetic happening that is not necessarily wrong, but might be
> needlessly inefficient:
>
>     delta = fabs(data->numrows);
>     split_value = values[0].value;
>
>     for (i = 1; i < data->numrows; i++)
>     {
>         if (values[i].value != values[i - 1].value)
>         {
>             /* are we closer to splitting the bucket in half? */
>             if (fabs(i - data->numrows / 2.0) < delta)
>             {
>                 /* let's assume we'll use this value for the split */
>                 split_value = values[i].value;
>                 delta = fabs(i - data->numrows / 2.0);
>                 nrows = i;
>             }
>         }
>     }
>
> I'm not sure the compiler will be able to optimize out the
> recomputation of data->numrows / 2.0 each time through the loop, since
> the compiler might not be able to prove to itself that data->numrows
> does not get changed. Perhaps you should compute it just once prior to
> entering the outer loop, store it in a variable of integer type, round
> 'delta' off and store it in an integer, and do integer comparisons
> within the loop? Just a thought.

Yeah, that's probably right. But I wonder if the loop is needed at all,
or whether we should start at i = (data->numrows / 2) instead, and walk
to the closest change of value in both directions. That would probably
save more CPU than computing numrows/2.0 only once.

The other issue in that block of code seems to be that we compare the
values using simple inequality. That probably works for pass-by-value
data types, but we should use a proper comparator (e.g.
compare_datums_simple).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
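[Editorial note: a sketch of the middle-out alternative suggested above - start at the midpoint and walk outward to the nearest index where the (sorted) value changes. Types and the function name are assumptions for illustration; the real code operates on Datum arrays with a proper comparator.]

```c
/* Find the split index closest to numrows / 2 where the sorted values
 * change, walking outward from the middle. Returns 0 if all values are
 * equal (no valid split exists). */
static int
find_split_index(const int *values, int numrows)
{
    int mid = numrows / 2;
    int offset;

    for (offset = 0; ; offset++)
    {
        int lo = mid - offset;
        int hi = mid + offset;

        /* walked past both ends - no change of value anywhere */
        if (lo <= 0 && hi >= numrows)
            return 0;

        /* prefer whichever change of value is closer to the middle */
        if (lo > 0 && values[lo] != values[lo - 1])
            return lo;
        if (hi > 0 && hi < numrows && values[hi] != values[hi - 1])
            return hi;
    }
}
```

This visits at most numrows elements in the worst case but typically only a few around the midpoint, and it evaluates the midpoint once instead of recomputing numrows / 2.0 per iteration.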
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 11/26/2017 02:17 AM, Mark Dilger wrote:
>
>> On Nov 25, 2017, at 3:33 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>>
>> On 11/25/2017 10:01 PM, Mark Dilger wrote:
>>>
>>>> On Nov 18, 2017, at 12:28 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Attached is an updated version of the patch, adopting the psql describe
>>>> changes introduced by 471d55859c11b.
>>>
>>> Hi Tomas,
>>>
>>> In src/backend/statistics/dependencies.c, you have introduced a comment:
>>>
>>> +    /*
>>> +     * build an array of SortItem(s) sorted using the multi-sort support
>>> +     *
>>> +     * XXX This relies on all stats entries pointing to the same tuple
>>> +     * descriptor. Not sure if that might not be the case.
>>> +     */
>>>
>>> Would you mind explaining that a bit more for me? I don't understand
>>> exactly what you mean here, but it sounds like the sort of thing that
>>> needs to be clarified/fixed before it can be committed. Am I
>>> misunderstanding this?
>>
>> The call right after that comment is
>>
>>     items = build_sorted_items(numrows, rows, stats[0]->tupDesc,
>>                                mss, k, attnums_dep);
>>
>> That method processes an array of tuples, and the structure is defined
>> by a "tuple descriptor" (essentially a list of attribute info - data
>> type, length, ...). We get that from stats[0] and assume all the
>> entries point to the same tuple descriptor. That's a generally safe
>> assumption, I think, because all the stats entries relate to columns
>> from the same table.
>
> Right, I got that, and tried mocking up some code to test that in an
> Assert. I did not pursue that far enough to reach any conclusion,
> however. You seem to be indicating in the comment some uncertainty
> about whether the assumption is safe. Do we need to dig into that
> further?

I don't think it's worth the effort, really. I don't think we can
really get mismatching tuple descriptors here - that could only happen
with columns coming from different tables, or something similarly
obscure.

>>> In src/backend/statistics/mcv.c, you have comments:
>>>
>>> + * FIXME: Single-dimensional MCV is sorted by frequency (descending). We
>>> + * should do that too, because when walking through the list we want to
>>> + * check the most frequent items first.
>>> + *
>>> + * TODO: We're using Datum (8B), even for data types (e.g. int4 or float4).
>>> + * Maybe we could save some space here, but the bytea compression should
>>> + * handle it just fine.
>>> + *
>>> + * TODO: This probably should not use the ndistinct directly (as computed
>>> + * from the table, but rather estimate the number of distinct values in
>>> + * the table), no?
>>>
>>> Do you intend these to be fixed/implemented prior to committing this patch?
>>
>> Actually, the first FIXME is obsolete, as build_distinct_groups returns
>> the groups sorted by frequency. I'll remove that.
>
> Ok, good. That's the one I understood least.
>
>> I think the rest is more a subject for discussion, so I'd need to hear
>> some feedback.
>
> In terms of storage efficiency, you are using float8 for the frequency,
> which is consistent with what other stats work uses, but may be
> overkill. A float4 seems sufficient to me. The extra four bytes for a
> float8 may be pretty small compared to the size of the arrays being
> stored, so I'm not sure it matters. Also, this might have been
> discussed before, and I am not asking for a reversal of decisions the
> members of this mailing list may already have reached.
>
> As for using arrays of something smaller than Datum, you'd need some
> logic to specify what the size is in each instance, and that probably
> complicates the code rather a lot. Maybe someone else has a technique
> for doing that cleanly?

Note that this is not about storage efficiency. The comment is before
statext_mcv_build, so it's actually related to the in-memory
representation. If you look into statext_mcv_serialize, it does use
typlen to only copy the number of bytes needed for each column.

>>> Further down in function statext_mcv_build, you have two loops, the
>>> first allocating memory and the second initializing the memory. There
>>> is no cle
Re: [HACKERS] CUBE seems a bit confused about ORDER BY
On 11/29/2017 06:13 AM, Michael Paquier wrote:
> On Tue, Nov 21, 2017 at 7:07 AM, Alexander Korotkov
> <a.korot...@postgrespro.ru> wrote:
>> On Mon, Nov 20, 2017 at 1:59 PM, Alexander Korotkov
>> <a.korot...@postgrespro.ru> wrote:
>>>
>>> On Mon, Nov 20, 2017 at 4:09 AM, Tomas Vondra
>>> <tomas.von...@2ndquadrant.com> wrote:
>>>>
>>>> Seems fine to me, although perhaps it should be split into two parts.
>>>> One with the cube_coord_llur fixes, and then the g_cube_distance
>>>> changes adding support for negative coordinates.
>>>
>>> Thank you for your feedback. I'll split this patch into two.
>>
>> Please find two patches attached.
>
> This got no reviews, so moved to next CF.

That is somewhat inaccurate, I believe. I won't claim it received enough
reviews, but it certainly received some. And splitting it into two does
not invalidate that, I guess.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
> On Thu, 30 Nov 2017 00:30:37 +0100
> Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>
> ...
>
>> I can imagine other interesting use cases - for example values in
>> JSONB columns often use the same "schema" (keys, nesting, ...), so
>> I can imagine building a "dictionary" of JSON keys for the whole
>> column ...
>>
>> Ildus, is this a use case you've been aiming for, or were you aiming
>> to use the new API in a different way?
>
> Thank you for such a good overview. I agree that pglz is pretty good
> as a general compression method, and there's no point in changing it,
> at least for now.
>
> I see a few useful use cases for compression methods: special
> compression methods for int[] and timestamp[] for time series, and
> yes, dictionaries for jsonb, for which I have already created an
> extension (https://github.com/postgrespro/jsonbd). It's working and
> giving promising results.

I understand the reluctance to put everything into core, particularly
for complex patches that evolve quickly. Also, not having to put
everything into core is kinda why we have extensions. But perhaps some
of the simpler cases would be good candidates for core, making it
possible to test the feature?

>> I wonder if the patch can be improved to handle this use case better.
>> For example, it requires knowledge of the actual data type, instead
>> of treating it as an opaque varlena / byte array. I see tsvector
>> compression does that by checking the typeid in the handler.
>>
>> But that fails for example with this:
>>
>>     db=# create domain x as tsvector;
>>     CREATE DOMAIN
>>     db=# create table t (a x compressed ts1);
>>     ERROR:  unexpected type 28198672 for tsvector compression handler
>>
>> which means it's a few bricks shy of properly supporting domains. But
>> I wonder if this should instead be specified in CREATE COMPRESSION
>> METHOD. I mean, something like
>>
>>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>>         TYPE tsvector;
>>
>> When no type is specified, it applies to all varlena values, otherwise
>> only to that type. Also, why not allow setting the compression as the
>> default method for a data type, e.g.
>>
>>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>>         TYPE tsvector DEFAULT;
>>
>> which would automatically add 'COMPRESSED ts1' to all tsvector columns
>> in new CREATE TABLE commands.
>
> The initial version of the patch contained ALTER syntax to change the
> compression method for whole types, but I decided to remove that
> functionality for now because the patch is already quite complex, and
> it could be added later as a separate patch.
>
> The syntax was:
>
>     ALTER TYPE <type> SET COMPRESSION <method>;
>
> Specifying the supported type for the compression method is a good
> idea. Maybe the following syntax would be better?
>
>     CREATE COMPRESSION METHOD ts1 FOR tsvector
>         HANDLER tsvector_compression_handler;

Understood. Good to know you've considered it, and I agree it doesn't
need to be there from the start (which makes the patch simpler).

>> BTW do you expect the tsvector compression to be generally useful, or
>> is it meant to be used only by the tests? If generally useful, perhaps
>> it should be created in pg_compression by default. If only for tests,
>> maybe it should be implemented as an extension in contrib (thus also
>> serving as an example of how to implement new methods).
>>
>> I haven't thought about the JSONB use case very much, but I suppose
>> that could be done using the configure/drop methods. I mean,
>> allocating the dictionary somewhere (e.g. in a table created by an
>> extension?). The configure method gets the Form_pg_attribute record,
>> so that should be enough, I guess.
>>
>> But the patch is not testing those two methods at all, which seems
>> like something that needs to be addressed before commit. I don't
>> expect a full-fledged JSONB compression extension, but something
>> simple that actually exercises those methods in a meaningful way.
>
> I will move tsvector_compression_handler to a separate extension in
> the next version. I added it more as an example, but it could also be
> used to achieve better compression for tsvectors. Tests on the mailing
> lists database ('archie' tables):
>
> usual compression:
>
> maillists=# select body_tsvector, subject_tsvector into t1 from messages;
> SELECT 1114213
> maillists=# select pg_size_pretty(pg_total_relation_size('t1'));
>
Re: [HACKERS] Custom compression methods
Hi,

I ran into another issue - after inserting some data into a table with
a tsvector column (without any compression defined), I can no longer
read the data. This is what I get in the console:

    db=# select max(md5(body_tsvector::text)) from messages;
    ERROR:  cache lookup failed for compression options 6432

and the stack trace looks like this:

    Breakpoint 1, get_cached_compression_options (cmoptoid=6432) at tuptoaster.c:2563
    2563        elog(ERROR, "cache lookup failed for compression options %u", cmoptoid);
    (gdb) bt
    #0  get_cached_compression_options (cmoptoid=6432) at tuptoaster.c:2563
    #1  0x004bf3da in toast_decompress_datum (attr=0x2b44148) at tuptoaster.c:2390
    #2  0x004c0c1e in heap_tuple_untoast_attr (attr=0x2b44148) at tuptoaster.c:225
    #3  0x0083f976 in pg_detoast_datum (datum=) at fmgr.c:1829
    #4  0x008072de in tsvectorout (fcinfo=0x2b41e00) at tsvector.c:315
    #5  0x005fae00 in ExecInterpExpr (state=0x2b414b8, econtext=0x2b25ab0,
        isnull=) at execExprInterp.c:1131
    #6  0x0060bdf4 in ExecEvalExprSwitchContext (isNull=0x7fe9bd37 "",
        econtext=0x2b25ab0, state=0x2b414b8)
        at ../../../src/include/executor/executor.h:299

It seems VARATT_IS_CUSTOM_COMPRESSED incorrectly identifies the value
as custom-compressed for some reason. Not sure why, but the tsvector
column is populated by a trigger that simply does

    NEW.body_tsvector
        := to_tsvector('english', strip_replies(NEW.body_plain));

If needed, the complete tool is here:

    https://bitbucket.org/tvondra/archie

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
Hi,

On 11/23/2017 10:38 AM, Ildus Kurbangaliev wrote:
> On Tue, 21 Nov 2017 18:47:49 +0100
> Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE:  drop cascades to 2 other objects
>> DETAIL:  drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
>
> Should be fixed in the attached patch. I've changed your extension a
> little bit according to the changes in the new patch (also in
> attachments).

Hmm, this seems to have fixed it, but only in one direction. Consider
this:

    create table t_pglz (v text);
    create table t_lz4 (v text compressed lz4);

    insert into t_pglz select repeat(md5(i::text),300)
      from generate_series(1,10) s(i);
    insert into t_lz4 select repeat(md5(i::text),300)
      from generate_series(1,10) s(i);

    \d+
     Schema |  Name  | Type  | Owner | Size  | Description
    --------+--------+-------+-------+-------+-------------
     public | t_lz4  | table | user  | 12 MB |
     public | t_pglz | table | user  | 18 MB |
    (2 rows)

    truncate t_pglz;
    insert into t_pglz select * from t_lz4;

    \d+
     Schema |  Name  | Type  | Owner | Size  | Description
    --------+--------+-------+-------+-------+-------------
     public | t_lz4  | table | user  | 12 MB |
     public | t_pglz | table | user  | 18 MB |
    (2 rows)

which is fine. But in the other direction, this happens:

    truncate t_lz4;
    insert into t_lz4 select * from t_pglz;

    \d+
                        List of relations
     Schema |  Name  | Type  | Owner | Size  | Description
    --------+--------+-------+-------+-------+-------------
     public | t_lz4  | table | user  | 18 MB |
     public | t_pglz | table | user  | 18 MB |
    (2 rows)

which means the data is still pglz-compressed. That's rather strange, I
guess, and it should compress the data using the compression method set
for the target table instead.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/01/2017 08:48 PM, Alvaro Herrera wrote:
> Ildus Kurbangaliev wrote:
>
>> If the table is big, decompression could take an eternity. That's why
>> I decided only to disable it, and the data could be decompressed using
>> the compression options.
>>
>> My idea was to keep compression options forever, since there will not
>> be many of them in one database. Still, that requires that the
>> extension is not removed.
>>
>> I will try to find a way to recompress data first in case it moves to
>> another table.
>
> I think what you should do is add a dependency between a column that
> compresses using a method, and that method. So the method cannot be
> dropped and leave compressed data behind. Since the method is part of
> the extension, the extension cannot be dropped either. If you ALTER
> the column so that it uses another compression method, then the table
> is rewritten and the dependency is removed; once you do that for all
> the columns that use the compression method, the compression method
> can be dropped.

+1 to doing the rewrite, just like for other similar ALTER TABLE
commands

> Maybe our dependency code needs to be extended in order to support
> this. I think the current logic would drop the column if you were to
> do "DROP COMPRESSION .. CASCADE", but I'm not sure we'd see that as a
> feature. I'd rather have DROP COMPRESSION always fail instead until no
> columns use it. Let's hear others' opinions on this bit though.

Why should this behave differently compared to data types? Seems quite
against POLA, if you ask me ...

If you want to remove the compression, you can do SET NOT COMPRESSED
(or whatever syntax we end up using), and then DROP COMPRESSION METHOD.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/01/2017 08:20 PM, Robert Haas wrote:
> On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> It has very little impact on this patch, as it has nothing to do with
>> columnar storage. That is, each value is compressed independently.
>
> I understand that this patch is not about columnar storage, but I
> think the idea that we may want to operate on the compressed data
> directly is not only applicable to that case.

Yeah. To clarify, my point was that column stores benefit from
compressing many values at once, and then operating on this compressed
vector. That is not what this patch is doing (or can do), of course.

But I certainly do agree that if the compression can be integrated into
the data type, allowing processing on the compressed representation,
then that will beat whatever this patch is doing, of course ...

>> I agree with these thoughts in general, but I'm not quite sure
>> what is your conclusion regarding the patch.
>
> I have not reached one. Sometimes I like to discuss problems before
> deciding what I think. :-)

That's lame! Let's make decisions without discussion ;-)

> It does seem to me that the patch may be aiming at a relatively narrow
> target in a fairly large problem space, but I don't know whether to
> label that as short-sightedness or prudent incrementalism.

I don't know either. I don't think people will start switching their
text columns to lz4 just because they can, or because they get 4% space
reduction compared to pglz.

But the ability to build per-column dictionaries seems quite powerful,
I guess. And I don't think that can be easily built directly into
JSONB, because we don't have a way to provide information about the
column (i.e. how would you fetch the correct dictionary?).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/14/2017 04:21 PM, Robert Haas wrote:
> On Wed, Dec 13, 2017 at 5:10 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>>> 2. If several data types can benefit from a similar approach, it has
>>> to be separately implemented for each one.
>>
>> I don't think the current solution improves that, though. If you want
>> to exploit internal features of individual data types, it pretty much
>> requires code customized to every such data type.
>>
>> For example you can't take the tsvector compression and just slap it
>> on tsquery, because it relies on knowledge of the internal tsvector
>> structure. So you need separate implementations anyway.
>
> I don't think that's necessarily true. Certainly, it's true that *if*
> tsvector compression depends on knowledge of internal tsvector
> structure, *then* you can't use the implementation for anything else
> (this, by the way, means that there needs to be some way for a
> compression method to reject being applied to a column of a data type
> it doesn't like).

I believe such dependency (on implementation details) is pretty much
the main benefit of datatype-aware compression methods. If you don't
rely on such assumptions, then I'd say it's a general-purpose
compression method.

> However, it seems possible to imagine compression algorithms that can
> work for a variety of data types, too. There might be a compression
> algorithm that is theoretically a general-purpose algorithm but has
> features which are particularly well-suited to, say, JSON or XML
> data, because it looks for word boundaries to decide on what strings
> to insert into the compression dictionary.

Can you give an example of such an algorithm? Because I haven't seen
one, and I find arguments based on hypothetical compression methods
somewhat suspicious.

FWIW I'm not against considering such compression methods, but OTOH it
may not be such a great primary use case to drive the overall design.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/17/2017 04:32 AM, Robert Haas wrote:
> On Thu, Dec 14, 2017 at 12:23 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Can you give an example of such an algorithm? Because I haven't seen
>> one, and I find arguments based on hypothetical compression methods
>> somewhat suspicious.
>>
>> FWIW I'm not against considering such compression methods, but OTOH
>> it may not be such a great primary use case to drive the overall
>> design.
>
> Well it isn't, really. I am honestly not sure what we're arguing
> about at this point. I think you've agreed that (1) opening avenues
> for extensibility is useful, (2) substituting a general-purpose
> compression algorithm could be useful, and (3) having datatype
> compression that is enabled through TOAST rather than built into the
> datatype might sometimes be desirable. That's more than adequate
> justification for this proposal, whether half-general compression
> methods exist or not. I am prepared to concede that there may be no
> useful examples of such a thing.

I don't think we're arguing - we're discussing whether a proposed patch
is the right design solving relevant use cases.

I personally am not quite convinced about that, for the reasons I tried
to explain in my previous messages. I see it as a poor alternative to
compression built into the data type. I do like the idea of compression
with an external dictionary, however. But don't forget that it's not me
in this thread - it's my evil twin, moonlighting as Mr. Devil's
lawyer ;-)

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/13/2017 05:55 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> On 12/13/2017 01:54 AM, Robert Haas wrote:
>
>>> 3. Compression is only applied to large-ish values. If you are just
>>> making the data type representation more compact, you probably want
>>> to apply the new representation to all values. If you are compressing
>>> in the sense that the original data gets smaller but harder to
>>> interpret, then you probably only want to apply the technique where
>>> the value is already pretty wide, and maybe respect the user's
>>> configured storage attributes. TOAST knows about some of that kind
>>> of stuff.
>>
>> Good point. One such parameter that I really miss is compression
>> level. I can imagine tuning it through CREATE COMPRESSION METHOD, but
>> it does not seem quite possible with compression happening in a
>> datatype.
>
> Hmm, actually isn't that the sort of thing that you would tweak using
> a column-level option instead of a compression method?
>
>     ALTER TABLE ALTER COLUMN SET (compression_level=123)
>
> The only thing we need for this is to make tuptoaster.c aware of the
> need to check for a parameter.

Wouldn't that require some universal compression level, shared by all
supported compression algorithms? I don't think there is such a thing.

Defining it should not be extremely difficult, although I'm sure there
will be some cumbersome cases. For example, what if an algorithm "a"
supports compression levels 0-10, and algorithm "b" only supports 0-3?
You may define 11 "universal" compression levels, and map the four
levels for "b" to that (how?). But then everyone has to understand how
that "universal" mapping is defined.

Another issue is that there are algorithms without a compression level
(e.g. pglz does not have one, AFAICS), or with a somewhat different
definition (lz4 does not have levels, and instead has "acceleration",
which may be an arbitrary positive integer, so not really compatible
with a "universal" compression level).

So to me the

    ALTER TABLE ALTER COLUMN SET (compression_level=123)

seems more like an unnecessary hurdle ...

>>> I don't think TOAST needs to be entirely transparent for the
>>> datatypes. We've already dipped our toe in the water by allowing some
>>> operations on "short" varlenas, and there's really nothing to prevent
>>> a given datatype from going further. The OID problem you mentioned
>>> would presumably be solved by hard-coding the OIDs for any built-in,
>>> privileged compression methods.
>>
>> Stupid question, but what do you mean by "short" varlenas?
>
> Those are varlenas with a 1-byte header rather than the standard
> 4-byte header.

OK, that's what I thought. But that is still pretty transparent to the
data types, no?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
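[Editorial note: an illustrative sketch only (not part of any patch) of the "universal level" mapping discussed above - scaling a hypothetical 0-10 level onto an algorithm that only supports, say, 0-3. The rounding choice is arbitrary, which is exactly the objection: every user would have to know how the mapping is defined.]

```c
/* Map a hypothetical "universal" compression level in 0..10 onto an
 * algorithm-specific range [algo_min, algo_max], with round-to-nearest.
 * Nothing forces this particular scaling - a different implementation
 * could legitimately round differently, giving different behavior for
 * the same user-visible setting. */
static int
map_universal_level(int universal, int algo_min, int algo_max)
{
    return algo_min + (universal * (algo_max - algo_min) + 5) / 10;
}
```

For an algorithm with levels 0-3, universal levels 0 and 10 hit the endpoints, but everything in between depends on the (arbitrary) rounding convention.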
Re: Tracking of page changes for backup purposes. PTRACK [POC]
Hi, a couple of months ago there was proposal / patch with the similar goals, from Andrey Borodin. See these two threads [1] https://www.postgresql.org/message-id/flat/843D96CC-7C55-4296-ADE0-622A7ACD4978%40yesql.se#843d96cc-7c55-4296-ade0-622a7acd4...@yesql.se [2] https://www.postgresql.org/message-id/flat/449A7A9D-DB58-40F8-B80E-4C4EE7DB47FD%40yandex-team.ru#449a7a9d-db58-40f8-b80e-4c4ee7db4...@yandex-team.ru I recall there was a long discussion regarding which of the approaches is the *right* one (although that certainly depends on other factors). On 12/18/2017 11:18 AM, Anastasia Lubennikova wrote: > In this thread I would like to raise the issue of incremental backups. > What I suggest in this thread, is to choose one direction, so we can > concentrate our community efforts. > There is already a number of tools, which provide incremental backup. > And we can see five principle techniques they use: > > 1. Use file modification time as a marker that the file has changed. > 2. Compute file checksums and compare them. > 3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN. > 4. Scan all WAL files in the archive since the previous backup and > collect information about changed pages. > 5. Track page changes on the fly. (ptrack) > > They can also be combined to achieve better performance. > > My personal candidate is the last one, since it provides page-level > granularity, while most of the others approaches can only do file-level > incremental backups or require additional reads or calculations. > I share the opinion that options 1 and 2 are not particularly attractive, due to either unreliability, or not really saving that much CPU and I/O. I'm not quite sure about 3, because it doesn't really explain how would it be done - it seems to assume we'd have to reread the files. I'll get back to this. Option 4 has some very interesting features. 
Firstly, it relies on WAL and so should not require any new code (and it could, in theory, support even older PostgreSQL releases, for example). Secondly, this can be offloaded to a different machine. And it even supports additional workflows - e.g. "given these two full backups and the WAL, generate an incremental backup between them". So I'm somewhat hesitant to proclaim option 5 as the clear winner here. > In a nutshell, using ptrack patch, PostgreSQL can track page changes on > the fly. Each time a relation page is updated, this page is marked in a > special PTRACK bitmap fork for this relation. As one page requires just > one bit in the PTRACK fork, such bitmaps are quite small. Tracking > implies some minor overhead on the database server operation but speeds > up incremental backups significantly. > That makes sense, I guess, although I find the "ptrack" acronym somewhat cryptic, and we should probably look for something more descriptive. But the naming can wait, I guess. My main question is if a bitmap is the right data type. It seems to cause a lot of complexity later, because it needs to be reset once in a while, you have to be careful about failed incremental backups etc. What if we tracked the LSN for each page instead? Sure, it'd require 64x more space (1 bit -> 8 bytes per page), but it would not require resets, you could take incremental backups from an arbitrary point in time, and so on. That seems like a significant improvement to me, so perhaps the space requirements are justified (still just 1MB for 1GB segment, with the default 8kB pages). > Detailed overview of the implementation with all pros and cons, > patches and links to the related threads you can find here: > > https://wiki.postgresql.org/index.php?title=PTRACK_incremental_backups. > > Patches for v 10.1 and v 9.6 are attached. > Since ptrack is basically just an API for use in backup tools, it is > impossible to test the patch independently. 
> Now it is integrated with our backup utility, called pg_probackup. You can > find it here: https://github.com/postgrespro/pg_probackup > Let me know if you find the documentation too long and complicated, I'll > write a brief How-to for ptrack backups. > > Spoiler: Please consider this patch and README as a proof of concept. It > can be improved in some ways, but in its current state PTRACK is a > stable prototype, reviewed and tested well enough to find many > non-trivial corner cases and subtle problems. And any discussion of > change track algorithm must be aware of them. Feel free to share your > concerns and point out any shortcomings of the idea or the implementation. > Thanks for the proposal and patch! regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
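As a quick sanity check of the space math for the per-page LSN idea (a toy sketch, not code from any of the patches): with the default 8kB pages a 1GB segment has 131072 pages, so one 8-byte LSN per page costs exactly 1MB per segment, and picking pages for an incremental backup becomes a simple filter:

```python
PAGE_SIZE = 8 * 1024        # default BLCKSZ
SEGMENT_SIZE = 1024 ** 3    # 1GB relation segment

pages_per_segment = SEGMENT_SIZE // PAGE_SIZE   # 131072 pages
lsn_map_bytes = pages_per_segment * 8           # one 8-byte LSN per page

def pages_to_backup(lsn_map, last_backup_lsn):
    """Pages changed since the previous backup: those whose recorded
    LSN is >= the LSN of that backup.  Nothing to reset after a backup,
    and any older LSN works as a starting point just as well."""
    return [pageno for pageno, lsn in enumerate(lsn_map)
            if lsn >= last_backup_lsn]
```

That is the "no resets, arbitrary starting point" property argued for above, at the cost of 1MB rather than 16kB of tracking data per 1GB segment.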
Re: WIP: BRIN multi-range indexes
On 12/19/2017 08:38 PM, Mark Dilger wrote: > >> On Nov 18, 2017, at 12:45 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> >> wrote: >> >> Hi, >> >> Apparently there was some minor breakage due to duplicate OIDs, so here >> is the patch series updated to current master. >> >> regards >> >> -- >> Tomas Vondra http://www.2ndQuadrant.com >> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >> <0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz> > > > After applying these four patches to my copy of master, the regression > tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached. > D'oh! There was an incorrect OID referenced in pg_opclass, which was also used by the satisfies_hash_partition() function. Fixed patches attached. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz Description: application/gzip 0002-BRIN-bloom-indexes.patch.gz Description: application/gzip 0003-BRIN-multi-range-minmax-indexes.patch.gz Description: application/gzip 0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz Description: application/gzip
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Hi, On 12/19/2017 08:17 PM, Mark Dilger wrote: > > I tested your latest patches on my mac os x laptop and got one test > failure due to the results of 'explain' coming up differently. For the > record, > I followed these steps: > > cd postgresql/ > git pull > # this got my directory up to 8526bcb2df76d5171b4f4d6dc7a97560a73a5eff with > no local changes > patch -p 1 < ../0001-multivariate-MCV-lists.patch > patch -p 1 < ../0002-multivariate-histograms.patch > ./configure --prefix=/Users/mark/master/testinstall --enable-cassert > --enable-tap-tests --enable-depend && make -j4 && make check-world > Yeah, those steps sound about right. Apparently this got broken by ecc27d55f4, although I don't quite understand why - it worked fine before that commit. Can you check whether it works fine on 9f4992e2a9 and fails with ecc27d55f4? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: CUBE seems a bit confused about ORDER BY
On 12/12/2017 01:52 PM, Alexander Korotkov wrote: > On Tue, Dec 12, 2017 at 3:49 PM, Teodor Sigaev <teo...@sigaev.ru > <mailto:teo...@sigaev.ru>> wrote: > > Yes, the thing is that we change behavior of existing ~> > operator. In general, this is not good idea because it could > affect existing users whose already use this operator. > Typically in such situation, we could leave existing operator as > is, and invent new operator with new behavior. However, in this > particular case, I think there are reasons to make an exception > to the rules. The reasons are following: > 1) The ~> operator was designed especially for knn gist. > 2) Knn gist support for current behavior is broken by design and > can't be fixed. Most we can do to fix existing ~> operator > behavior as is to drop knn gist support. But then ~> operator > would be left useless. > 3) It doesn't seems that ~> operator have many users yet, > because an error wasn't reported during whole release cycle. > > I hope these reasons justify altering behavior of existing > operator as an exception to the rules. Another question is > whether we should backpatch it. But I think we could leave this > decision to committer. > > I think that this patch is ready for committer. > > I'm agree with changing behavior of existing ~> operator, but is any > objection here? Current implementation is not fixable and I hope > that users which use this operator will understand why we change it. > Fortunately, the fix doesn't require changes in system catalog. > > The single question here is about index over expression with this > operator, they will need to reindex, which should be noted in > release notes. > > > Yes. I bet only few users have built indexes over ~> operator if any. > Ask them to reindex in the release notes seems OK for me. > Is there a good way to detect such cases? Either in pg_upgrade, so that we can print warnings, or at least manually (which would be suitable for release notes). 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/12/2017 10:33 PM, Robert Haas wrote: > On Mon, Dec 11, 2017 at 2:53 PM, Tomas Vondra > <tomas.von...@2ndquadrant.com> wrote: >> But let me play the devil's advocate for a while and question the >> usefulness of this approach to compression. Some of the questions were >> mentioned in the thread before, but I don't think they got the attention >> they deserve. > > Sure, thanks for chiming in. I think it is good to make sure we are > discussing this stuff. > >> But perhaps we should simply make it an initdb option (in which case the >> whole cluster would simply use e.g. lz4 instead of pglz)? >> >> That seems like a much simpler approach - it would only require some >> ./configure options to add --with-lz4 (and other compression libraries), >> an initdb option to pick compression algorithm, and probably noting the >> choice in cluster controldata. >> >> No dependencies tracking, no ALTER TABLE issues, etc. >> >> Of course, it would not allow using different compression algorithms for >> different columns (although it might perhaps allow different compression >> level, to some extent). >> >> Conclusion: If we want to offer a simple cluster-wide pglz alternative, >> perhaps this patch is not the right way to do that. > > I actually disagree with your conclusion here. I mean, if you do it > that way, then it has the same problem as checksums: changing > compression algorithms requires a full dump-and-reload of the > database, which makes it more or less a non-starter for large > databases. On the other hand, with the infrastructure provided by > this patch, we can have a default_compression_method GUC that will be > set to 'pglz' initially. If the user changes it to 'lz4', or we ship > a new release where the new default is 'lz4', then new tables created > will use that new setting, but the existing stuff keeps working. 
If > you want to upgrade your existing tables to use lz4 rather than pglz, > you can change the compression option for those columns to COMPRESS > lz4 PRESERVE pglz if you want to do it incrementally or just COMPRESS > lz4 to force a rewrite of an individual table. That's really > powerful, and I think users will like it a lot. > > In short, your approach, while perhaps a little simpler to code, seems > like it is fraught with operational problems which this design avoids. > I agree the checksum-like limitations are annoying and make it impossible to change the compression algorithm after the cluster is initialized (although I recall a discussion about addressing that). So yeah, if such flexibility is considered valuable/important, then the patch is a better solution. >> Custom datatype-aware compression (e.g. the tsvector) >> -- >> >> Exploiting knowledge of the internal data type structure is a promising >> way to improve compression ratio and/or performance. >> >> The obvious question of course is why shouldn't this be done by the data >> type code directly, which would also allow additional benefits like >> operating directly on the compressed values. >> >> Another thing is that if the datatype representation changes in some >> way, the compression method has to change too. So it's tightly coupled >> to the datatype anyway. >> >> This does not really require any new infrastructure, all the pieces are >> already there. >> >> In some cases that may not be quite possible - the datatype may not be >> flexible enough to support alternative (compressed) representation, e.g. >> because there are no bits available for "compressed" flag, etc. >> >> Conclusion: IMHO if we want to exploit the knowledge of the data type >> internal structure, perhaps doing that in the datatype code directly >> would be a better choice. > > I definitely think there's a place for compression built right into > the data type. 
I'm still happy about commit > 145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more > needs to be done there. But that type of improvement and what is > proposed here are basically orthogonal. Having either one is good; > having both is better. > Why orthogonal? For example, why couldn't (or shouldn't) the tsvector compression be done by tsvector code itself? Why should we be doing that at the varlena level (so that the tsvector code does not even know about it)? For example we could make the datatype EXTERNAL and do the compression on our own, using a custom algorithm. Of course, that would require datatype-specific implementation, but tsvector_compress does that too. It seems to me the main reason is that tsvector
Re: spgist rangetypes compiler warning (gcc 7.2.0)
On 11/18/2017 10:40 PM, Tom Lane wrote: > Tomas Vondra <tomas.von...@2ndquadrant.com> writes: >> while compiling on gcc 7.2.0 (on ARM), I got this warning: > >> rangetypes_spgist.c: In function 'spg_range_quad_inner_consistent': >> rangetypes_spgist.c:559:29: warning: comparison between pointer and >> zero character constant [-Wpointer-compare] >> if (in->traversalValue != (Datum) 0) >> ^~ > > Huh. I wonder why 7.2.1 on Fedora isn't producing that warning. > Not sure. Maybe Ubuntu uses different flags (on ARM)? This is what I get from 'gcc -v' on the machine: Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/7/lto-wrapper Target: arm-linux-gnueabihf Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 7.2.0-8ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=arm-linux-gnueabihf- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf Thread model: posix gcc version 7.2.0 (Ubuntu/Linaro 7.2.0-8ubuntu3) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means
Hi, On 11/02/2017 08:15 PM, Tom Lane wrote: > Haribabu Kommi <kommi.harib...@gmail.com> writes: >> The changes are fine and now it reports proper live tuples in both >> pg_class and stats. The other issue of continuous vacuum operation >> leading to decrease of number of live tuples is not related to this >> patch and that can be handled separately. > > I did not like this patch too much, because it did nothing to fix > the underlying problem of confusion between "live tuples" and "total > tuples"; in fact, it made that worse, because now the comment on > LVRelStats.new_rel_tuples is a lie. And there's at least one usage > of that field value where I think we do want total tuples not live > tuples: where we pass it to index AM cleanup functions. Indexes > don't really care whether heap entries are live or not, only that > they're there. Plus the VACUUM VERBOSE log message that uses the > field is supposed to be reporting total tuples not live tuples. > > I hacked up the patch trying to make that better, finding along the > way that contrib/pgstattuple shared in the confusion about what > it was trying to count. Results attached. > Thanks for looking into this. I agree your patch is more consistent and generally cleaner. > However, I'm not sure we're there yet, because there remains a fairly > nasty discrepancy even once we've gotten everyone onto the same page > about reltuples counting just live tuples: VACUUM and ANALYZE have > different definitions of what's "live". In particular they do not treat > INSERT_IN_PROGRESS and DELETE_IN_PROGRESS tuples the same. Should we > try to do something about that? If so, what? It looks like ANALYZE's > behavior is pretty tightly constrained, judging by the comments in > acquire_sample_rows. 
> Yeah :-( For the record (and people following this thread who are too lazy to open analyze.c and search for the comments), the problem is that acquire_sample_rows does not count HEAPTUPLE_INSERT_IN_PROGRESS as live (and HEAPTUPLE_DELETE_IN_PROGRESS as dead) as it assumes the transaction will send the stats at the end. Which is a correct assumption, but it means that when there is a long-running transaction that inserted/deleted many rows, the reltuples value will oscillate during VACUUM / ANALYZE runs. ISTM we need to unify those definitions, probably so that VACUUM adopts what acquire_sample_rows does. I mean, if ANALYZE assumes that the stats will be updated at the end of the transaction, why shouldn't VACUUM do the same thing? The one reason I can think of is that VACUUM is generally expected to run longer than ANALYZE, so the assumption that large transactions complete later is not quite reliable. Or is there some other reason why VACUUM can't treat the in-progress tuples the same way? > Another problem is that it looks like CREATE INDEX will set reltuples > to the total number of heap entries it chose to index, because that > is what IndexBuildHeapScan counts. Maybe we should adjust that? > You mean by only counting live tuples in IndexBuildHeapRangeScan, following whatever definition we end up using in VACUUM/ANALYZE? Seems like a good idea I guess, although CREATE INDEX is not as frequent as the other commands. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
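To make the oscillation concrete, here is a toy model of the ANALYZE-style counting rule (my reading of the acquire_sample_rows comments described above, not actual server code):

```python
# Tuple states as seen by a scan (toy labels, not the real enum values).
LIVE, INSERT_IN_PROGRESS, DELETE_IN_PROGRESS, DEAD = range(4)

def analyze_live_count(tuples):
    """ANALYZE-style rule: in-progress inserts are not counted as live
    and in-progress deletes are not counted as dead, because the owning
    transaction is expected to report its changes when it commits."""
    return sum(1 for t in tuples if t in (LIVE, DELETE_IN_PROGRESS))

# 1000 committed rows, plus 500 inserted by a still-open transaction:
# reltuples says 1000 now, and jumps to 1500 only after the commit.
before_commit = [LIVE] * 1000 + [INSERT_IN_PROGRESS] * 500
after_commit = [LIVE] * 1500
```

A counting rule that included the in-progress inserts would report 1500 in both snapshots - which is exactly the VACUUM-vs-ANALYZE discrepancy discussed above, amplified by long-running transactions.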
Re: percentile value check can be slow
Hi, On 11/18/2017 10:30 PM, David Fetter wrote: > On Sat, Nov 18, 2017 at 10:44:36AM -0500, Tom Lane wrote: >> jotpe <jo...@posteo.de> writes: >>> I tried to enter invalid percentile fractions, and was astonished >>> that it seems to be checked after many work is done? >> >> IIRC, only the aggregate's final-function is concerned with direct >> arguments, so it's not all that astonishing. > > It may not be surprising from the point of view of a systems > programmer, but it's pretty surprising that this check is deferred to > many seconds in when the system has all the information it need in > order to establish this before execution begins. > > I'm not sure I see an easy way to do this check early, but it's worth > trying on grounds of POLA violation. I have a couple of ideas on how > to do this, one less invasive but hinky, the other a lot more invasive > but better overall. > > Ugly Hack With Ugly Performance Consequences: > Inject a subtransaction at the start of execution that supplies an > empty input to the final function with the supplied direct > arguments. > I'm pretty sure you realize this is quite unlikely to get accepted. > Bigger Lift: > Require a separate recognizer function direct arguments and fire > it during post-parse analysis. Perhaps this could be called > recognizer along with the corresponding mrecognizer. It's not > clear that it's sane to have separate ones, but I thought I'd > bring it up for completeness. > Is 'recognizer' an established definition I should know? Is it the same as 'validator' or is it something new/different? > Way Bigger Lift, As Far As I Can Tell, But More Fun For Users: > Allow optional CHECK constraints in CREATE AGGREGATE for direct > arguments. > How will any of the approaches deal with something like select percentile_cont((select array_agg(v) from p)) within group (order by a) from t; In this case the values are unknown after the parse analysis, so I guess it does not really address that. 
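For illustration, the check itself is trivial whenever the direct argument is a constant known before execution - something along these lines (a hypothetical recognizer-style sketch, the name is made up):

```python
def check_fractions(fractions):
    """Hypothetical recognizer-style check: every requested percentile
    fraction must fall in [0, 1].  Cheap when the direct argument is a
    constant known at parse/plan time; with a subquery argument the
    values simply don't exist until execution."""
    bad = [f for f in fractions if not (0.0 <= f <= 1.0)]
    if bad:
        raise ValueError("percentile fractions out of range: %s" % bad)
    return True
```

The hard part is not the check but where to hook it in, and the subquery case above is precisely where no early hook can help.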
FWIW while I won't stand in the way of improving this, I wonder if this is really worth the additional complexity. If you get errors like this with a static list of values, you will fix the list and you're done. If the list is dynamic (computed in the query itself), you'll still get the error much later during query execution. So if you're getting many failures like this for the "delayed error reporting" to be an issue, perhaps there's something wrong in your stack and you should address that instead? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] new function for tsquery creartion
Hi, On 09/13/2017 10:57 AM, Victor Drobny wrote: > On 2017-09-09 06:03, Thomas Munro wrote: >> Please send a rebased version of the patch for people to review and >> test as that one has bit-rotted. > Hello, > Thank you for interest. In the attachment you can find rebased > version(based on 69835bc8988812c960f4ed5aeee86b62ac73602a commit) > I did a quick review of the patch today. The patch unfortunately no longer applies, so I had to use an older commit from September. Please rebase to current master. I've only looked at the diff at this point, will do more testing once the rebase happens. Some comments: 1) This seems to mix multiple improvements into one patch. There's the support for alternative query syntax, and then there are the new operators (AROUND and <m,n>). I propose to split the patch into two or more parts, each addressing one of those bits. I guess there will be two or three parts - first adding the syntax, second adding <m,n> and third adding the AROUND(n). Seems reasonable? 2) I don't think we should mention Google in the docs explicitly. Not that I'm somehow anti-google, but this syntax was certainly not invented by Google - I vividly remember using something like that on Altavista (yeah, I'm old). And it's used by pretty much every other web search engine out there ... 3) In the SGML docs, please use <literal> markup instead of just quoting the values. So it should be <literal>|</literal> instead of '|' etc. Just like in the parts describing plainto_tsquery, for example. 4) Also, I recommend adding a brief explanation of what the examples do. Right now there's just a bunch of queryto_tsquery calls, and the reader is expected to understand the output. I suggest adding a sentence or two, explaining what's happening (just like for plainto_tsquery examples). 5) I'm not sure about negative values in the <n,m> operator. 
I don't find it particularly confusing - once you understand that (a <n,m> b) means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then negative values seem like a fairly straightforward extension. But I guess the main question is "Is there really a demand for the new <n,m> operator, or have we just invented it because we can?" 6) There seem to be some new constants defined, with names not following the naming convention. I mean these:

#define WAITOPERAND         1
#define WAITOPERATOR        2
#define WAITFIRSTOPERAND    3
#define WAITSINGLEOPERAND   4
#define INSIDEQUOTES        5    <-- the new one

and

#define TSPO_L_ONLY    0x01
#define TSPO_R_ONLY    0x02
#define TSPO_BOTH      0x04
#define TS_NOT_EXAC    0x08    <-- the new one

Perhaps that's OK, but it seems a bit inconsistent. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
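To spell out the reading of (a <n,m> b) used above - "there are k words between a and b, with n <= k <= m" - here is a small sketch over word positions (my interpretation of the proposal, not the patch's implementation):

```python
def phrase_match(pos_a, pos_b, n, m):
    """True if some occurrence of 'a' and some occurrence of 'b'
    (given as word positions) have k words between them, n <= k <= m.
    Adjacent words give k = 0; a negative k means 'b' occurs at or
    before 'a', i.e. the extension discussed above."""
    return any(n <= (pb - pa - 1) <= m
               for pa in pos_a for pb in pos_b)

# "quick brown fox": quick at position 1, fox at position 3 -> k = 1
```

Under this reading, <0,0> is ordinary phrase adjacency, and negative bounds merely let 'b' precede 'a' - which is why I don't find them confusing, only possibly unnecessary.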
Re: [HACKERS] pg_basebackup --progress output for batch execution
On 11/10/2017 02:32 PM, Martín Marqués wrote: > Hi, > > Thanks for having a look at this patch. > > 2017-11-09 20:55 GMT-03:00 Jeff Janes <jeff.ja...@gmail.com>: >> On Fri, Sep 29, 2017 at 4:00 PM, Martin Marques >> <martin.marq...@2ndquadrant.com> wrote: >>> >>> Hi, >>> >>> Some time ago I had to work on a system where I was cloning a standby >>> using pg_basebackup, that didn't have screen or tmux. For that reason I >>> redirected the output to a file and ran it with nohup. >>> >>> I normally (always actually ;) ) run pg_basebackup with --progress and >>> --verbose so I can follow how much has been done. When done on a tty you >>> get a nice progress bar with the percentage that has been cloned. >>> >>> The problem came with the execution and redirection of the output, as >>> the --progress option will write a *very* long line! >>> >>> Back then I thought of writing a patch (actually someone suggested I do >>> so) to add a --batch-mode option which would change the behavior >>> pg_basebackup has when printing the output messages. >> >> >> >> While separate lines in the output file is better than one very long line, >> it still doesn't seem so useful. If you aren't watching it in real time, >> then you really need to have a timestamp on each line so that you can >> interpret the output. The lines are about one second apart, but I don't >> know robust that timing is under all conditions. > > I kind of disagree with your view here. > > If the cloning process takes many hours to complete (in my case, it > was around 12 hours IIRC) you might want to peak at the log every now > and then with tail. > > I do agree on adding a timestamp prefix to each line, as it's not > clear from the code how often progress_report() is called. > >> I think I agree with Arthur that I'd rather have the decision made by >> inspecting whether output is going to a tty, rather than by adding another >> command line option. 
But maybe that is not detected robustly enough across >> all platforms and circumstances? > In this case, Arthur's idea is good, but would make the patch less > generic/flexible for the end user. > I'm not quite sure about that. We're not getting the flexibility for free, but for additional complexity (additional command-line option). And what valuable capability would we (or the users) lose, really, by relying on the isatty call? So what use case is possible with --batch-mode but not with isatty (e.g. when combined with tee)? > > That's why I tried to reproduce what top does when executed with -b > (Batch mode operation). There, it's the end user who decides how the > output is formatted (well, saying it decides on formatting a bit of > an overstatement, but you get it ;) ) > In 'top' a batch mode also disables input, and runs for a certain number of iterations (as determined by the '-n' option). That's not the case in pg_basebackup, where it only adds the extra newline. > > An example where using isatty() might fail is if you run > pg_basebackup from a tty but redirect the output to a file, I > believe that in that case isatty() will return true, but it's very > likely that the user might want batch mode output. > IMHO that should work correctly, as already explained by Arthur. > > But maybe we should also add Arthurs idea anyway (when not in batch > mode), as it seems pretty lame to output progress in one line if you > are not in a tty. > I'd say to just use isatty, unless we can come up with a compelling use case that is not satisfied by it. And if we end up using --batch-mode, perhaps we should only allow it when --progress is used. It's the only thing it affects, and it makes no difference when used without it. Furthermore, I propose to reword this: Runs pg_basebackup in batch mode. This is useful if the output is to be pipped so the other end of the pipe reads each line. 
Using this option with --progress will result in printing each progress output with a newline at the end, instead of a carrige return. like this: Runs pg_basebackup in batch mode, in which --progress terminates the lines with a regular newline instead of carriage return. This is useful if the output is redirected to a file or a pipe, instead of a plain console. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
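The isatty-based behaviour boils down to a one-line decision; a sketch of the idea (in Python for brevity - pg_basebackup would of course do this in its C code):

```python
import sys

def progress_terminator(to_tty: bool) -> str:
    """On a terminal, end each progress report with a carriage return,
    so the line keeps overwriting itself in place; when redirected to
    a file or pipe, end it with a newline so every report becomes a
    separate, greppable line."""
    return "\r" if to_tty else "\n"

def report_progress(done_kb: int, total_kb: int, stream=sys.stderr):
    # One progress line per call; the terminator depends on where the
    # output is going, detected via isatty() rather than a new option.
    pct = 100 * done_kb // total_kb
    stream.write("%d/%d kB (%d%%)%s"
                 % (done_kb, total_kb, pct,
                    progress_terminator(stream.isatty())))
```

That is the whole difference a --batch-mode flag would control, which is why detecting the tty automatically seems sufficient.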
Re: WIP: BRIN multi-range indexes
Hi, Apparently there was some minor breakage due to duplicate OIDs, so here is the patch series updated to current master. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz Description: application/gzip 0002-BRIN-bloom-indexes.patch.gz Description: application/gzip 0003-BRIN-multi-range-minmax-indexes.patch.gz Description: application/gzip 0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz Description: application/gzip
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Hi, Attached is an updated version of the patch, adopting the psql describe changes introduced by 471d55859c11b. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-multivariate-MCV-lists.patch.gz Description: application/gzip 0002-multivariate-histograms.patch.gz Description: application/gzip
Re: [HACKERS] Custom compression methods
Hi, On 11/14/2017 02:23 PM, Ildus Kurbangaliev wrote: > > ... > > Attached version 4 of the patch. Fixed pg_upgrade and few other bugs. > I did a review of this today, and I think there are some things that need improvement / fixing. Firstly, some basic comments from just eye-balling the diff, then some bugs I discovered after writing an extension adding lz4. 1) formatRelOptions/freeRelOptions are no longer needed (I see Ildar already pointed that out) 2) There's unnecessary whitespace (extra newlines) in a couple of places, which is needlessly increasing the size of the patch. Small difference, but annoying. 3) tuptoaster.c Why do you change 'info' from int32 to uint32? Seems unnecessary. Adding a new 'att' variable in toast_insert_or_update is confusing, as there already is 'att' in the very next loop. Technically it's correct, but I'd bet it'll lead to some WTF?! moments later. I propose to just use TupleDescAttr(tupleDesc,i) in the two places where it matters, around line 808. There are no comments for init_compression_options_htab and get_compression_options_info, so that needs to be fixed. Moreover, the names are confusing because what we really get is not just 'options' but the compression routines too. 4) gen_db_file_maps probably shouldn't do the fprints, right? 5) not sure why you modify src/tools/pgindent/exclude_file_patterns 6) I'm rather confused by AttributeCompression vs. ColumnCompression. I mean, attribute==column, right? Of course, one is for data from the parser, the other one is for internal info. But can we make the naming clearer? 7) The docs in general are somewhat unsatisfactory, TBH. For example the ColumnCompression has no comments, unlike everything else in parsenodes. Similarly for the SGML docs - I suggest to expand them to resemble FDW docs (https://www.postgresql.org/docs/10/static/fdwhandler.html) which also follows the handler/routines pattern. 8) One of the unclear things is why we even need a 'drop' routine. 
It seems that if it's defined, DropAttributeCompression does something. But what should it do? I suppose dropping the options should be done using dependencies (just like we drop columns in this case). BTW why does DropAttributeCompression mess with att->attisdropped in this way? That seems a bit odd. 9) configure routines that only check if (options != NIL) and then error out (like tsvector_configure) seem a bit unnecessary. Just allow it to be NULL in CompressionMethodRoutine, and throw an error if options is not NIL for such a compression method. 10) toast_compress_datum still does this:

if (!ac && (valsize < PGLZ_strategy_default->min_input_size ||
            valsize > PGLZ_strategy_default->max_input_size))

which seems rather pglz-specific (the naming is a hint). Why shouldn't this be specific to compression, exposed either as min/max constants, or wrapped in another routine - size_is_valid() or something like that? 11) The comments in toast_compress_datum probably need updating, as it still references pglz specifically. I guess the new compression methods do matter too. 12) get_compression_options_info organizes the compression info into a hash table by OID. The hash table implementation assumes the hash key is at the beginning of the entry, but AttributeCompression is defined like this:

typedef struct
{
    CompressionMethodRoutine *routine;
    List       *options;
    Oid         cmoptoid;
} AttributeCompression;

Which means get_compression_options_info is busted, will never look up anything, and the hash table will grow by adding more and more entries into the same bucket. Of course, this has an extremely negative impact on performance (pretty much arbitrarily bad, depending on how many entries you've already added to the hash table). Moving the OID to the beginning of the struct fixes the issue. 13) When writing the experimental extension, I was extremely confused about the regular varlena headers, custom compression headers, etc. 
In the end I stole the code from tsvector.c and whacked it a bit until it worked, but I wouldn't dare to claim I understand how it works. This needs to be documented somewhere. For example postgres.h has a bunch of paragraphs about varlena headers, so perhaps it should be there? I see the patch tweaks some of the constants, but does not update the comment at all. Perhaps it would be useful to provide some additional macros making access to custom-compressed varlena values easier. Or perhaps the VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already support that? This part is not very clear to me. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services pg_lz4.tgz Description: application/compressed-tar
Re: percentile value check can be slow
Hi,

On 11/19/2017 03:10 AM, David Fetter wrote:
> On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> Is 'recognizer' an established definition I should know? Is it the same
>> as 'validator' or is it something new/different?
>
> I borrowed it from http://langsec.org/
>
> I'm not entirely sure what you mean by a validator, but a recognizer
> is something that gives a quick and sure read as to whether the input
> is well-formed. In general, it's along the lines of a tokenizer, a
> parser, and something that does very light post-parse analysis for
> correctness of form.
>
> For the case that started the thread, a recognizer would check
> something along the lines of
>
> CHECK('[0,1]' @> ALL(input_array))
>

OK, thanks. From what I understand, a recognizer is more about recognizing whether a string is valid within a given formal language (essentially, whether it's a well-formed program). That may not be the right term for checks on parameter values.

OTOH we already have "validators" in a number of places - functions checking various parameters, e.g. reloptions for FDWs, etc. But I guess the naming can be solved later ...

>>> Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
>>> Allow optional CHECK constraints in CREATE AGGREGATE for direct
>>> arguments.
>>>
>>
>> How will any of the approaches deal with something like
>>
>> select percentile_cont((select array_agg(v) from p))
>>    within group (order by a) from t;
>>
>> In this case the values are unknown after the parse analysis, so I
>> guess it does not really address that.
>
> It doesn't. Does it make sense to do a one-shot execution for cases
> like that? It might well be worth it to do the aggregate once in
> advance as a throw-away if the query execution time is already going
> to take awhile. Of course, you can break that one by making p a JOIN
> to yet another thing...
>
>> FWIW while I won't stand in the way of improving this, I wonder if this
>> is really worth the additional complexity. If you get errors like this
>> with a static list of values, you will fix the list and you're done. If
>> the list is dynamic (computed in the query itself), you'll still get the
>> error much later during query execution.
>>
>> So if you're getting many failures like this for the "delayed error
>> reporting" to be an issue, perhaps there's something wrong in your stack
>> and you should address that instead?
>
> I'd like to think that getting something to fail quickly and cheaply
> when it can will give our end users a better experience. Here,
> "cheaply" refers to their computing resources and time.

The trouble is, this increases execution time for everyone, including people who carefully construct the parameter values. That seems rather undesirable.

>
> Clearly, not having this happen in this case bothered Johannes
> enough to wade in here.
>

No. He was surprised the error is reported after a significant amount of time, but he does not seem to claim that failing faster would be valuable to him. That is your assumption, and I have my doubts about it.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] CUBE seems a bit confused about ORDER BY
On 10/30/2017 03:04 PM, Alexander Korotkov wrote: > On Sun, Oct 29, 2017 at 1:30 AM, Alexander Korotkov > <a.korot...@postgrespro.ru <mailto:a.korot...@postgrespro.ru>> wrote: ... > > Thus, I heard no objection to my approach. Attached patch changes ~> > operator in the proposed way and fixes KNN-GiST search for that. I'm > going to register this patch to the nearest commitfest. > Seems fine to me, although perhaps it should be split into two parts. One with the cube_coord_llur fixes, and then g_cube_distance changes adding support for negative coordinates. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] GnuTLS support
Hi,

On 11/02/2017 11:33 PM, Andreas Karlsson wrote:
> On 09/18/2017 07:04 PM, Jeff Janes wrote:> You fixed the first issue,
> but I still get the second one:
>>
>> be-secure-gnutls.c: In function 'get_peer_certificate':
>> be-secure-gnutls.c:667: error: 'GNUTLS_X509_CRT_LIST_SORT' undeclared
>> (first use in this function)
>> be-secure-gnutls.c:667: error: (Each undeclared identifier is reported
>> only once
>> be-secure-gnutls.c:667: error: for each function it appears in.)
>
> Thanks again for testing the code. I have now rebased the patch and
> fixed the second issue. I tested that it works on CentOS 6.
>
> Work which remains:
>
> - sslinfo
> - pgcrypto
> - Documentation
> - Decide if what I did with the config is a good idea
>

I don't want to be the annoying guy, but this patch no longer applies due to 642bafa0c5f9f08d106a14f31429e0e0c718b603 touching the tests :-(

BTW regarding the config, I believe you've kept it static (no hiding of sections based on configure parameters), and you've separated the openssl and gnutls options, right? Seems fine to me. The one thing is that it was proposed to rename the openssl-specific options to start with openssl_ instead of ssl_.

One question though. What happens when you do

    ./configure --with-openssl --with-gnutls

If I get it right, we ignore gnutls and use openssl (as it's the first one checked in the #ifdefs). Shouldn't we enforce in configure that only one TLS implementation is enabled? Either by some elaborate check, or by switching to something like

    --with-ssl=(openssl|gnutls)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 11/15/2017 02:13 PM, Robert Haas wrote: > On Wed, Nov 15, 2017 at 4:09 AM, Ildus Kurbangaliev > <i.kurbangal...@postgrespro.ru> wrote: >> So in the next version of the patch I can just unlink the options from >> compression methods and dropping compression method will not affect >> already compressed tuples. They still could be decompressed. > > I guess I don't understand how that can work. I mean, if somebody > removes a compression method - i.e. uninstalls the library - and you > don't have a way to make sure there are no tuples that can only be > uncompressed by that library - then you've broken the database. > Ideally, there should be a way to add a new compression method via an > extension ... and then get rid of it and all dependencies thereupon. > I share your confusion. Once you do DROP COMPRESSION METHOD, there must be no remaining data compressed with it. But that's what the patch is doing already - it enforces this using dependencies, as usual. Ildus, can you explain what you meant? How could the data still be decompressed after DROP COMPRESSION METHOD, and possibly after removing the .so library? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 11/21/2017 09:28 PM, Ildus K wrote: >> Hmmm, it still doesn't work for me. See this: >> >> test=# create extension pg_lz4 ; >> CREATE EXTENSION >> test=# create table t_lz4 (v text compressed lz4); >> CREATE TABLE >> test=# create table t_pglz (v text); >> CREATE TABLE >> test=# insert into t_lz4 select repeat(md5(1::text),300); >> INSERT 0 1 >> test=# insert into t_pglz select * from t_lz4; >> INSERT 0 1 >> test=# drop extension pg_lz4 cascade; >> NOTICE: drop cascades to 2 other objects >> DETAIL: drop cascades to compression options for lz4 >> drop cascades to table t_lz4 column v >> DROP EXTENSION >> test=# \c test >> You are now connected to database "test" as user "user". >> test=# insert into t_lz4 select repeat(md5(1::text),300);^C >> test=# select * from t_pglz ; >> ERROR: cache lookup failed for compression options 16419 >> >> That suggests no recompression happened. > > I will check that. Is your extension published somewhere? > No, it was just an experiment, so I've only attached it to the initial review. Attached is an updated version, with a fix or two. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services pg_lz4.tgz Description: application/compressed-tar
Re: [HACKERS] Custom compression methods
On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
>
>>> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
>>> tsvector_compression_handler;
>>
>> Understood. Good to know you've considered it, and I agree it doesn't
>> need to be there from the start (which makes the patch simpler).
>
> Just passing by, but wouldn't this fit in the ACCESS METHOD group of
> commands? So this could be simplified down to
>
> CREATE ACCESS METHOD ts1 TYPE COMPRESSION
>
> we have that for indexes and there are patches flying for heap storage,
> sequences, etc. I think that's simpler than trying to invent all new
> commands here. Then (in a future patch) you can use ALTER TYPE to
> define compression for that type, or even add a column-level option to
> reference a specific compression method.
>

I think that would conflate two very different concepts. In my mind, access methods define how rows are stored. Compression methods are an orthogonal concept, e.g. you can compress a value (using a custom compression algorithm) and store it in an index (using whatever access method it's using).

So not only do access methods operate on rows (while compression operates on varlena values), but you can also combine those two things together. I don't see how you could do that if both are defined as "access methods" ...

Furthermore, the "TYPE" in CREATE COMPRESSION METHOD was meant to restrict the compression algorithm to a particular data type (so, if it relies on tsvector, you can't apply it to text columns). Which is very different from "TYPE COMPRESSION" in CREATE ACCESS METHOD.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/01/2017 10:52 PM, Andres Freund wrote:
> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>> Honestly, if we can give everybody a 4% space reduction by
>> switching to lz4, I think that's totally worth doing -- but let's
>> not make people choose it, let's make it the default going forward,
>> and keep pglz support around so we don't break pg_upgrade
>> compatibility (and so people can continue to choose it if for some
>> reason it works better in their use case). That kind of improvement
>> is nothing special in a specific workload, but TOAST is a pretty
>> general-purpose mechanism. I have become, through a few bitter
>> experiences, a strong believer in the value of trying to reduce our
>> on-disk footprint, and knocking 4% off the size of every TOAST
>> table in the world does not sound worthless to me -- even though
>> context-aware compression can doubtless do a lot better.
>
> +1. It's also a lot faster, and I've seen way way too many workloads
> with 50%+ time spent in pglz.
>

TBH the 4% figure is something I mostly made up (I'm fake news!). On the mailing list archive (which I believe is pretty compressible) I observed something like a 2.5% size reduction with lz4 compared to pglz, at least with the compression levels I've used ...

Other algorithms (e.g. zstd) got significantly better compression (25%) compared to pglz, but in exchange for longer compression times. I'm sure we could lower the compression level to make it faster, but that will of course hurt the compression ratio.

I don't think switching to a different compression algorithm is a way forward - it was proposed and explored repeatedly in the past, and every time it failed for a number of reasons, most of which are still valid.

Firstly, it's going to be quite hard (or perhaps impossible) to find an algorithm that is "universally better" than pglz. Some algorithms do work better for text documents, some for binary blobs, etc. I don't think there's a win-win option.
Sure, there are workloads where pglz performs poorly (I've seen such cases too), but IMHO that's more an argument for the custom compression method approach. pglz gives you good default compression in most cases, and you can change it for columns where it matters, and where a different space/time trade-off makes sense. Secondly, all the previous attempts ran into some legal issues, i.e. licensing and/or patents. Maybe the situation changed since then (no idea, haven't looked into that), but in the past the "pluggable" approach was proposed as a way to address this. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/01/2017 03:23 PM, Robert Haas wrote: > On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra > <tomas.von...@2ndquadrant.com> wrote: >> OK. I think it's a nice use case (and nice gains on the compression >> ratio), demonstrating the datatype-aware compression. The question is >> why shouldn't this be built into the datatypes directly? > > Tomas, thanks for running benchmarks of this. I was surprised to see > how little improvement there was from other modern compression > methods, although lz4 did appear to be a modest win on both size and > speed. But I share your intuition that a lot of the interesting work > is in datatype-specific compression algorithms. I have noticed in a > number of papers that I've read that teaching other parts of the > system to operate directly on the compressed data, especially for > column stores, is a critical performance optimization; of course, that > only makes sense if the compression is datatype-specific. I don't > know exactly what that means for the design of this patch, though. > It has very little impact on this patch, as it has nothing to do with columnar storage. That is, each value is compressed independently. Column stores exploit the fact that they get a vector of values, compressed in some data-aware way. E.g. some form of RLE or dictionary compression, which allows them to evaluate expressions on the compressed vector. But that's irrelevant here, we only get row-by-row execution. Note: The idea to build dictionary for the whole jsonb column (which this patch should allow) does not make it "columnar compression" in the "column store" way. The executor will still get the decompressed value. > As a general point, no matter which way you go, you have to somehow > deal with on-disk compatibility. If you want to build in compression > to the datatype itself, you need to find at least one bit someplace to > mark the fact that you applied built-in compression. 
If you want to > build it in as a separate facility, you need to denote the compression > used someplace else. I haven't looked at how this patch does it, but > the proposal in the past has been to add a value to vartag_external.

AFAICS the patch does that by setting a bit in the varlena header, and then adding the OID of the compression method after the varlena header. So you get (varlena header + OID + data). This has good and bad consequences.

Good: It's transparent for the datatype, so it does not have to worry about the custom compression at all (and it may change arbitrarily).

Bad: It's transparent for the datatype, so it can't operate directly on the compressed representation.

I don't think this is an argument against the patch, though. If the datatype can support intelligent compression (and execution without decompression), it has to be done in the datatype anyway.

> One nice thing about the latter method is that it can be used for any
> data type generically, regardless of how much bit-space is available
> in the data type representation itself. It's realistically hard to
> think of a data-type that has no bit space available anywhere but is
> still subject to data-type specific compression; bytea definitionally
> has no bit space but also can't benefit from special-purpose
> compression, whereas even something like text could be handled by
> starting the varlena with a NUL byte to indicate compressed data
> following. However, you'd have to come up with a different trick for
> each data type. Piggybacking on the TOAST machinery avoids that. It
> also implies that we only try to compress values that are "big", which
> is probably desirable if we're talking about a kind of compression
> that makes comprehending the value slower. Not all types of
> compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
> and for those that don't it probably makes more sense to just build it
> into the data type.
> > All of that is a somewhat separate question from whether we should > have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using > the ACCESS METHOD stuff instead). Even if we agree that piggybacking > on TOAST is a good way to implement pluggable compression methods, it > doesn't follow that the compression method is something that should be > attached to the datatype from the outside; it could be built into it > in a deep way. For example, "packed" varlenas (1-byte header) are a > form of compression, and the default functions for detoasting always > produced unpacked values, but the operators for the text data type > know how to operate on the packed representation. That's sort of a > trivial example, but it might well be that there are other cases where > we can do something similar. Maybe jsonb, for example, can compress > data in such a
Re: [HACKERS] Custom compression methods
On 12/02/2017 09:24 PM, konstantin knizhnik wrote: > > On Dec 2, 2017, at 6:04 PM, Tomas Vondra wrote: > >> On 12/01/2017 10:52 PM, Andres Freund wrote: >> ... >> >> Other algorithms (e.g. zstd) got significantly better compression (25%) >> compared to pglz, but in exchange for longer compression. I'm sure we >> could lower compression level to make it faster, but that will of course >> hurt the compression ratio. >> >> I don't think switching to a different compression algorithm is a way >> forward - it was proposed and explored repeatedly in the past, and every >> time it failed for a number of reasons, most of which are still valid. >> >> >> Firstly, it's going to be quite hard (or perhaps impossible) to >> find an algorithm that is "universally better" than pglz. Some >> algorithms do work better for text documents, some for binary >> blobs, etc. I don't think there's a win-win option. >> >> Sure, there are workloads where pglz performs poorly (I've seen >> such cases too), but IMHO that's more an argument for the custom >> compression method approach. pglz gives you good default >> compression in most cases, and you can change it for columns where >> it matters, and where a different space/time trade-off makes >> sense. >> >> >> Secondly, all the previous attempts ran into some legal issues, i.e. >> licensing and/or patents. Maybe the situation changed since then (no >> idea, haven't looked into that), but in the past the "pluggable" >> approach was proposed as a way to address this. >> >> > > May be it will be interesting for you to see the following results > of applying page-level compression (CFS in PgPro-EE) to pgbench > data: > I don't follow. If I understand what CFS does correctly (and I'm mostly guessing here, because I haven't seen the code published anywhere, and I assume it's proprietary), it essentially compresses whole 8kB blocks. 
I don't know if it reorganizes the data into a columnar format first, in some way (to make it more "columnar", and thus more compressible), which would make it somewhat similar to page-level compression in Oracle. But it's clearly a very different approach from what the patch aims to improve (compressing individual varlena values).

>
> All algorithms (except zlib) were used with best-speed option: using
> better compression level usually has not so large impact on
> compression ratio (<30%), but can significantly increase time
> (several times). Certainly pgbench is not the best candidate for
> testing compression algorithms: it generates a lot of artificial and
> redundant data. But we measured it also on real customers data and
> still zstd seems to be the best compression methods: provides good
> compression with smallest CPU overhead.
>

I think this really depends on the dataset, and drawing conclusions based on a single test is somewhat crazy. Especially when it's synthetic pgbench data with lots of inherent redundancy - sequential IDs, ...

My takeaway from the results is rather that page-level compression may be very beneficial in some cases, although I wonder how much of that could be gained by simply using a compressed filesystem (thus making it transparent to PostgreSQL).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/02/2017 09:38 PM, Andres Freund wrote:
> Hi,
>
> On 2017-12-02 16:04:52 +0100, Tomas Vondra wrote:
>> Firstly, it's going to be quite hard (or perhaps impossible) to find an
>> algorithm that is "universally better" than pglz. Some algorithms do
>> work better for text documents, some for binary blobs, etc. I don't
>> think there's a win-win option.
>
> lz4 is pretty much there.
>

That's a matter of opinion, I guess. It's a solid compression algorithm, that's for sure ...

>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
>
> Those were pretty bogus.

IANAL, so I don't dare judge the bogusness of such claims. I assume that if we made it optional (e.g. a configure/initdb option), it'd be much less of an issue. Of course, that has disadvantages too (because when you compile/init with one algorithm, and then find something else would work better for your data, you have to start from scratch).

>
> I think we're not doing our users a favor if they've to download
> some external projects, then fiddle with things, just to not choose
> a compression algorithm that's been known bad for at least 5+ years.
> If we've a decent algorithm in-core *and* then allow extensibility,
> that's one thing, but keeping the bad and tell forks "please take
> our users with this code we give you" is ...
>

I don't understand what exactly your issue with external projects is, TBH. I think extensibility is one of the great strengths of Postgres. It's not all rainbows and unicorns, of course, and it has costs too.

FWIW I don't think pglz is a "known bad" algorithm. Perhaps there are cases where other algorithms (e.g. lz4) are running circles around it, particularly when it comes to decompression speed, but I wouldn't say it's "known bad". Not sure which forks you're talking about ...
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Rethinking MemoryContext creation
On 12/11/2017 05:27 PM, Tom Lane wrote:
> I wrote:
>> While fooling around with a different problem, I got annoyed at how slow
>> MemoryContext creation and deletion is.
>
> Looking at this some more, there's another micro-optimization we could
> make, which is to try to get rid of the strlen+strcpy operations needed
> for the context name. (And yes, I'm seeing those show up in profiles,
> to the tune of a couple percent of total runtime in some examples.)
>
> For a *very* large majority of the callers of AllocSetContextCreate,
> the context name is a simple C string constant, so we could just store
> the pointer to it and save the space and cycles required to copy it.
> This would require providing a separate API for the few callers that are
> actually passing transient strings, but that's easy enough. I envision
> AllocSetContextCreate becoming a wrapper macro for
> "AllocSetContextCreateExtended", which'd take a flag argument specifying
> whether the context name needs to be copied.
>
> However, unless we want to run around and touch all the ~ 150 calls
> with constant arguments, we'd have to set things up so that the default
> behavior for AllocSetContextCreate is to not copy. This risks breaking
> callers in extensions. Not entirely sure if it's worth that --- any
> thoughts?
>

I don't think silently breaking extensions is a particularly attractive option, so I guess we'll have to run around and tweak the ~150 calls.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Rethinking MemoryContext creation
On 12/11/2017 06:22 PM, Robert Haas wrote:
> On Mon, Dec 11, 2017 at 11:59 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>>> On 12/11/2017 05:27 PM, Tom Lane wrote:
>>>> However, unless we want to run around and touch all the ~ 150 calls
>>>> with constant arguments, we'd have to set things up so that the default
>>>> behavior for AllocSetContextCreate is to not copy. This risks breaking
>>>> callers in extensions. Not entirely sure if it's worth that --- any
>>>> thoughts?
>>
>>> I don't think silently breaking extensions is a particularly attractive
>>> option, so I guess we'll have to run around and tweak the ~150 calls.
>>
>> Meh. I suppose that of the ~150 call sites, there are probably only
>> a dozen or two where it would actually make a performance difference,
>> so maybe this needn't be quite as invasive as I first thought.
>
> I think changing only a subset of the call sites is unappealing
> because, even though it may not make a measurable performance
> difference in other cases, it may get cargo-culted into some place
> where it does make a difference.
>

Not sure. On the one hand, it might certainly be somewhat confusing when we use two different methods to create memory contexts with C string constants. On the other hand, I'm sure we have other similar "local" optimizations. I'd say "let's just tweak all the calls to use the new function" but I'm not the person doing the leg work ...

> I also don't think silently breaking extensions is a terribly big deal
> in this case. It seems likely that most extensions use static names
> just as most of our internal stuff does. I'm going to guess that the
> number of extensions that will actually break as a result of a change
> in this area is probably very small - conceivably zero, and likely
> less than five. I don't think we should be willing to uglify the core
> code too much for that level of breakage.
> I don't know how many extensions you maintain, but in my experience this type of silent breakage is extremely annoying. I definitely prefer a clean compile-time API breakage, for example. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
Hi,

I see there's an ongoing discussion about the syntax and ALTER TABLE behavior when changing a compression method for a column. So the patch seems to be on its way to being ready in the January CF, I guess.

But let me play the devil's advocate for a while and question the usefulness of this approach to compression. Some of the questions were mentioned in the thread before, but I don't think they got the attention they deserve. FWIW I don't know the answers, but I think it's important to ask them. Also, apologies if this post seems to argue against the patch - that's part of the "devil's advocate" thing.

The main question I'm asking myself is what use cases the patch addresses, and whether there is a better way to do that. I see about three main use cases:

1) Replacing the algorithm used to compress all varlena types (in a way that makes it transparent for the data type code).

2) Custom datatype-aware compression (e.g. the tsvector).

3) Custom datatype-aware compression with additional column-specific metadata (e.g. the jsonb with external dictionary).

Now, let's discuss those use cases one by one, and see if there are simpler (or better in some way) solutions ...


Replacing the algorithm used to compress all varlena values (in a way that makes it transparent for the data type code)
--

While pglz served us well over time, it was repeatedly mentioned that in some cases it becomes the bottleneck. So supporting other state-of-the-art compression algorithms seems like a good idea, and this patch is one way to do that.

But perhaps we should simply make it an initdb option (in which case the whole cluster would simply use e.g. lz4 instead of pglz)?

That seems like a much simpler approach - it would only require some ./configure options to add --with-lz4 (and other compression libraries), an initdb option to pick the compression algorithm, and probably noting the choice in the cluster controldata. No dependency tracking, no ALTER TABLE issues, etc.
Of course, it would not allow using different compression algorithms for different columns (although it might perhaps allow different compression levels, to some extent).

Conclusion: If we want to offer a simple cluster-wide pglz alternative, perhaps this patch is not the right way to do that.


Custom datatype-aware compression (e.g. the tsvector)
--

Exploiting knowledge of the internal data type structure is a promising way to improve compression ratio and/or performance.

The obvious question of course is why shouldn't this be done by the data type code directly, which would also allow additional benefits like operating directly on the compressed values.

Another thing is that if the datatype representation changes in some way, the compression method has to change too. So it's tightly coupled to the datatype anyway.

This does not really require any new infrastructure, all the pieces are already there.

In some cases that may not be quite possible - the datatype may not be flexible enough to support an alternative (compressed) representation, e.g. because there are no bits available for a "compressed" flag, etc.

Conclusion: IMHO if we want to exploit the knowledge of the data type internal structure, perhaps doing that in the datatype code directly would be a better choice.


Custom datatype-aware compression with additional column-specific metadata (e.g. the jsonb with external dictionary)
--

Exploiting redundancy in multiple values in the same column (instead of compressing them independently) is another attractive way to help the compression. It is inherently datatype-aware, but currently can't be implemented directly in datatype code, as there's no concept of column-specific storage (e.g. to store a dictionary shared by all values in a particular column).

I believe any patch addressing this use case would have to introduce such column-specific storage, and any solution doing that would probably need to introduce the same catalogs, etc.
The obvious disadvantage of course is that we need to decompress the varlena value before doing pretty much anything with it, because the datatype is not aware of the compression. So I wonder if the patch should instead provide infrastructure for doing that in the datatype code directly. The other question is if the patch should introduce some infrastructure for handling the column context (e.g. column dictionary). Right now, whoever implements the compression has to implement this bit too. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On 12/01/2017 08:38 PM, Alvaro Herrera wrote: > Tomas Vondra wrote: > >> On 11/30/2017 09:51 PM, Alvaro Herrera wrote: > >>> Just passing by, but wouldn't this fit in the ACCESS METHOD group of >>> commands? So this could be simplified down to >>> CREATE ACCESS METHOD ts1 TYPE COMPRESSION >>> we have that for indexes and there are patches flying for heap storage, >>> sequences, etc. >> >> I think that would conflate two very different concepts. In my mind, >> access methods define how rows are stored. > > In mine, they define how things are accessed (i.e. more general than > what you're thinking). We *currently* use them to store rows [in > indexes], but there is no reason why we couldn't expand that. > Not sure I follow. My argument was not as much about whether the rows are stored as rows or in some other (columnar) format, but that access methods deal with "tuples" (i.e. row in the "logical" way). I assume that even if we end up implementing other access method types, they will still be tuple-based. OTOH compression methods (at least as introduced by this patch) operate on individual values, and have very little to do with access to the value (in a sense it's a transparent thing). > > So we group access methods in "types"; the current type we have is for > indexes, and methods in that type define how are indexes accessed. This > new type would indicate how would values be compressed. I disagree that > there is no parallel there. > > I'm trying to avoid pointless proliferation of narrowly defined DDL > commands. > Of course, the opposite case is using the same DDL for very different concepts (although I understand you don't see it that way). But in fairness, I don't really care if we call this COMPRESSION METHOD or ACCESS METHOD or DARTH VADER ... >> Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to >> restrict the compression algorithm to a particular data type (so, if it >> relies on tsvector, you can't apply it to text columns). 
> > Yes, of course. I'm saying that the "datatype" property of a > compression access method would be declared somewhere else, not in the > TYPE clause of the CREATE ACCESS METHOD command. Perhaps it makes sense > to declare that a certain compression access method is good only for a > certain data type, and then you can put that in the options clause, > "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)". > But many compression access methods would be general in nature and so > could be used for many datatypes (say, snappy). > > To me it makes sense to say "let's create this method which is for data > compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by > either "let's use this new compression method for the type tsvector" > (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new > compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET > COMPRESSION hyperz). > The WITH syntax does not seem particularly pretty to me, TBH. I'd be much happier with "TYPE tsvector" and leaving WITH for the options specific to each compression method. FWIW I think syntax is the least critical part of this patch. It's ~1% of the patch, and the gram.y additions are rather trivial. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: GROUP BY optimization
May be: 1) first column - the column with the bigger estimated number of groups 2) second column - the pair of (first, second) with the bigger estimated number of groups 3) third column - the triple of (first, second, third) with the bigger ... But it seems that even with that algorithm, ISTM, it could be implemented in a cheaper manner. Maybe. I do have some ideas, although I haven't tried implementing it. If there's no extended statistic on the columns, you can do the current thing (assuming independence etc.). There's not much we can do here. If there's an extended statistic, you can do either a greedy search (get the next column with the highest ndistinct coefficient) or an exhaustive search (computing the estimated number of comparisons). Another challenge is that using only the ndistinct coefficient assumes uniform distribution of the values. But you can have a column with 1M distinct values, where a single value represents 99% of the rows. And another column with 100k distinct values, with an actually uniform distribution. I'm pretty sure it'd be more efficient to place the 100k column first. The real issue however is that this decision can't be made entirely locally. Consider for example this: explain select a,b,c, count(*) from t group by a,b,c order by c,b,a; Sorting once by (c,b,a) and grouping in that order is clearly cheaper (at least according to the optimizer) than doing two separate sorts. So the optimization actually does the wrong thing here, and it needs to somehow consider the other ordering requirements (in this case ORDER BY) - either by generating multiple paths with different orderings or by some heuristics. Hm, thank you. I consider it a bug in my implementation - the basic idea was that we try to match an already existing or needed order, and only if we fail or have an unmatched tail of the pathkey list do we try to find the cheapest column order. Fixed in v7 (0002-opt_group_by_index_and_order-v7.patch), though perhaps in a naive way: if we don't have a path pathkey, we first try to reorder columns according to the ORDER BY clause. 
A test for this is also added. OK. I'll give it a try. I'm also wondering how freely we can change the GROUP BY result ordering. Assume you disable hash aggregate and parallel query - currently we are guaranteed to use a group aggregate that produces exactly the ordering specified in GROUP BY, but this patch removes that "guarantee" and we can produce an arbitrary permutation of the ordering. But as I argued in other threads, such implicit guarantees are really fragile, can silently break for arbitrary reasons (say, parallel query will do just that) and can be easily fixed by adding a proper ORDER BY. So I don't think this is an issue. Agree. SQL by design doesn't guarantee a particular order without an explicit ORDER BY clause. The other random thought is how this will/could interact with the incremental sort patch. I know Alexander wanted to significantly limit where we actually consider the incremental sort, but I'm not sure if this was one of those places or not (it sure looks like a place where we would greatly benefit from it). It seems they should not directly interact. This patch tries to find the cheapest column order, Alexander's patch tries to make the sort cheaper - they are different tasks. But I will try. Thanks. So to summarize this initial review - I do suggest splitting the patch into two parts. One that does the index magic, and one for this reordering optimization. The first part (additional indexes) seems fairly safe, and likely to get committable soon. The other part (ndistinct reordering) IMHO requires more thought regarding costing and interaction with other query parts. Thank you for the review! ;-) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
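The greedy/exhaustive search and the skew caveat discussed above can be illustrated with a toy cost model (all names and numbers here are hypothetical, not taken from the patch): later columns are only compared when all earlier columns compare equal, so under a uniformity assumption the fall-through probability is 1/ndistinct of the prefix - which is exactly the assumption that breaks down for skewed data.

```python
from math import log2

def expected_cols_per_comparison(prefix_ndistincts):
    """Expected number of column comparisons per tuple comparison.

    prefix_ndistincts[i] is the estimated number of distinct combinations
    of the first i+1 columns.  Column i+1 is compared only when the first
    i columns are equal, approximated as 1/prefix_ndistincts[i-1] --
    the uniformity assumption criticized in the mail above.
    """
    cost, p_fallthrough = 1.0, 1.0
    for nd in prefix_ndistincts[:-1]:
        p_fallthrough /= nd      # chance a comparison falls through to the next column
        cost += p_fallthrough
    return cost

def sort_comparison_cost(nrows, prefix_ndistincts):
    # ~ N log2 N tuple comparisons during the sort
    return nrows * log2(nrows) * expected_cols_per_comparison(prefix_ndistincts)

n = 10_000_000
# Leading with the 1M-distinct column looks better under uniformity ...
cost_1m_first = sort_comparison_cost(n, [1_000_000, 5_000_000])
cost_100k_first = sort_comparison_cost(n, [100_000, 5_000_000])
# ... but if one value covers 99% of the 1M-distinct column, the real
# fall-through probability is close to 0.99**2, not 1/1_000_000:
cost_1m_first_skewed = n * log2(n) * (1.0 + 0.99 ** 2)
```

Under uniformity the 1M-distinct column wins; with the 99%-skew its cost exceeds the uniform 100k column's, matching the intuition above that ndistinct alone is not a safe criterion.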
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 07:39 PM, David Fetter wrote: On Tue, Jun 05, 2018 at 01:27:01PM -0400, Tom Lane wrote: David Fetter writes: On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote: True. Although not all built-in aggregates have those defined. Just out of curiosity, which ones don't? As of 3f85c62d9e825eedd1315d249ef1ad793ca78ed4, pg_aggregate has both of those as NOT NULL. NOT NULL isn't too relevant; that's just protecting the fixed-width nature of the catalog rows. What's important is which ones are zero. Thanks for helping me understand this better. # select aggfnoid::regprocedure, aggkind from pg_aggregate where (aggserialfn=0 or aggdeserialfn=0) and aggtranstype = 'internal'::regtype; aggfnoid | aggkind --+- [snip] (19 rows) Probably the ordered-set/hypothetical ones aren't relevant for this issue. Whether or not we feel like fixing the above "normal" aggs for this, the patch would have to not fail on extension aggregates that don't support serialization. Could there be some kind of default serialization with reasonable properties? Not really, because the aggregates often use "internal", i.e. a pointer referencing whothehellknowswhat, and how do you serialize/deserialize that? The other issue is that serialize/deserialize is only part of the problem - you also need to know how to do "combine", and not all aggregates can do that ... (certainly not in a universal way). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: GROUP BY optimization
On 06/09/2018 08:09 PM, Tomas Vondra wrote: > > /snip/ > > 4) when adding Sort for grouping, try producing the right output order >(if the ORDER BY was specified) > BTW I've just realized we already do something similar in master. If you run a query like this: SELECT a, b, count(*) FROM t GROUP BY b, a ORDER BY a; we will actually plan it like this: QUERY PLAN --- GroupAggregate Group Key: a, b -> Sort Sort Key: a, b -> Seq Scan on t (5 rows) I.e. we already do reorder the group clauses to match ORDER BY, to only require a single sort. This happens in preprocess_groupclause(), which also explains the reasoning behind that. I wonder if some of the new code reordering group pathkeys could/should be moved here (not sure, maybe it's too early for those decisions). In any case, it might be appropriate to update some of the comments before preprocess_groupclause() which claim we don't do certain things added by the proposed patches. This probably also somewhat refutes my claim that the order of grouping keys is currently fully determined by users (and so they may pick the most efficient order), while the reorder-by-ndistinct patch would make that impossible. Apparently when there's ORDER BY, we already mess with the order of group clauses - there are ways to get around it (subquery with OFFSET 0) but it's much less clear. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
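For illustration, here is a minimal model of the reordering that preprocess_groupclause() performs (a hypothetical simplification; the real code works on SortGroupClause lists and pathkeys, not strings):

```python
def preprocess_group_keys(group_keys, order_by_keys):
    """Move GROUP BY keys that form a prefix of the ORDER BY to the front,
    so a single sort can serve both the grouping and the ordering."""
    group_set = set(group_keys)
    prefix = []
    for key in order_by_keys:
        if key not in group_set:
            break          # an ORDER BY key that isn't grouped: stop matching
        prefix.append(key)
    # keep the non-matching group keys in their original order
    return prefix + [k for k in group_keys if k not in prefix]
```

With this model, "GROUP BY b, a ORDER BY a" groups on (a, b), matching the plan shown above; if no ORDER BY prefix matches, the user's GROUP BY order is preserved.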
Re: POC: GROUP BY optimization
). I mean, we first call n_preordered = group_keys_reorder_by_pathkeys(path->pathkeys, ...); so the reordering tries to maintain the ordering of the input path. Then if (n_preordered == 0) we do this: group_keys_reorder_by_pathkeys(root->sort_pathkeys, ...); Doesn't that mean that if we happen to have an input path with partially matching ordering (so still requiring an explicit Sort before grouping) we may end up ignoring the ORDER BY entirely? Instead of using a Sort that would match the ORDER BY? I might have misunderstood this, though. I'm not sure why the first group_keys_reorder_by_pathkeys() call does not simply return 0 when the input path ordering is not sufficient for the grouping? Is there some reasoning behind that (e.g. that sorting partially sorted input is faster)? Overall I think this patch introduces four different optimizations that are indeed very interesting: 1) consider additional sorted paths for grouping 2) reorder GROUP BY clauses per ndistinct to minimize comparisons 3) when adding Sort for grouping, maintain partial input ordering 4) when adding Sort for grouping, try producing the right output order (if the ORDER BY was specified) But at this point it's kinda mashed together in an ad-hoc manner, using very simple heuristics / assumptions. We'll need to figure out how to combine those optimizations. BTW I get compiler warnings that n_preordered_groups may be used uninitialized in add_paths_to_grouping_rel, because it's only set in an if/else branch, and then used further down. > Next, I had a look at cost_incremental_sort() provided by the > incremental sort patch and, it's a pity, it doesn't solve our problem > with the impact of the cost of the per-column comparison function and > the number of its calls. It currently doesn't, but that might be more an issue in the incremental sort patch and we may need to fix that. 
Clearly the costing in general in that patch needs more work, and I do recall Tom pointing out that the current heuristic (estimating the number of sort groups using ndistinct) seems rather weak. It may not be exactly the same problem, but it seems kinda similar to what this patch does. I have a hunch that those patches will end up inventing something fairly similar. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
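The control flow being questioned above can be restated as a sketch (hypothetical names and a string-based simplification; the real functions operate on pathkey lists with equivalence classes):

```python
def reorder_by_pathkeys(group_keys, pathkeys):
    """Return (n_matched, reordered): move group keys matching a prefix of
    pathkeys to the front, keeping the remainder in original order."""
    matched, remaining = [], list(group_keys)
    for pk in pathkeys:
        if pk not in remaining:
            break
        matched.append(pk)
        remaining.remove(pk)
    return len(matched), matched + remaining

def choose_group_key_order(group_keys, input_pathkeys, sort_pathkeys):
    """The two-step heuristic described in the mail: first try to keep a
    prefix matching the input path's ordering; only if *nothing* matches
    fall back to matching the query's ORDER BY."""
    n_preordered, keys = reorder_by_pathkeys(group_keys, input_pathkeys)
    if n_preordered == 0:
        # the questioned branch: ORDER BY is consulted only when the
        # input ordering is of no use at all
        _, keys = reorder_by_pathkeys(group_keys, sort_pathkeys)
    return keys
```

This makes the concern concrete: a partial input-path match (n_preordered > 0) wins even when it forces an explicit Sort whose order ignores the ORDER BY.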
Re: POC: GROUP BY optimization
On 06/07/2018 03:41 PM, Teodor Sigaev wrote: >> ... snip ... >> Prioritization of the user-provided order can be as simple as giving that comparison_cost a small handicap. I see no point in doing that, and I don't recall a single place in the planner where we do that. If the user specified ORDER BY, we'll slap an explicit Sort on top when needed (which acts as the handicap, but in a clear way). Otherwise we don't do such things - it'd be just plain confusing (consider "ORDER BY a,b" vs. "ORDER BY b,c" with the same data types, ndistinct etc. but unexpectedly different costs). Also, what would be a good value for the handicap? Again agree. If we have a fixed order of columns (ORDER BY) then we should not try to reorder it. The current patch follows that, if I didn't make a mistake. This part seems to be more a misunderstanding between me and Claudio. I believe Claudio was referring to the column order in a GROUP BY, not ORDER BY. In which case we don't add any Sort, of course. I'm still opposed to adding an arbitrary handicap to prioritize the order specified by the user, for the reasons I explained before. We should aim to make the heuristics/costing reliable enough to make this unnecessary. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WAL prefetch
On 06/16/2018 12:06 PM, Thomas Munro wrote: On Sat, Jun 16, 2018 at 9:38 PM, Tomas Vondra wrote: On 06/15/2018 08:01 PM, Andres Freund wrote: On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote: On 14.06.2018 09:52, Thomas Munro wrote: Why stop at the page cache... what about shared buffers? It is good question. I thought a lot about prefetching directly to shared buffers. I think that's definitely how this should work. I'm pretty strongly opposed to a prefetching implementation that doesn't read into s_b. Could you elaborate why prefetching into s_b is so much better (I'm sure it has advantages, but I suppose prefetching into page cache would be much easier to implement). posix_fadvise(POSIX_FADV_WILLNEED) might already get most of the speed-up available here in the short term for this immediate application, but in the long term a shared buffers prefetch system is one of the components we'll need to support direct IO. Sure. Assuming the switch to direct I/O will happen (it probably will, sooner or later), my question is whether this patch should be required to introduce the prefetching into s_b. Or should we use posix_fadvise for now, get most of the benefit, and leave the prefetch into s_b as an improvement for later? The thing is - we're already doing posix_fadvise prefetching in bitmap heap scans, it would not be putting additional burden on the direct I/O patch (hypothetical, so far). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WAL prefetch
On 06/15/2018 08:01 PM, Andres Freund wrote: On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote: On 14.06.2018 09:52, Thomas Munro wrote: On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik wrote: pg_wal_prefetch function will infinitely traverse WAL and prefetch block references in WAL records using posix_fadvise(WILLNEED) system call. Hi Konstantin, Why stop at the page cache... what about shared buffers? It is good question. I thought a lot about prefetching directly to shared buffers. I think that's definitely how this should work. I'm pretty strongly opposed to a prefetching implementation that doesn't read into s_b. Could you elaborate why prefetching into s_b is so much better (I'm sure it has advantages, but I suppose prefetching into page cache would be much easier to implement). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WAL prefetch
On 06/16/2018 09:02 PM, Andres Freund wrote: > On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote: >> >> >> On 06/15/2018 08:01 PM, Andres Freund wrote: >>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote: >>>> >>>> >>>> On 14.06.2018 09:52, Thomas Munro wrote: >>>>> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik >>>>> wrote: >>>>>> pg_wal_prefetch function will infinitely traverse WAL and prefetch block >>>>>> references in WAL records >>>>>> using posix_fadvise(WILLNEED) system call. >>>>> Hi Konstantin, >>>>> >>>>> Why stop at the page cache... what about shared buffers? >>>>> >>>> >>>> It is good question. I thought a lot about prefetching directly to shared >>>> buffers. >>> >>> I think that's definitely how this should work. I'm pretty strongly >>> opposed to a prefetching implementation that doesn't read into s_b. >>> >> >> Could you elaborate why prefetching into s_b is so much better (I'm sure it >> has advantages, but I suppose prefetching into page cache would be much >> easier to implement). > > I think there's a number of issues with just issuing prefetch requests > via fadvise etc: > > - it leads to guaranteed double buffering, in a way that's just about > guaranteed to *never* be useful. Because we'd only prefetch whenever > there's an upcoming write, there's simply no benefit in the page > staying in the page cache - we'll write out the whole page back to the > OS. How does reading directly into shared buffers substantially change the behavior? The only difference is that we end up with the double buffering after performing the write. Which is expected to happen pretty quick after the read request. > - reading from the page cache is far from free - so you add costs to the > replay process that it doesn't need to do. > - you don't have any sort of completion notification, so you basically > just have to guess how far ahead you want to read. If you read a bit > too much you suddenly get into synchronous blocking land. 
> - The OS page cache is actually not particularly scalable to large amounts of > data either. Nor are the decisions what to keep cached likely to be > particularly useful. The posix_fadvise approach is not perfect, no doubt about that. But it works pretty well for bitmap heap scans, and it's about 13249x better (rough estimate) than the current solution (no prefetching). > - We imo need to add support for direct IO before long, and adding more > and more work to reach feature parity strikes me as a bad move. > IMHO it's unlikely to happen in PG12, but I might be over-estimating the invasiveness and complexity of the direct I/O change. While this patch seems pretty doable, and the improvements are pretty significant. My point was that I don't think this actually adds a significant amount of work to the direct IO patch, as we already do prefetch for bitmap heap scans. So this needs to be written anyway, and I'd expect those two places to share most of the code. So where's the additional work? I don't think we should reject patches just because they might add a bit of work to some not-yet-written future patch ... (which I however don't think is the case here). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/18/2018 05:06 PM, Joe Conway wrote: On 06/18/2018 10:52 AM, Tom Lane wrote: Robert Haas writes: On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway wrote: Not necessarily. Our pages probably have enough predictable bytes to aid cryptanalysis, compared to user data in a column which might not be very predictable. Really? I would guess that the amount of entropy in a page is WAY higher than in an individual column value. Depending on the specifics of the encryption scheme, having some amount of known (or guessable) plaintext may allow breaking the cipher, even if much of the plaintext is not known. This is cryptology 101, really. Exactly. At the same time, having to have a bunch of independently-decipherable short field values is not real secure either, especially if they're known to all be encrypted with the same key. But what you know or can guess about the plaintext in such cases would be target-specific, rather than an attack that could be built once and used against any PG database. Again, it depends on the specific solution for encryption. In some cases you might do something like generate a single-use random key, encrypt the payload with that, encrypt the single-use key with the "global" key, append the two results and store. Yeah, I suppose we could even have per-page keys, for example. One topic I haven't seen mentioned in this thread yet is indexes. That's a pretty significant side-channel, when built on encrypted columns. Even if the indexes are encrypted too, you can often deduce a lot of information from them. So what's the plan here? Disallow indexes on encrypted columns? Index encrypted values directly? Something else? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
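The single-use-key scheme Joe describes (often called envelope encryption) can be sketched as follows. The cipher used here is a deliberately insecure SHA-256-counter-mode stand-in, only to show the envelope structure; a real implementation would use an authenticated cipher such as AES-GCM, and all function names here are made up:

```python
import hashlib
import secrets

def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream 'cipher' (SHA-256 in counter mode).
    Illustration only -- NOT secure, no authentication, do not use."""
    out = bytearray()
    for off in range(0, len(data), 32):
        pad = hashlib.sha256(key + off.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[off:off + 32], pad))
    return bytes(out)

def envelope_encrypt(global_key: bytes, payload: bytes) -> bytes:
    """The scheme from the mail: encrypt the payload with a single-use
    random key, encrypt that key with the global key, store both."""
    data_key = secrets.token_bytes(32)
    wrapped_key = _keystream_xor(global_key, data_key)   # 32 bytes
    return wrapped_key + _keystream_xor(data_key, payload)

def envelope_decrypt(global_key: bytes, stored: bytes) -> bytes:
    data_key = _keystream_xor(global_key, stored[:32])   # unwrap the data key
    return _keystream_xor(data_key, stored[32:])
```

One consequence visible even in this sketch: because the data key is fresh per value, identical plaintexts produce different ciphertexts, which removes one obvious equality side-channel - but, as noted above, an index built over the plaintexts would reintroduce it.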
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Hi, On 05/25/2018 01:41 PM, Moon, Insung wrote: Hello Hackers, ... BTW, I want to support CBC mode encryption[3]. However, I'm not sure how to use the IV in CBC mode for this proposal. I'd like to hear opinions from security engineers. I'm not a cryptographer either, but this is exactly where you need a prior discussion about the threat models - there are a couple of chaining modes, each with different weaknesses. FWIW it may also matter if data_checksums are enabled, because that may prevent malleability attacks affecting some of the modes. Assuming an active attacker (with the ability to modify the data files) is part of the threat model, of course. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/11/2018 11:22 AM, Masahiko Sawada wrote: On Fri, May 25, 2018 at 8:41 PM, Moon, Insung wrote: Hello Hackers, This proposes a way to develop "Table-level" Transparent Data Encryption (TDE) and Key Management Service (KMS) support in PostgreSQL. ... As per the discussion at the PGCon unconference, I think that firstly we need to discuss what threats we want to defend database data against. If the user wants to defend against a threat where a malicious user logged into the OS or database steals important data from the database, this TDE design would not help, because such a user can steal the data by getting a memory dump or via SQL. That of course differs depending on system requirements or security compliance, but what threats do you want to defend database data against? And why? I do agree with this - a description of the threat model needs to be part of the design discussion, otherwise it's not possible to compare it to alternative solutions (e.g. full-disk encryption using LUKS or using existing privilege controls and/or RLS). TDE was proposed/discussed repeatedly in the past, and every time it died exactly because it was not very clear which issue it was attempting to solve. Let me share some of the issues mentioned as possibly addressed by TDE (I'm not entirely sure TDE actually solves them, I'm just saying those were mentioned in previous discussions): 1) enterprise requirement - Companies want in-database encryption, for various reasons (because "enterprise solution" or something). 2) like FDE, but OS/filesystem independent - Same config on any OS and filesystem, which may make maintenance easier. 3) does not require special OS/filesystem setup - Does not require help from system administrators, setup of LUKS devices or whatever. 
4) all filesystem access (basebackups/rsync) is encrypted anyway 5) solves key management (the main challenge with pgcrypto) 6) allows encrypting only some of the data (tables, columns) to minimize performance impact IMHO it makes sense to have TDE even if it provides the same "security" as disk-level encryption, assuming it's more convenient to set up/use from the database. Also, if I understand correctly, at the unconference session there were also two suggestions about the design other than the suggestion by Alexander: implementing TDE at the column level using POLICY, and implementing TDE at the tablespace level. The former was suggested by Joe, but I'm not sure of the details of that suggestion; I'd love to hear them. The latter was suggested by Tsunakawa-san. Have you considered that? You mentioned that encryption of temporary data for query processing and large objects is still under consideration. But besides those, you should also consider the temporary data generated by other subsystems such as reorderbuffer and transition tables. The severity of those limitations is likely related to the threat model. I don't think encrypting temporary data would be a big problem, assuming you know which key to use. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: GROUP BY optimization
On 06/13/2018 06:56 PM, Teodor Sigaev wrote: >> I.e. we already do reorder the group clauses to match ORDER BY, to only >> require a single sort. This happens in preprocess_groupclause(), which >> also explains the reasoning behind that. > Huh. I missed that. That means the group_keys_reorder_by_pathkeys() call > inside get_cheapest_group_keys_order() duplicates the work of > preprocess_groupclause(). > And so it's possible to replace it with simply counting the number of > matching pathkeys at the beginning of the lists. Or remove most of > preprocess_groupclause() - but it seems to me we should use the first > option; preprocess_groupclause() is simpler, it doesn't operate on > pathkeys, only on SortGroupClause, which is simpler. > > BTW, the incremental sort patch provides pathkeys_common(), exactly what we > need. > >> I wonder if some of the new code reordering group pathkeys could/should >> be moved here (not sure, maybe it's too early for those decisions). In >> any case, it might be appropriate to update some of the comments before >> preprocess_groupclause() which claim we don't do certain things added by >> the proposed patches. > > preprocess_groupclause() is too early a step to use something like > group_keys_reorder_by_pathkeys(), because the pathkeys are not known there yet. > > >> This probably also somewhat refutes my claim that the order of >> grouping keys is currently fully determined by users (and so they >> may pick the most efficient order), while the reorder-by-ndistinct >> patch would make that impossible. Apparently when there's ORDER BY, >> we already mess with the order of group clauses - there are ways to >> get around it (subquery with OFFSET 0) but it's much less clear. > > I like the moment when objections go away :) > I'm afraid that's a misunderstanding - my objections did not really go away. I'm just saying my claim that we're not messing with the order of grouping keys is not entirely accurate, because we do that in one particular case. 
I still think we need to be careful when introducing new optimizations in this area - reordering the grouping keys by ndistinct, ORDER BY or whatever. In particular I don't think we should commit these patches that may quite easily cause regressions, and then hope some hypothetical future patch fixes the costing. Those objections still stand. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/11/2018 07:14 PM, Andres Freund wrote: Hi, On 2018-06-11 17:29:52 +0200, Tomas Vondra wrote: It would be great to get something that performs better than just falling back to sort (and I was advocating for that), but I'm worried we might be moving the goalposts way too far. I'm unclear on why that'd have such bad performance in relevant cases. You're not going to hit the path unless the number of groups is pretty large (or work_mem is ridiculously small, in which case we don't care). With a large number of groups the sorting path isn't particularly inefficient, because repeatedly storing the input values isn't such a large fraction in comparison to the number of groups (and their transition values). Which scenarios are you concerned about? Say you have a 1TB table, and keeping the groups in memory would require work_mem=2GB. After hitting the work_mem limit, there may be a pretty large amount of tuples you'd have to spill to disk and sort. For example, if we hit the work_mem limit after processing 10% of the tuples, switching to sort would mean spilling and sorting 900GB of data. Or we might say - hmm, we're 10% through, so we expect to hit the limit 10x, so let's spill the hash table and then do a sort on that, writing and sorting only 10GB of data. (Or merging it in some hash-based way, per Robert's earlier message.) I don't quite understand the argument that the number of groups needs to be pretty large for us to hit this. So what if the groups need 2x or 10x more than work_mem? It can still be cheaper than switching to the sort-based approach, no? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
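The arithmetic in that example can be written out explicitly (a back-of-the-envelope model, not code from any patch):

```python
def spill_estimates(table_gb, fraction_done_at_limit, hash_table_gb):
    """Compare the two strategies from the example above.

    sort_spill: switch to sort-based aggregation once work_mem is hit,
    spilling and sorting all remaining input tuples.
    hash_spill: keep hashing and dump the hash table's transition states
    each time it fills; at 10% through we expect ~10 fills, and the
    final fill stays in memory.
    """
    sort_spill = table_gb * (1 - fraction_done_at_limit)
    fills = 1 / fraction_done_at_limit
    hash_spill = hash_table_gb * (fills - 1)
    return sort_spill, hash_spill

# 1TB table, limit hit after 10% of tuples, ~2GB of transition states:
sort_gb, hash_gb = spill_estimates(1000, 0.10, 2)
```

This yields 900GB versus ~18GB - the same ballpark as the mail's 10GB figure (which rounds differently), and the reason spilling the hash table can beat a wholesale switch to sort when groups exceed work_mem only modestly.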
Re: Spilling hashed SetOps and aggregates to disk
On 06/11/2018 08:13 PM, Jeff Davis wrote: > On Mon, 2018-06-11 at 19:33 +0200, Tomas Vondra wrote: >> For example we hit the work_mem limit after processing 10% tuples, >> switching to sort would mean spill+sort of 900GB of data. Or we >> might say - hmm, we're 10% through, so we expect hitting the limit >> 10x, so let's spill the hash table and then do sort on that, >> writing and sorting only 10GB of data. (Or merging it in some >> hash-based way, per Robert's earlier message.) > > Your example depends on large groups and a high degree of group > clustering. That's fine, but it's a special case, > True, it's a special case and it won't work for other cases. It was merely an example for Andres. OTOH it's not entirely unrealistic, I think. Consider something like SELECT extract(year from ts) AS y, extract(month from ts) AS m, extract(day from ts) AS d, string_agg(x), array_agg(y) FROM fact_table GROUP BY y, m, d; which is likely very correlated (assuming new data is appended to the table), and the string_agg/array_agg are likely to produce fairly large groups (about proportional to the number of tuples in the group). Another example might be about HLL aggregate, although in that case the transition state does not grow, so it may not be that bad (and the default estimate of 1kB would work pretty nicely). But there certainly are other aggregates with large transition state, where this might not be the case, and we currently have no way to communicate that to the planner - except for setting work_mem much lower :-/ However, I now realize I've ignored the fact that we typically don't sort the whole table but only a very few columns, so the example was not entirely fair - we would not sort the whole remaining 900GB but likely much less. > and complexity does have a cost, too. Sure. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WAL prefetch
On 06/19/2018 05:50 PM, Andres Freund wrote: On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote: I do not think that prefetching into shared buffers requires much more effort and makes the patch more invasive... It even somehow simplifies it, because there is no need to maintain our own cache of prefetched pages... But it will definitely have much more impact on Postgres performance: contention for buffer locks, throwing away pages accessed by read-only queries,... These arguments seem bogus to me. Otherwise the startup process is going to do that work. Also there are two points which make prefetching into shared buffers more complex: 1. Need to spawn multiple workers to prefetch in parallel and somehow distribute work between them. I'm not even convinced that's true. It doesn't seem insane to have a queue of, say, 128 requests that are done with posix_fadvise WILLNEED, where the oldest request is read into shared buffers by the prefetcher. And then discarded from the page cache with WONTNEED. I think we're going to want a queue that's sorted in the prefetch process anyway, because there's a high likelihood that we'll otherwise issue prefetch requests for the same pages over and over again. That gets rid of most of the disadvantages: We have backpressure (because the read into shared buffers will block if not yet ready), we'll prevent double buffering, we'll prevent the startup process from doing the victim buffer search. I'm confused. I thought you wanted to prefetch directly to shared buffers, so that it also works with direct I/O in the future. But now you suggest using posix_fadvise() to work around the synchronous buffer read limitation. I don't follow ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
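Andres's queue idea can be sketched as a small data structure (hypothetical, not from any patch): a bounded, deduplicating queue where issuing a new advise past capacity first forces a blocking read of the oldest entry - that read is the backpressure.

```python
from collections import OrderedDict

class PrefetchQueue:
    """Bounded queue of advised block numbers, as discussed above.

    Blocks are posix_fadvise(WILLNEED)-ed when enqueued; once the queue is
    full, the oldest entry must be consumed (read into shared buffers,
    blocking if the I/O isn't done) before a new request is issued.  The
    OrderedDict doubles as the dedup structure, so repeated references to
    the same hot block are advised only once."""

    def __init__(self, capacity=128, advise=None, read_block=None):
        self.capacity = capacity
        self.advise = advise or (lambda blk: None)          # posix_fadvise stand-in
        self.read_block = read_block or (lambda blk: None)  # read-into-s_b stand-in
        self.pending = OrderedDict()

    def request(self, blk):
        if blk in self.pending:            # duplicate reference: skip re-advising
            self.pending.move_to_end(blk)
            return
        if len(self.pending) >= self.capacity:
            oldest, _ = self.pending.popitem(last=False)
            self.read_block(oldest)        # backpressure: blocks until I/O done
        self.advise(blk)
        self.pending[blk] = True
```

The dedup step is what the "queue that's sorted in the prefetch process" remark is after: without it, WAL that touches the same page repeatedly would generate a stream of redundant advises.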
Re: WAL prefetch
number of installations that set the timeout high, just so they can avoid FPWs. May be, but I am not so sure. This is why I will try to investigate it more. I'd say checkpoints already act as such a timeout (not only that, but people are setting it high to get rid of FPIs). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/11/2018 05:16 PM, Robert Haas wrote: On Wed, Jun 6, 2018 at 8:16 PM, Tomas Vondra wrote: ... and this is pretty much what Jeff Davis suggested, I think. The trouble is, each of those cases behaves nicely/terribly in different corner cases. That's a really good point. If the number of groups is pretty small compared to the number of input tuples, then you really only ever want to dump out transition values. By doing so, you minimize the amount of data you have to write. But if the number of groups is, say, half the number of input tuples, then computing transition values before you have all the values that belong to that group is probably a waste of time. I wonder if we could decide what to do by comparing the number of input tuples consumed to the number of groups created at the time we run out of memory. If we've got less than, I dunno, five tuples per group, then assume most groups are small. Pick a strategy that (mainly) spools input tuples. Otherwise, pick a strategy that spools transition tuples. In either case, it seems like we can pick a pure hashing strategy or switch to using both hashing and sorting. For example, IIUC, Andres's proposal involves spooling mostly input tuples, but can also spool transition tuples if the transition values consume more memory as they absorb more tuples. However, you could do a similar kind of thing without needing sort support. When you see a value that doesn't fit in your in-memory hash table, use the hash code to route it to 1 of N batch files. Have a second set of batch files for transition tuples in case you need to kick things out of the in-memory hash table. Once you finish reading the input, emit all the values that remain in the in-memory hash table and then process each batch file separately. Similarly, David's strategy involves spooling only transition tuples and then sorting on the group key, but it's again possible to do something similar without relying on sorting. 
Instead of flushing the in-memory hash table to a tuple store, split the transition tuples it contains among N batch files based on the hash code. Once you've read all of the input, go back and reprocess each batch file, combining transition values whenever the same group keys appear in more than one transition tuple. Yeah, essentially something like the batching in hash joins. To me, the pure-hashing strategies look a little simpler, but maybe there's enough performance benefit from combining hashing and sorting that it's worth the complexity, or maybe we should just accept whichever variant somebody's willing to code. But I think we almost have to have separate handling for many-row-per-group and few-rows-per-group, because those seem fundamentally different. I think the underlying question is whether we want to treat this as an emergency safety against OOM (which should be a rare thing in most cases) or something allowing us to pick hash aggregate more often. It would be great to get something that performs better than just falling back to sort (and I was advocating for that), but I'm worried we might be moving the goalposts way too far. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/11/2018 04:25 PM, Robert Haas wrote: > ... Maybe that's not exactly what Tomas (or you) meant by eviction strategy, though. Not sure. But it does seem to me that we need to pick the algorithm we're going to use, more or less, in order to decide what infrastructure we need, and at least to some extent, we ought to let our choice of algorithm be informed by the desire not to need too much infrastructure. I was using eviction strategy in a somewhat narrower sense - pretty much just "Which groups to evict from the hash table?" but you're right the question is more general and depends on which scheme we end up using (just hashagg, hash+sort, something else ...) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: GROUP BY optimization
- either by generating multiple paths with different orderings or some heuristics. I'm also wondering how freely we can change the group by result ordering. Assume you disable hash aggregate and parallel query - currently we are guaranteed to use group aggregate that produces exactly the ordering as specified in GROUP BY, but this patch removes that "guarantee" and we can produce an arbitrary permutation of the ordering. But as I argued in other threads, such implicit guarantees are really fragile, can silently break for arbitrary reasons (say, parallel query will do just that) and can be easily fixed by adding a proper ORDER BY. So I don't think this is an issue. The other random thought is how will/could this interact with the incremental sort patch. I know Alexander wanted to significantly limit where we actually consider the incremental sort, but I'm not sure if this was one of those places or not (it sure looks like a place where we would greatly benefit from it). So to summarize this initial review - I do suggest splitting the patch into two parts. One that does the index magic, and one for this reordering optimization. The first part (additional indexes) seems fairly safe and is likely to be committable soon. The other part (ndistinct reordering) IMHO requires more thought regarding costing and interaction with other query parts. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Improve geometric types
On 06/03/2018 11:50 PM, Tom Lane wrote: Tomas Vondra writes: The main remaining question I have is what to do with back-branches. Shall we back-patch this or not? Given the behavioral changes involved, I'd say "no way". That's reinforced by the lack of field complaints; if there were lots of complaints, maybe we'd be willing to break backwards compatibility, but ... Fair enough, I tend to over-estimate the importance of bugfixes and under-estimate breakage due to behavior change. But if we don't want to back-patch this, I'm fine with that. I was a bit worried about making future backpatches more painful, but this code received only ~20 commits over the past few years, half of that due to pgindent, so that seems to be a non-issue. But now I'm wondering what this means for existing indexes. Doesn't this effectively mean those are unlikely to give meaningful responses (in the old or new semantics)? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Improve geometric types
Hi Emre, Thanks for the rebased patch. I remember reviewing the patch in the last CF, and it seems in pretty good shape. I plan to look at it again in the next commitfest, but it seems to have been reviewed by other experienced people, so I'm not worried about this part. The main remaining question I have is what to do with back-branches. Shall we back-patch this or not? The trouble is that while the patch is essentially a bugfix, it refactors a quite significant amount of code to make the fixes practical. If it was possible to back-patch just the fixes without the refactoring, that would be ideal, but unfortunately that's not the case. Based on discussion with Emre in Ottawa that would be rather impractical due to the nature of the bugs and low code reuse. I do believe we should back-patch - after all, it fixes real bugs. It's true the bugs were there for years and no one noticed/reported them, but it's still buggy and that's not fun. Opinions? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 09:22 AM, David Rowley wrote: On 5 June 2018 at 17:04, Tomas Vondra wrote: On 06/05/2018 04:56 AM, David Rowley wrote: Isn't there still a problem determining when the memory exhaustion actually happens though? As far as I know, we've still little knowledge how much memory each aggregate state occupies. Jeff tried to solve this in [1], but from what I remember, there was too much concern about the overhead of the additional accounting code. [1] https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested a couple of people who were originally worried about the overhead seem to be accepting it now. Is there any great need to make everything pay the small price for this? Couldn't we just create a new MemoryContextMethod that duplicates aset.c, but has the required additional accounting built in at the implementation level, then make execGrouping.c use that allocator for its hash table? The code would not really need to be duplicated, we could just do things the same way as like.c does with like_match.c and include a .c file. We'd need another interface function in MemoryContextMethods to support getting the total memory allocated in the context, but that does not seem like a problem. There probably is not a need, but there was not a great way to enable it explicitly only for some contexts. IIRC what was originally considered back in 2015 was some sort of a flag in the memory context, and the overhead was about the same as with the int64 arithmetic alone. But I don't think we've considered copying the whole AllocSet. Maybe that would work, not sure. I wonder if an aggregate might use a custom context internally (I don't recall anything like that). 
The accounting capability seems potentially useful for other places, and those might not use AllocSet (or at least not directly). All of this of course assumes the overhead is still there. I sure will do some new measurements. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 04:56 AM, David Rowley wrote: > On 5 June 2018 at 06:52, Andres Freund wrote: >> That part has gotten a bit easier since, because we have serialize / >> deserialize operations for aggregates these days. > > True. Although not all built in aggregates have those defined. Not sure what to do about those, unfortunately. We could stop adding more groups and hope the aggregate state does not grow further, but that's about it. > >> I wonder whether, at least for aggregates, the better fix wouldn't be to >> switch to feeding the tuples into tuplesort upon memory exhaustion and >> doing a sort based aggregate. We have most of the infrastructure to do >> that due to grouping sets. It's just the pre-existing in-memory tuples >> that'd be problematic, in that the current transition values would need >> to serialized as well. But with a stable sort that'd not be >> particularly problematic, and that could easily be achieved. That's one of the possible solutions, yes. It's hard to say if it's better or worse than the other approaches, because that depends on the number of tuples vs. number of groups etc. Evicting some (or all) of the in-memory groups can be easily better or worse, depending on the data set. And I'm not sure the code complexity would be significantly different. I expect the eviction strategy to be the primary design challenge of this patch. The other bits will be mostly determined by this one piece. While the primary goal of the patch is addressing the OOM risks in hash aggregate, I think it'd be a mistake to see it just that way. I see it could allow us to perform hash aggregate more often, even if we know the groups won't fit into work_mem. If we could estimate how much of the aggregate state we'll have to spill to disk, we could still prefer hashagg over groupagg. We would pay the price for eviction, but on large data sets that can be massively cheaper than having to do sort. 
I admit this is a bit hand-wavy, and the costing depends on us being able to estimate the amount of evicted data. I certainly don't expect to get this right away (that would move the goal posts for the patch very far away), but it would be unfortunate to make this improvement impossible/more difficult in the future. > > Isn't there still a problem determining when the memory exhaustion > actually happens though? As far as I know, we've still little > knowledge how much memory each aggregate state occupies. > > Jeff tried to solve this in [1], but from what I remember, there was > too much concern about the overhead of the additional accounting code. > > [1] > https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com > I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested a couple of people who were originally worried about the overhead seem to be accepting it now. IMHO we do want a memory-bound hash aggregate, and doing some sort of memory accounting is a must-have part of that. I don't see a way around this, really. We need to minimize the overhead, of course. I do not recall the exact approach we ended up with in 2015, though, or what the measurements with that version were. I see 1-3% mentioned early in the thread, and there were some additional improvements in subsequent patch versions, I think. I don't think we can realistically improve on this (accounting at the block level), and there was some discussion whether this is actually overhead or merely due to a different binary layout. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 07:46 AM, Jeff Davis wrote: On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote: I expect the eviction strategy to be the primary design challenge of this patch. The other bits will be mostly determined by this one piece. Not sure I agree that this is the primary challenge. The cases that benefit from eviction are probably a minority. I see two categories that would benefit: * Natural clustering in the heap. This sounds fairly common, but a lot of the cases that come to mind are too low-cardinality to be compelling; e.g. timestamps grouped by hour/day/month. If someone has run into a high-cardinality natural grouping case, let me know. * ARRAY_AGG (or similar): individual state values can be large enough that we need to evict to avoid exceeding work_mem even if not adding any new groups. In either case, it seems like a fairly simple eviction strategy would work. For instance, we could just evict the entire hash table if work_mem is exceeded or if the hit rate on the hash table falls below a certain threshold. If there was really something important that should have stayed in the hash table, it will go back in soon anyway. So why should eviction be a major driver for the entire design? I agree it should be an area of improvement for the future, so let me know if you see a major problem, but I haven't been as focused on eviction. My concern is more about what happens when the input tuple ordering is inherently incompatible with the eviction strategy, greatly increasing the amount of data written to disk during evictions. Say for example that we can fit 1000 groups into work_mem, and that there are 2000 groups in the input data set. If the input is correlated with the groups, everything is peachy because we'll evict the first batch, and then group the remaining 1000 groups (or something like that). But if the input data is random (which can easily happen, e.g. for IP addresses, UUIDs and such) we'll hit the limit repeatedly and will evict much sooner. 
I know you suggested simply dumping the whole hash table and starting from scratch while we talked about this at pgcon, but ISTM it has exactly this issue. But I don't know if there actually is a better option - maybe we simply have to accept this problem. After all, running slowly is still better than OOM (which may or may not happen now). I wonder if we can somehow detect this at run-time and maybe fall back to groupagg. E.g. we could compare the number of groups vs. the number of input tuples when we first hit the limit. It's a rough heuristic, but maybe sufficient. While the primary goal of the patch is addressing the OOM risks in hash aggregate, I think it'd be a mistake to see it just that way. I see it could allow us to perform hash aggregate more often, even if we know the groups won't fit into work_mem. If we could estimate how much of the aggregate state we'll have to spill to disk, we could still prefer hashagg over groupagg. We would pay the price for eviction, but on large data sets that can be massively cheaper than having to do sort. Agreed. I ran some tests of my patch in the last round, and they strongly supported choosing HashAgg a lot more often. A lot of sort improvements have been made though, so I really need to run some new numbers. Yeah, Peter made the sort stuff a lot faster. But I think there are still a lot of cases where the hashagg would greatly reduce the amount of data that needs to be written to disk, and that's not something the sort improvements could improve. If you need to aggregate a 1TB table, it's still going to be ~1TB of data you need to write to disk during on-disk sort. But if there is only a reasonable number of groups, that greatly simplifies things. far away), but it would be unfortunate to make this improvement impossible/more difficult in the future. If you see anything that would make this difficult in the future, let me know. Sure. 
Sorry for being too hand-wavy in this thread, and possibly moving the goal posts further away :-/ I got excited by you planning to work on this again and perhaps a bit carried away ;-) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 02:49 PM, Andres Freund wrote: Hi, On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote: My concern is more about what happens when the input tuple ordering is inherently incompatible with the eviction strategy, greatly increasing the amount of data written to disk during evictions. Say for example that we can fit 1000 groups into work_mem, and that there are 2000 groups in the input data set. If the input is correlated with the groups, everything is peachy because we'll evict the first batch, and then group the remaining 1000 groups (or something like that). But if the input data is random (which can easily happen, e.g. for IP addresses, UUIDs and such) we'll hit the limit repeatedly and will evict much sooner. I know you suggested simply dumping the whole hash table and starting from scratch while we talked about this at pgcon, but ISTM it has exactly this issue. Yea, that's the case I was thinking of where going to sorting will very likely have better performance. I think it'd even be sensible to have a "skew tuple" like optimization. When detecting getting closer to memory exhaustion, move new groups to the tuplesort, but already hashed tuples stay in the hashtable. That'd still need tuples being moved to the sort in the cases where the transition values get too big (say array_agg), but that should be comparatively rare. I'm sure we could do better in selecting the hash-tabled values than just taking the first incoming ones, but that shouldn't be too bad. Not sure. I'd imagine the "compression" due to aggregating many tuples into a much smaller amount of memory would be a clear win, so I don't find the "let's just dump all input tuples into a file" very attractive. I think evicting a fraction of the aggregate state (say, ~25%) would work better - choosing the "oldest" items seems OK, as those likely represent many input tuples (thus having a high "compaction" ratio). Or we could evict items representing rare groups, to make space for the common ones. 
Perhaps a combined eviction strategy (10% of each type) would work well. I'm conveniently ignoring the fact that we don't have information to determine this, at the moment, of course. I'm sure it's possible to make better decisions based on some cost estimates, both at plan time and then during execution. That being said, I don't want to over-think / over-engineer this. Getting something that reduces the risk of OOM in the first step is a good enough improvement. If we can do something better next, great. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 03:41 PM, se...@rielau.com wrote: In our code base we have added a WithStats-Flavor for creating memory contexts. This api accepts a pointer to a metric for accounting and it is inherited by all subcontexts unless overridden. So we only needed to change context creation API where we wanted (such as TopTransactionContext, Message Context, ..) That's quite trivial, actually. I think that's pretty much what we tried when this patch was first discussed in 2015, but it added measurable overhead (1-3%) even when the accounting was disabled. Which is not quite desirable, of course. Also we have fixed all those missing hash spills - albeit based on the 9.6 hash table design I think. I don't think this part of the code changed all that much. If there is interest by the community we are very willing to share. Absolutely! I think it'd be great to share both the code and the reasoning behind the design. Cheers Serge Salesforce cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 03:17 PM, David Rowley wrote: On 6 June 2018 at 01:09, Andres Freund wrote: On 2018-06-06 01:06:39 +1200, David Rowley wrote: My concern is that only accounting memory for the group and not the state is only solving half the problem. It might be fine for aggregates that don't stray far from their aggtransspace, but for the other ones, we could still see OOM. If solving the problem completely is too hard, then a half fix (maybe 3/4) is better than nothing, but if we can get a design for a full fix before too much work is done, then isn't that better? I don't think we actually disagree. I was really primarily talking about the case where we can't really do better because we don't have serialization support. I mean we could just rescan from scratch, using a groupagg, but that obviously sucks. I don't think we do. To take yours to the 100% solution might just take adding the memory accounting to palloc that Jeff proposed a few years ago and use that accounting to decide when we should switch method. However, I don't quite fully recall how the patch accounted for memory consumed by sub-contexts and if getting the entire consumption required recursively looking at subcontexts. If that's the case then checking the consumption would likely cost too much if it was done after each tuple was aggregated. Yeah, a simple wrapper would not really fix the issue, because allocating memory in the subcontext would not update the accounting information. So we'd be oblivious of possibly large amounts of memory. I don't quite see how is this related to (not) having serialization support, as it's more about not knowing we already hit the memory limit. That being said, I don't think array_agg actually has the issue, at least since b419865a814abbca12bdd6eef6a3d5ed67f432e1 (it does behave differently in aggregate and non-aggregate contexts, IIRC). I don't know how common this issue is, though. 
I don't think any built-in aggregates create additional contexts, but I may be wrong. But we have this damn extensibility thing, where users can write their own aggregates ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/07/2018 02:18 AM, Andres Freund wrote: On 2018-06-06 17:17:52 -0700, Andres Freund wrote: On 2018-06-07 12:11:37 +1200, David Rowley wrote: On 7 June 2018 at 08:11, Tomas Vondra wrote: On 06/06/2018 04:11 PM, Andres Freund wrote: Consider e.g. a scheme where we'd switch from hashed aggregation to sorted aggregation due to memory limits, but already have a number of transition values in the hash table. Whenever the size of the transition values in the hashtable exceeds memory size, we write one of them to the tuplesort (with serialized transition value). From then on further input rows for that group would only be written to the tuplesort, as the group isn't present in the hashtable anymore. Ah, so you're suggesting that during the second pass we'd deserialize the transition value and then add the tuples to it, instead of building a new transition value. Got it. Having to deserialize every time we add a new tuple sounds terrible from a performance point of view. I didn't mean that we do that, and I don't think David understood it as that either. I was talking about the approach where the second pass is a sort rather than hash based aggregation. Then we would *not* need to deserialize more than exactly once. s/David/Tomas/, obviously. Sorry, it's been a long day. Solution is simple: drink more coffee. ;-) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/06/2018 04:11 PM, Andres Freund wrote: > On 2018-06-06 16:06:18 +0200, Tomas Vondra wrote: >> On 06/06/2018 04:01 PM, Andres Freund wrote: >>> Hi, >>> >>> On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote: >>>> The other issue is that serialize/deserialize is only a part of a problem - >>>> you also need to know how to do "combine", and not all aggregates can do >>>> that ... (certainly not in universal way). >>> >>> There are several schemes where only serialize/deserialize are needed, >>> no? There are a number of fairly sensible schemes where there won't be >>> multiple transition values for the same group, no? >>> >> >> Possibly, not sure what schemes you have in mind exactly ... >> >> But if you know there's only a single transition value, why would you need >> serialize/deserialize at all. Why couldn't you just finalize the value and >> serialize that? > > Because you don't necessarily have all the necessary input rows > yet. > > Consider e.g. a scheme where we'd switch from hashed aggregation to > sorted aggregation due to memory limits, but already have a number of > transition values in the hash table. Whenever the size of the transition > values in the hashtable exceeds memory size, we write one of them to the > tuplesort (with serialized transition value). From then on further input > rows for that group would only be written to the tuplesort, as the group > isn't present in the hashtable anymore. > Ah, so you're suggesting that during the second pass we'd deserialize the transition value and then add the tuples to it, instead of building a new transition value. Got it. That being said, I'm not sure if such generic serialize/deserialize can work, but I'd guess no, otherwise we'd probably use it when implementing the parallel aggregate. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: GROUP BY optimization
oice. Isn't "estimation of cost of comparing function/number of unique values in column could be not very accurate and so planner could make a wrong choice" more an argument against relying on it when doing these optimizations? FWIW it's one of the arguments Tom made in the incremental sort patch, which relies on it too when computing cost of the incremental sort. I'm sure it's going to be an obstacle there too. > I saw 2 times difference in real-world application. Again, improving > sort cost estimation is a separate task. Sure. But we also need to ask the other question, i.e. how many people would be negatively affected by the optimization. And I admit I don't know the answer to that, the next example is entirely made up. >> >> Imagine you have a custom data type that is expensive for comparisons. >> You know that, so you place it at the end of GROUP BY clauses, to >> reduce the number of comparisons on that field. And then we come along >> and just reorder the columns, placing it first, because it happens to >> have a high ndistinct statistic. And there's no way to get the >> original behavior :-( > Hm. that it could be, but I don't know how to improve here. Current > cost_sort() will return the same cost for any columns order. > > Okay, here we know an estimation of nrow, we could somehow find a > comparing function oid and find a its procost field. And then? we also > need to know width of tuple here and mostly repeat the cost_sort. > > Another option is an introducing enable_groupby_reorder GUC variable > although personally I don't like an idea to add new GUC variable. > I don't like the idea of yet another GUC either, as it's a rather blunt instrument. Which is why I'm suggesting to split the patch into two parts. Then we can apply the index stuff (which seems relatively straightforward) and keep working on this second part. 
FWIW I think it would be useful to have "development GUC" that would allow us to enable/disable these options during development, because that makes experiments much easier. But then remove them before commit. >>> Agree, but I don't know how to use it here. Except, may be: >>> 1) first column - the column with bigger estimated number of groups >>> 2) second column - the pair of (first, second) with bigger estimated >>> number of groups >>> 3) third column - the triple of (first, second, third) with bigger ... >>> >>> But seems even with that algorithm, ISTM, it could be implemented in >>> cheaper manner. >>> >> >> Maybe. I do have some ideas, although I haven't tried implementing it. > Implemented, pls, have a look. > >> >> If there's no extended statistic on the columns, you can do the >> current thing (assuming independence etc.). There's not much we can do >> here. > Agree > >> >> If there's an extended statistic, you can do either a greedy search >> (get the next column with the highest ndistinct coefficient) or >> exhaustive search (computing the estimated number of comparisons). >> >> Another challenge is that using only the ndistinct coefficient assumes >> uniform distribution of the values. But you can have a column with 1M >> distinct values, where a single value represents 99% of the rows. And >> another column with 100k distinct values, with actual uniform >> distribution. I'm pretty sure it'd be more efficient to place the 100k >> column first. > > Interesting. Will think, thank you > You're welcome. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: GROUP BY optimization
On 06/06/2018 11:22 PM, Claudio Freire wrote: > On Wed, Jun 6, 2018 at 5:43 PM Tomas Vondra > wrote: >> >>>>>> For example, it seems to disregard that different data types have >>>>>> different comparison costs. For example comparing bytea will be far >>>>>> more expensive compared to int4, so it may be much more efficient to >>>>>> compare int4 columns first, even if there are far fewer distinct >>>>>> values in them. >>>>> as I can see cost_sort() doesn't pay attention to this details. And >>>>> it should be a separated patch to improve that. >>>>> >>>> >>>> But sort also does not reorder columns. >>> Yes. But estimation of cost of comparing function/number of unique >>> values in column could be not very accurate and so planner could make >>> a wrong choice. > > ... >> >>>> Imagine you have a custom data type that is expensive for comparisons. >>>> You know that, so you place it at the end of GROUP BY clauses, to >>>> reduce the number of comparisons on that field. And then we come along >>>> and just reorder the columns, placing it first, because it happens to >>>> have a high ndistinct statistic. And there's no way to get the >>>> original behavior :-( >>> Hm. that it could be, but I don't know how to improve here. Current >>> cost_sort() will return the same cost for any columns order. >>> >>> Okay, here we know an estimation of nrow, we could somehow find a >>> comparing function oid and find a its procost field. And then? we also >>> need to know width of tuple here and mostly repeat the cost_sort. >>> >>> Another option is an introducing enable_groupby_reorder GUC variable >>> although personally I don't like an idea to add new GUC variable. >>> >> >> I don't like the idea of yet another GUC either, as it's rather blunt >> instrument. Which is why I'm suggesting to split the patch into two >> parts. Then we can apply the index stuff (which seems relatively >> straightforward) and keep working on this second part. 
>> >> FWIW I think it would be useful to have a "development GUC" that would >> allow us to enable/disable these options during development, because >> that makes experiments much easier. But then remove them before commit. > > Correct me if I'm wrong, but doesn't this patch concern itself with > precisely sort performance? > Yes, the second part (reordering pathkeys by ndistinct) does. > As such, estimating sort performance correctly in the various plan > variants being considered seems to be a very central aspect of it. > > This means it's pretty much time to consider the effect of operator > cost *and* distinct values in the cost computation. > Yes, until now that was not really needed because the optimizer does not consider reordering of the pathkeys - it simply grabs either GROUP BY or ORDER BY pathkeys and runs with that. So the costing was fairly trivial, we simply do something like comparison_cost = 2.0 * cpu_operator_cost; sort_cost = comparison_cost * tuples * LOG2(tuples); which essentially ignores that there might be multiple columns, or that the columns may have sort operators with different costs. The question is how reliable the heuristics can be. The current patch uses just plain ndistinct, but that seems rather unreliable, and I don't have a clear idea how to improve it - we may have MCVs for the columns and perhaps some extended statistics, but I'm not sure how far we can run with that. Essentially what we need is to estimate the number of comparisons for each column, to compute a better comparison_cost. > > Assuming cost_sort does its thing, it's just a matter of building the > desired variants and picking the one with the smallest cost_sort. 
One > can use heuristics to build a subset of all possible column orderings > with a guiding principle that tries to maximize the likelihood of > finding a better order than the one in the query (like sorting by > ndistinct), but I'd suggest: > > - Always considering the sort order verbatim as given in the SQL (ie: > what the user requests) > - Relying on cost_sort to distinguish among them, and pick a winner, and > - When in doubt (if cost_sort doesn't produce a clear winner), use the > order given by the user > I don't see why not to generate all possible orderings (keeping only the cheapest path for each pathkey variant) and let the optimizer sort it out. If the user specified an explicit ORDER BY, we'll slap an extra Sort node at the end, increasing the cost. > Comparison cost can be approximated proba
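The costing discussed above can be sketched in a few lines: the current single-number formula (comparison_cost = 2.0 * cpu_operator_cost, sort_cost = comparison_cost * tuples * LOG2(tuples)), next to a hypothetical per-column refinement that weights each column by its procost and by how often a comparison falls through to it. This is only an illustration under the uniform-distribution assumption the thread questions, not the patch's actual implementation:

```python
import math

# CPU_OPERATOR_COST mirrors PostgreSQL's default cpu_operator_cost;
# everything else is an illustrative sketch, not planner code.
CPU_OPERATOR_COST = 0.0025

def simple_sort_cost(tuples):
    """The current single-number formula quoted in the mail."""
    comparison_cost = 2.0 * CPU_OPERATOR_COST
    return comparison_cost * tuples * math.log2(tuples)

def per_column_sort_cost(tuples, columns):
    """Hypothetical refinement: columns is a list of (procost,
    ndistinct) pairs in sort order. A comparison reaches column i only
    when all earlier columns compared equal, approximated here by
    1/ndistinct (assuming uniform distributions)."""
    total = 0.0
    reach = 1.0  # fraction of comparisons reaching this column
    for procost, ndistinct in columns:
        total += reach * procost * CPU_OPERATOR_COST
        reach /= max(ndistinct, 1)
    return total * tuples * math.log2(tuples)

# a cheap, high-ndistinct column first beats an expensive column first
cheap_first = per_column_sort_cost(1e6, [(1, 100000), (100, 10)])
costly_first = per_column_sort_cost(1e6, [(100, 10), (1, 100000)])
print(cheap_first < costly_first)  # True
```

This also shows why the thread's custom-type example matters: the model rewards placing the expensive comparator where few comparisons reach it.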
Re: Spilling hashed SetOps and aggregates to disk
On 06/07/2018 02:11 AM, David Rowley wrote: > On 7 June 2018 at 08:11, Tomas Vondra wrote: >> On 06/06/2018 04:11 PM, Andres Freund wrote: >>> Consider e.g. a scheme where we'd switch from hashed aggregation to >>> sorted aggregation due to memory limits, but already have a number of >>> transition values in the hash table. Whenever the size of the transition >>> values in the hashtable exceeds memory size, we write one of them to the >>> tuplesort (with serialized transition value). From then on further input >>> rows for that group would only be written to the tuplesort, as the group >>> isn't present in the hashtable anymore. >>> >> >> Ah, so you're suggesting that during the second pass we'd deserialize >> the transition value and then add the tuples to it, instead of building >> a new transition value. Got it. > > Having to deserialize every time we add a new tuple sounds terrible > from a performance point of view. > I don't think Andres was suggesting that. I think the scheme was to sort the tuples and read all the tuples for a group at once (so we would deserialize just once). > Can't we just: > > 1. HashAgg until the hash table reaches work_mem. > 2. Spill the entire table to disk. > 3. Destroy the table and create a new one. > 4. If more tuples: goto 1 > 5. Merge sort and combine each dumped set of tuples. > ... and this is pretty much what Jeff Davis suggested, I think. The trouble is, each of those cases behaves nicely/terribly in different corner cases. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
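The five-step scheme quoted above can be modeled compactly. This toy uses count-per-key as the (combinable) aggregate and a cap on group count in place of work_mem, so it illustrates only the control flow of spill / restart / combine, not the executor code:

```python
from collections import Counter

# Toy model of the scheme: 1. hash-aggregate until the table is full,
# 2. spill the entire table, 3. destroy it and start fresh,
# 4. repeat while input remains, 5. combine the spilled partials.

def spilling_hashagg(rows, max_groups):
    """max_groups plays the role of work_mem: the largest hash table
    allowed before spilling (an assumption of this sketch)."""
    spilled = []          # "disk": list of partial result tables
    table = Counter()
    for key in rows:
        if key not in table and len(table) >= max_groups:
            spilled.append(table)   # 2. spill the entire table
            table = Counter()       # 3. destroy, create a new one
        table[key] += 1             # 1. hash-aggregate
    spilled.append(table)
    # 5. combine each dumped set of partial results per key
    result = Counter()
    for part in spilled:
        result.update(part)
    return dict(result)

rows = ['a', 'b', 'c', 'a', 'd', 'b', 'a']
print(spilling_hashagg(rows, max_groups=2))
# {'a': 3, 'b': 2, 'c': 1, 'd': 1}
```

Note how a group can appear in several spilled partials ('a' above), which is exactly why the final combine step, and a combinable transition state, is required.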
Re: POC: GROUP BY optimization
On 06/07/2018 12:18 AM, Claudio Freire wrote: > On Wed, Jun 6, 2018 at 6:57 PM Tomas Vondra > wrote: >> >> On 06/06/2018 11:22 PM, Claudio Freire wrote: >>> On Wed, Jun 6, 2018 at 5:43 PM Tomas Vondra >>> As such, estimating sort performance correctly in the various plan >>> variants being considered seems to be a very central aspect of it. >>> >>> This means it's pretty much time to consider the effect of operator >>> cost *and* distinct values in the cost computation. >>> >> >> Yes, until now that was not really needed because the optimizer does not >> consider reordering of the pathkeys - it simply grabs either GROUP BY or >> ORDER BY pathkeys and runs with that. >> >> So the costing was fairly trivial, we simply do something like >> >> comparison_cost = 2.0 * cpu_operator_cost; >> >> sort_cost = comparison_cost * tuples * LOG2(tuples); >> >> which essentially ignores that there might be multiple columns, or that >> the columns may have sort operator with different costs. >> >> The question is how reliable the heuristics can be. The current patch >> uses just plain ndistinct, but that seems rather unreliable but I don't >> have a clear idea how to improve that - we may have MCV for the columns >> and perhaps some extended statistics, but I'm not sure how far we can >> run with that. >> >> Essentially what we need to estimate the number of comparisons for >> each column, to compute better comparison_cost. > > This could be approached by adjusting statistics by any relevant > restrictions applicable to the columns being grouped, as is done for > selectivity estimations. > I don't follow how is this related to restrictions? Can you elaborate? > I'm not sure how far that would get us, though. It would be rather > easy to lose sight of those restrictions if there are complex > operations involved. > >>> Assuming cost_sort does its thing, it's just a matter of building the >>> desired variants and picking the one with the smallest cost_sort. 
One >>> can use heuristics to build a subset of all possible column orderings >>> with a guiding principle that tries to maximize the likelihook of >>> finding a better order than the one in the query (like sorting by >>> ndistinct), but I'd suggest: >>> >>> - Always considering the sort order verbatim as given in the SQL (ie: >>> what the user requests) >>> - Relying on cost_sort to distinguish among them, and pick a winner, and >>> - When in doubt (if cost_sort doesn't produce a clear winner), use the >>> order given by the user >>> >> >> I don't see why not to generate all possible orderings (keeping only the >> cheapest path for each pathkey variant) and let the optimizer to sort it >> out. > > I'm assuming costing the full N! possible orderings would be > prohibitively expensive. > That's possible, yes - particularly for large N values. Perhaps there's a simpler algorithm to compute a sufficient approximation. In a greedy way, starting from some likely-good combination of columns, or so. I haven't thought about this part very much ... >> If the user specified an explicit ORDER BY, we'll slap an extra >> Sort by at the end, increasing the cost. > > I think you misunderstood me. I'm saying the exact opposite. > > When I mention handicap, I mean to give the explicitly requested group > by order a *reduced* cost, to give it an advantage over the > heuristics. > I did understand you. I'm saying that if the ordering does not match the one requested by the user (in ORDER BY), we end up adding an explicit Sort node on top of it, which increases the cost of all the other paths, acting as a disadvantage. But now I realize this only works for when there is an ORDER BY clause, not for GROUP BY (in which case we don't add any Sort node). > This is to try to attack the problem you mentioned where users already > accounting for operator costs in their queries would get overridden by > the planner, perhaps in detriment of overall execution time. 
> > In essence: > > - If the user requested that order, we assume it "somewhat > subjectively better" (by giving it a slightly reduced cost). > - If there is an explicit order by clause that differs from a > considered path, the required sort node will already penalize it > appropriately, nothing needs to be done in relation to sort costs. > - All other things being equal, cost_sort will make the decision. If a > plan beats the user-provided order in spite of the handicap, it will > be used. So it's still optimizing clear cases. > OK. I
Re: Spilling hashed SetOps and aggregates to disk
On 06/06/2018 04:01 PM, Andres Freund wrote: > Hi, > > On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote: >> The other issue is that serialize/deserialize is only a part of a problem - you also need to know how to do "combine", and not all aggregates can do that ... (certainly not in a universal way). > > There are several schemes where only serialize/deserialize are needed, no? There are a number of fairly sensible schemes where there won't be multiple transition values for the same group, no? Possibly, not sure what schemes you have in mind exactly ... But if you know there's only a single transition value, why would you need serialize/deserialize at all? Why couldn't you just finalize the value and serialize that? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: effect of JIT tuple deform?
On 06/26/2018 09:25 PM, Pavel Stehule wrote: > Hi > > ... > > So I am able to see the effect of jit_tuple_deforming, and it works very well, but only if optimization is active. When optimization is not active, jit_tuple_deforming causes a slowdown. So maybe the use of jit_tuple_deforming could be made conditional on jit_optimization? Can you share the test case and some detail about the hardware and PostgreSQL configuration? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WAL prefetch
On 06/27/2018 11:44 AM, Konstantin Knizhnik wrote: > ... > > I have improved my WAL prefetch patch. The main reason recovery was slower with prefetch enabled was that it didn't take into account initialized pages (XLOG_HEAP_INIT_PAGE) and didn't remember (cache) full-page writes. The main differences in the new version of the patch: > > 1. Use effective_cache_size as the size of the cache of prefetched blocks. > 2. Do not prefetch blocks present in shared buffers. > 3. Do not prefetch blocks for RM_HEAP_ID with the XLOG_HEAP_INIT_PAGE bit set. > 4. Remember new/FPW pages in the prefetch cache, to avoid prefetching them for subsequent WAL records. > 5. Add min/max prefetch lead parameters to make it possible to synchronize the speed of prefetch with the speed of replay. > 6. Increase the size of the open file cache to avoid redundant open/close operations. Thanks. I plan to look at it and do some testing, but I won't have time until the end of next week (probably). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
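Points 1 and 4 of the list above amount to keeping an LRU cache of recently prefetched (or known-initialized/FPW) blocks, consulted before issuing a prefetch. A toy sketch, with the capacity standing in for effective_cache_size; this models only the bookkeeping, not the patch itself:

```python
from collections import OrderedDict

# An LRU set of block numbers we have already prefetched (or that a
# full-page write / init page made redundant to prefetch).

class PrefetchCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def should_prefetch(self, block):
        """True (and remember the block) if a prefetch is worthwhile;
        False if the block was seen recently and the prefetch would be
        redundant."""
        if block in self.blocks:
            self.blocks.move_to_end(block)  # refresh LRU position
            return False
        self.blocks[block] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the oldest entry
        return True

cache = PrefetchCache(capacity=2)
print([cache.should_prefetch(b) for b in [1, 2, 1, 3, 1]])
# [True, True, False, True, False]
```

Block 3 is prefetched because it is new, while the repeated block 1 is skipped as long as it stays within the cache's capacity.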
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Hi all, Attached is a rebased version of this patch series, mostly just fixing the breakage caused by the reworked format of initial catalog data. Aside from that, the MCV building now adopts the logic introduced by commit b5db1d93d2 for single-column MCV lists. The new algorithm seems pretty good and I don't see why multi-column MCV lists should use something special. I'm sure there are plenty of open questions to discuss, particularly stuff related to combining the various types of statistics into the final estimate (a lot of that was already improved based on Dean's reviews). One thing that occurred to me while comparing the single-column logic (as implemented in selfuncs.c) and the new multi-column stuff, is dealing with partially-matching histogram buckets. In the single-column case, we pretty much assume uniform distribution in each bucket, and linearly interpolate the selectivity. So for a bucket with boundaries [0, 10] and condition "x <= 5" we return 0.5, for "x < 7" we return 0.7 and so on. This is what convert_to_scalar() does. In the multi-column case, we simply count each matching bucket as 0.5, without any attempt to linearly interpolate. It would not be difficult to call "convert_to_scalar" for each condition (essentially repeating the linear interpolation for each column), but then what? We could simply compute a product of those results, of course, but that only works assuming independence. And that's exactly the wrong thing to assume here, considering the extended statistics are meant for cases where the columns are not independent. So I'd argue the 0.5 estimate for partially-matching buckets is the right thing to do here, as it minimizes the average error. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-multivariate-MCV-lists.patch.gz Description: application/gzip 0002-multivariate-histograms.patch.gz Description: application/gzip
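The single-column convention described above can be sketched as follows. This mirrors the linear-interpolation idea of convert_to_scalar(), not its actual (type-specific) implementation:

```python
# Fraction of a histogram bucket [lo, hi] satisfying "x <= bound",
# assuming a uniform distribution within the bucket and interpolating
# linearly, as described in the mail.

def bucket_fraction(lo, hi, bound):
    if bound <= lo:
        return 0.0
    if bound >= hi:
        return 1.0
    return (bound - lo) / (hi - lo)

print(bucket_fraction(0, 10, 5))   # 0.5
print(bucket_fraction(0, 10, 7))   # 0.7
```

The open question in the mail is what to do with these per-column fractions in the multi-column case: multiplying them assumes independence, which is the very assumption extended statistics exist to avoid, hence the flat 0.5 per partially-matching bucket.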
Re: WIP: BRIN multi-range indexes
On 06/24/2018 11:39 PM, Thomas Munro wrote: > On Sun, Jun 24, 2018 at 2:01 PM, Tomas Vondra > wrote: >> Attached is rebased version of this BRIN patch series, fixing mostly the >> breakage due to 372728b0 (aka initial-catalog-data format changes). As >> 2018-07 CF is meant for almost-ready patches, this is more a 2018-09 >> material. But perhaps someone would like to take a look - and I'd have >> to fix it anyway ... > > Hi Tomas, > > FYI Windows doesn't like this: > > src/backend/access/brin/brin_bloom.c(146): warning C4013: 'round' > undefined; assuming extern returning int > [C:\projects\postgresql\postgres.vcxproj] > > brin_bloom.obj : error LNK2019: unresolved external symbol round > referenced in function bloom_init > [C:\projects\postgresql\postgres.vcxproj] > Thanks, I've noticed the failure before, but was not sure what's the exact cause. It seems there's still no 'round' on Windows, so I'll probably fix that by using rint() instead, or something like that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP: BRIN multi-range indexes
On 06/25/2018 12:31 AM, Tomas Vondra wrote: > On 06/24/2018 11:39 PM, Thomas Munro wrote: >> On Sun, Jun 24, 2018 at 2:01 PM, Tomas Vondra >> wrote: >>> Attached is rebased version of this BRIN patch series, fixing mostly the >>> breakage due to 372728b0 (aka initial-catalog-data format changes). As >>> 2018-07 CF is meant for almost-ready patches, this is more a 2018-09 >>> material. But perhaps someone would like to take a look - and I'd have >>> to fix it anyway ... >> >> Hi Tomas, >> >> FYI Windows doesn't like this: >> >> src/backend/access/brin/brin_bloom.c(146): warning C4013: 'round' >> undefined; assuming extern returning int >> [C:\projects\postgresql\postgres.vcxproj] >> >> brin_bloom.obj : error LNK2019: unresolved external symbol round >> referenced in function bloom_init >> [C:\projects\postgresql\postgres.vcxproj] >> > > Thanks, I've noticed the failure before, but was not sure what's the > exact cause. It seems there's still no 'round' on Windows, so I'll > probably fix that by using rint() instead, or something like that. > OK, here is a version tweaked to use floor()/ceil() instead of round(). Let's see if the Windows machine likes that more. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-Pass-all-keys-to-BRIN-consistent-function-a-20180625.patch.gz Description: application/gzip 0002-Move-IS-NOT-NULL-checks-to-bringetbitmap-20180625.patch.gz Description: application/gzip 0003-BRIN-bloom-indexes-20180625.patch.gz Description: application/gzip 0004-BRIN-multi-range-minmax-indexes-20180625.patch.gz Description: application/gzip
Re: WIP: BRIN multi-range indexes
general). Of course, how to size the bloom filter? It's worth noting the false positive rate of the filter is essentially the fraction of a table that will be scanned every time. Similarly to the minmax-multi, parameters for computing the optimal filter size are set as reloptions (false_positive_rate, n_distinct_per_range) with some reasonable defaults (1% false positive rate and distinct values 10% of maximum heap tuples in a page range). Note: When building the filter, we don't compute the hashes from the original values, but we first use the type-specific hash function (the same we'd use for hash indexes or hash joins) and then use the hash as an input for the bloom filter. This generally works fine, but if "our" hash function generates a lot of collisions, it increases the false positive ratio of the whole filter. I'm not aware of a case where this would be an issue, though. What further complicates sizing of the bloom filter is the available space - the whole bloom filter needs to fit onto an 8kB page, and "full" bloom filters with about 1/2 the bits set are pretty non-compressible. So there's maybe ~8000 bytes for the bitmap. So for columns with many distinct values, it may be necessary to make the page range smaller, to reduce the number of distinct values in it. And of course it requires good ndistinct estimates, not just for the column as a whole, but for a single page range (because that's what matters for sizing the bloom filter). Which is not a particularly reliable estimate, I'm afraid. So reloptions seem like a sufficient solution, at least for now. open questions == * I suspect the definitions of cross-type opclasses (int2 vs. int8) are not entirely correct. That probably needs another look. * The bloom filter now works in two modes - sorted (storing the hashes directly) and hashed (the usual bloom filter behavior). 
The idea is that for ranges with significantly fewer distinct values, we only store those to save space (instead of allocating the whole bloom filter with mostly 0 bits). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz Description: application/gzip 0002-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz Description: application/gzip 0003-BRIN-bloom-indexes.patch.gz Description: application/gzip 0004-BRIN-multi-range-minmax-indexes.patch.gz Description: application/gzip
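The sizing discussed above follows the standard Bloom-filter formulas m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. A sketch using the defaults mentioned in the mail (1% false positive rate, ndistinct at 10% of the maximum heap tuples in a range); the 128 pages per range and 291 tuples per page used below are assumptions for illustration, not values taken from the patch:

```python
import math

# Standard Bloom-filter sizing; parameter names echo the reloptions
# above (false_positive_rate, n_distinct_per_range), but this function
# is only an illustration, not PostgreSQL code.

def bloom_size(n_distinct_per_range, false_positive_rate):
    """Return (bits, hash_functions) for the requested error rate."""
    m = math.ceil(-n_distinct_per_range * math.log(false_positive_rate)
                  / (math.log(2) ** 2))
    k = max(1, round(m / n_distinct_per_range * math.log(2)))
    return m, k

# assumed: 128 pages per range, 291 max heap tuples per 8kB page, 10%
n = int(128 * 291 * 0.1)
bits, k = bloom_size(n, 0.01)
print(bits // 8, k)  # roughly 4.5 kB and 7 hashes, under the ~8000 bytes above
```

Under these assumptions the filter fits the single-page budget comfortably; halving the false positive rate or doubling ndistinct quickly pushes it toward the limit, which is why smaller page ranges may be needed for high-cardinality columns.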
Re: Push down Aggregates below joins
On 06/20/2018 10:12 PM, Heikki Linnakangas wrote: > Currently, the planner always first decides the scan/join order, and > adds Group/Agg nodes on top of the joins. Sometimes it would be legal, > and beneficial, to perform the aggregation below a join. I've been > hacking on a patch to allow that. > There was a patch [1] from Antonin Houska aiming to achieve something similar. IIRC it aimed to push the aggregate down in more cases, leveraging the partial aggregation stuff. I suppose your patch only aims to do the pushdown when the two-phase aggregation is not needed? [1] https://www.postgresql.org/message-id/flat/9666.1491295317@localhost regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
memory leak when serializing TRUNCATE in reorderbuffer
Hi, While rebasing the logical replication patches on top of PG11, I've noticed that ReorderBufferSerializeChange claims this: case REORDER_BUFFER_CHANGE_TRUNCATE: ... /* ReorderBufferChange contains everything important */ That is not quite correct, though - the OIDs of truncated relations are stored in a separately palloc-ed array. So we only serialize the pointer to that array (which is part of ReorderBufferChange) and then read it back when restoring the change from disk. Now, this can't cause crashes, because the 'relids' array won't go away (more on this later), and so the pointer remains valid. But it's a memory leak - a quite small and not very common one, because people don't do TRUNCATE very often, particularly not with many tables. So I think we should fix this and serialize/restore the OID array, just like we do for tuples, snapshots etc. See the attached fix. Another thing we should probably reconsider is where the relids array is allocated - the pointer remains valid because we happen to allocate it in TopMemoryContext. It's not that bad because we don't free the other reorderbuffer contexts until the walsender exits anyway, but still. So I propose to allocate it in rb->context just like the other bits of data (snapshots, ...). Replacing the palloc() in DecodeTruncate() with something like: MemoryContextAlloc(ctx->reorder->context, xlrec->nrelids * sizeof(Oid)); should do the trick. The other places in decode.c don't allocate memory directly but call ReorderBufferGetTupleBuf() instead - perhaps we should introduce a similar wrapper here too. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e2f59bf580..582bedf1de 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -412,10 +412,14 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 		/* no data in addition to the struct itself */
+		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			if (change->data.truncate.relids != NULL)
+				pfree(change->data.truncate.relids);
+			change->data.truncate.relids = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-		case REORDER_BUFFER_CHANGE_TRUNCATE:
 			break;
 	}
@@ -2289,6 +2293,26 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			break;
 		}
 		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				Size		size;
+				char	   *data;
+
+				/* account for the OIDs of truncated relations */
+				size = sizeof(Oid) * change->data.truncate.nrelids;
+				sz += size;
+
+				/* make sure we have enough space */
+				ReorderBufferSerializeReserve(rb, sz);
+
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+
+				memcpy(data, change->data.truncate.relids, size);
+				data += size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
@@ -2569,6 +2593,16 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		}
 		/* the base struct contains all the data, easy peasy */
 		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				Size		size;
+
+				size = change->data.truncate.nrelids * sizeof(Oid);
+
+				change->data.truncate.relids = MemoryContextAllocZero(rb->context, size);
+
+				memcpy(change->data.truncate.relids, data, size);
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
Re: WAL prefetch
On 06/19/2018 02:33 PM, Konstantin Knizhnik wrote: On 19.06.2018 14:03, Tomas Vondra wrote: On 06/19/2018 11:08 AM, Konstantin Knizhnik wrote: ... >>> Also there are two points which make prefetching into shared buffers more complex: 1. Need to spawn multiple workers to make prefetch in parallel and somehow distribute work between them. 2. Synchronize work of the recovery process with prefetch, to prevent prefetch from going too far and doing useless work. The same problem exists for prefetch into the OS cache, but there the risk of a false prefetch is less critical. I think the main challenge here is that all buffer reads are currently synchronous (correct me if I'm wrong), while posix_fadvise() allows prefetching the buffers asynchronously. Yes, this is why we have to spawn several concurrent background workers to perform prefetch. Right. My point is that while spawning bgworkers probably helps, I don't expect it to be enough to fill the I/O queues on modern storage systems. Even if you start say 16 prefetch bgworkers, that's not going to be enough for large arrays or SSDs. Those typically need way more than 16 requests in the queue. Consider for example [1] from 2014 where Merlin reported how S3500 (Intel SATA SSD) behaves with different effective_io_concurrency values: [1] https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com Clearly, you need to prefetch 32/64 blocks or so. Consider you may have multiple such devices in a single RAID array, and that this device is from 2014 (and newer flash devices likely need even deeper queues). ISTM a small number of bgworkers is not going to be sufficient. It might be enough for WAL prefetching (where we may easily run into the redo-is-single-threaded bottleneck), but it's hardly a solution for bitmap heap scans, for example. We'll need to invent something else for that. 
OTOH my guess is that whatever solution we'll end up implementing for bitmap heap scans, it will be applicable for WAL prefetching too. Which is why I'm suggesting simply using posix_fadvise is not going to make the direct I/O patch significantly more complicated. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
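For reference, the posix_fadvise() mechanism discussed above is exposed in Python as os.posix_fadvise() (Linux/Unix only). A minimal sketch of issuing asynchronous readahead hints for 8 kB blocks; hints are queued without blocking, which is how even a single process can keep a deep device queue busy:

```python
import os

# Hint the OS to read blocks ahead of use via
# posix_fadvise(POSIX_FADV_WILLNEED). The call returns immediately;
# the kernel performs the reads asynchronously.

BLCKSZ = 8192  # PostgreSQL block size

def prefetch_blocks(path, blocknums):
    """Issue readahead hints for the given 8 kB block numbers."""
    fd = os.open(path, os.O_RDONLY)
    try:
        for blkno in blocknums:
            os.posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ,
                             os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

Because many such hints can be outstanding at once, the achievable queue depth is bounded by how fast hints are issued, not by the number of processes, which is the crux of the argument above.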
Re: WAL prefetch
On 06/19/2018 04:50 PM, Konstantin Knizhnik wrote: On 19.06.2018 16:57, Ants Aasma wrote: On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra wrote: Right. My point is that while spawning bgworkers probably helps, I don't expect it to be enough to fill the I/O queues on modern storage systems. Even if you start say 16 prefetch bgworkers, that's not going to be enough for large arrays or SSDs. Those typically need way more than 16 requests in the queue. Consider for example [1] from 2014 where Merlin reported how S3500 (Intel SATA SSD) behaves with different effective_io_concurrency values: [1] https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com Clearly, you need to prefetch 32/64 blocks or so. Consider you may have multiple such devices in a single RAID array, and that this device is from 2014 (and newer flash devices likely need even deeper queues). For reference, a typical datacenter SSD needs a queue depth of 128 to saturate a single device. [1] Multiply that appropriately for RAID arrays. So how it is related with results for S3500 where this is almost now performance improvement for effective_io_concurrency >8? Starting 128 or more workers for performing prefetch is definitely not acceptable... I'm not sure what you mean by "almost now performance improvement", but I guess you meant "almost no performance improvement" instead? If that's the case, it's not quite true - increasing the queue depth above 8 further improved the throughput by about ~10-20% (both by duration and peak throughput measured by iotop). But more importantly, this is just a single device - you typically have multiple of them in larger arrays, to get better capacity, performance and/or reliability. So if you have 16 such drives, and you want to send at least 8 requests to each, suddenly you need at least 128 requests. 
And as pointed out before, the S3500 is about a 5-year-old device (it was introduced in Q2/2013). On newer devices the difference is usually way more significant / the required queue depth is much higher. Obviously, this is a somewhat simplified view, ignoring various details (e.g. that there may be multiple concurrent queries, each sending I/O requests - what matters is the combined number of requests, of course). But I don't think this makes a huge difference. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 12/23/2017 03:03 PM, Erikjan Rijkers wrote: > On 2017-12-23 05:57, Tomas Vondra wrote: >> Hi all, >> >> Attached is a patch series that implements two features to the logical >> replication - ability to define a memory limit for the reorderbuffer >> (responsible for building the decoded transactions), and ability to >> stream large in-progress transactions (exceeding the memory limit). >> > > logical replication of 2 instances is OK but 3 and up fail with: > > TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: > "reorderbuffer.c", Line: 1773) > > I can cobble up a script but I hope you have enough from the assertion > to see what's going wrong... The assertion says that the iterator produces changes in order that does not correlate with LSN. But I have a hard time understanding how that could happen, particularly because according to the line number this happens in ReorderBufferCommit(), i.e. the current (non-streaming) case. So instructions to reproduce the issue would be very helpful. Attached is v2 of the patch series, fixing two bugs I discovered today. I don't think any of these is related to your issue, though. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch.gz Description: application/gzip 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch.gz Description: application/gzip 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch.gz Description: application/gzip 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch.gz Description: application/gzip 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch.gz Description: application/gzip 0006-Add-support-for-streaming-to-built-in-replication-v2.patch.gz Description: application/gzip
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 12/23/2017 11:23 PM, Erik Rijkers wrote: > On 2017-12-23 21:06, Tomas Vondra wrote: >> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote: >>> On 2017-12-23 05:57, Tomas Vondra wrote: >>>> Hi all, >>>> >>>> Attached is a patch series that implements two features to the logical >>>> replication - ability to define a memory limit for the reorderbuffer >>>> (responsible for building the decoded transactions), and ability to >>>> stream large in-progress transactions (exceeding the memory limit). >>>> >>> >>> logical replication of 2 instances is OK but 3 and up fail with: >>> >>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: >>> "reorderbuffer.c", Line: 1773) >>> >>> I can cobble up a script but I hope you have enough from the assertion >>> to see what's going wrong... >> >> The assertion says that the iterator produces changes in order that does >> not correlate with LSN. But I have a hard time understanding how that >> could happen, particularly because according to the line number this >> happens in ReorderBufferCommit(), i.e. the current (non-streaming) case. >> >> So instructions to reproduce the issue would be very helpful. > > Using: > > 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch > 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch > 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch > 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch > 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch > 0006-Add-support-for-streaming-to-built-in-replication-v2.patch > > As you expected the problem is the same with these new patches. > > I have now tested more, and seen that it not always fails. I guess that > it here fails 3 times out of 4. But the laptop I'm using at the moment > is old and slow -- it may well be a factor as we've seen before [1]. > > Attached is the bash that I put together. I tested with > NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails > often. 
This same program run with HEAD never seems to fail (I tried a > few dozen times). > Thanks. Unfortunately I still can't reproduce the issue. I even tried running it in valgrind, to see if there are some memory access issues (which should also slow it down significantly). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 12/24/2017 05:51 AM, Craig Ringer wrote: > On 23 December 2017 at 12:57, Tomas Vondra <tomas.von...@2ndquadrant.com > <mailto:tomas.von...@2ndquadrant.com>> wrote: > > Hi all, > > Attached is a patch series that implements two features to the logical > replication - ability to define a memory limit for the reorderbuffer > (responsible for building the decoded transactions), and ability to > stream large in-progress transactions (exceeding the memory limit). > > I'm submitting those two changes together, because one builds on the > other, and it's beneficial to discuss them together. > > > PART 1: adding logical_work_mem memory limit (0001) > --- > > Currently, limiting the amount of memory consumed by logical decoding is > tricky (or you might say impossible) for several reasons: > > * The value is hard-coded, so it's not quite possible to customize it. > > * The amount of decoded changes to keep in memory is restricted by > the number of changes. It's not very clear how this relates to memory > consumption, as the change size depends on table structure, etc. > > * The number is "per (sub)transaction", so a transaction with many > subtransactions may easily consume a significant amount of memory without > actually hitting the limit. > > > Also, even without subtransactions, we assemble a ReorderBufferTXN > per transaction. Since transactions usually occur concurrently, > systems with many concurrent txns can face lots of memory use. > I don't see how that could be a problem, considering the number of toplevel transactions is rather limited (to max_connections or so). > We can't exclude tables that won't actually be replicated at the reorder > buffering phase either. So txns use memory whether or not they do > anything interesting as far as a given logical decoding session is > concerned. Even if we'll throw all the data away we must buffer and > assemble it first so we can make that decision. Yep. 
> Because logical decoding considers snapshots and cid increments even > from other DBs (at least when the txn makes catalog changes) the memory > use can get BIG too. I was recently working with a system that had > accumulated 2GB of snapshots ... on each slot. With 7 slots, one for > each DB. > > So there's lots of room for difficulty with unpredictable memory use. > Yep. > So the patch does two things. Firstly, it introduces logical_work_mem, a > GUC restricting memory consumed by all transactions currently kept in > the reorder buffer > > > Does this consider the (currently high, IIRC) overhead of tracking > serialized changes? > Consider in what sense? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
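The logical_work_mem accounting described above can be sketched in a few lines of C. This is purely illustrative — the struct and function names below are invented here, not taken from the patch — but it shows the basic policy: sum the per-transaction memory, and once the total exceeds the limit, pick the largest transaction as the victim to serialize to disk (or stream out).

```c
#include <stddef.h>

/* Invented names, for illustration only: one entry per in-progress
 * transaction tracked by the reorder buffer. */
typedef struct TxnMem
{
    unsigned    xid;    /* transaction id */
    size_t      bytes;  /* decoded changes currently held in memory */
} TxnMem;

/* Return the index of the transaction to evict (serialize/stream) when
 * the total memory use exceeds the limit, or -1 if we are under the
 * limit and nothing needs to be evicted. */
static int
pick_txn_to_evict(const TxnMem *txns, int ntxns, size_t limit)
{
    size_t  total = 0;
    int     biggest = -1;

    for (int i = 0; i < ntxns; i++)
    {
        total += txns[i].bytes;
        if (biggest < 0 || txns[i].bytes > txns[biggest].bytes)
            biggest = i;
    }
    return (total > limit) ? biggest : -1;
}
```

Evicting the largest transaction is the natural greedy choice, since it frees the most memory per eviction — whether the patch uses exactly this policy is an assumption here.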
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 12/24/2017 10:00 AM, Erik Rijkers wrote: >>>>> >>>>> logical replication of 2 instances is OK but 3 and up fail with: >>>>> >>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: >>>>> "reorderbuffer.c", Line: 1773) >>>>> >>>>> I can cobble up a script but I hope you have enough from the assertion >>>>> to see what's going wrong... >>>> >>>> The assertion says that the iterator produces changes in order that >>>> does >>>> not correlate with LSN. But I have a hard time understanding how that >>>> could happen, particularly because according to the line number this >>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) >>>> case. >>>> >>>> So instructions to reproduce the issue would be very helpful. >>> >>> Using: >>> >>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch >>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch >>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch >>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch >>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch >>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch >>> >>> As you expected the problem is the same with these new patches. >>> >>> I have now tested more, and seen that it not always fails. I guess that >>> it here fails 3 times out of 4. But the laptop I'm using at the moment >>> is old and slow -- it may well be a factor as we've seen before [1]. >>> >>> Attached is the bash that I put together. I tested with >>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails >>> often. This same program run with HEAD never seems to fail (I tried a >>> few dozen times). >>> >> >> Thanks. Unfortunately I still can't reproduce the issue. I even tried >> running it in valgrind, to see if there are some memory access issues >> (which should also slow it down significantly). 
> > One wonders again if 2ndquadrant shouldn't invest in some old hardware ;) > Well, I've done tests on various machines, including some really slow ones, and I still haven't managed to reproduce the failures using your script. So I don't think that would really help. But I have reproduced it by using a custom stress test script. Turns out the asserts are overly strict - instead of Assert(prev_lsn < current_lsn); it should have been Assert(prev_lsn <= current_lsn); because some XLOG records may contain multiple rows (e.g. MULTI_INSERT). The attached v3 fixes this issue, and also a couple of other thinkos: 1) The AssertChangeLsnOrder assert check was somewhat broken. 2) We've been sending aborts for all subtransactions, even those not yet streamed. So downstream got confused and fell over because of an assert. 3) The streamed transactions were written to /tmp, using filenames based on the subscription OID and the XID of the toplevel transaction. That's fine, as long as there's just a single replica running - if there are more, the filenames will clash, causing really strange failures. So I moved the files to base/pgsql_tmp where regular temporary files are written. I'm not claiming this is perfect, perhaps we need to invent another location. FWIW I believe the relation sync cache is somewhat broken by the streaming. I thought resetting it would be good enough, but it's more complicated (and trickier) than that. I'm aware of it, and I'll look into that next - but probably not before 2018. 
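The `<` vs `<=` issue is easy to demonstrate in isolation. Below is a self-contained sketch — not the actual reorderbuffer code, with XLogRecPtr simplified to a plain integer — showing why the strict check trips on records like MULTI_INSERT, where several changes share one LSN, while the relaxed check accepts them:

```c
#include <stddef.h>

typedef unsigned long long XLogRecPtr;  /* simplified stand-in for the real LSN type */

/* Strict check, as in the buggy assert: requires LSNs to strictly increase. */
static int
change_order_strict_ok(const XLogRecPtr *lsns, int n)
{
    for (int i = 1; i < n; i++)
        if (!(lsns[i - 1] < lsns[i]))
            return 0;
    return 1;
}

/* Relaxed check, as in the fix: equal LSNs are fine, since one XLOG
 * record (e.g. a heap MULTI_INSERT) can produce multiple changes. */
static int
change_order_ok(const XLogRecPtr *lsns, int n)
{
    for (int i = 1; i < n; i++)
        if (!(lsns[i - 1] <= lsns[i]))
            return 0;
    return 1;
}
```

A change sequence like {100, 100, 120} — two rows from one record, then a later record — passes the relaxed check but fails the strict one, which is exactly the reported assertion failure.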
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v3.patch.gz Description: application/gzip 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v3.patch.gz Description: application/gzip 0003-Issue-individual-invalidations-with-wal_level-log-v3.patch.gz Description: application/gzip 0004-Extend-the-output-plugin-API-with-stream-methods-v3.patch.gz Description: application/gzip 0005-Implement-streaming-mode-in-ReorderBuffer-v3.patch.gz Description: application/gzip 0006-Add-support-for-streaming-to-built-in-replication-v3.patch.gz Description: application/gzip
Re: WIP: BRIN multi-range indexes
On 12/20/2017 03:37 AM, Mark Dilger wrote: > >> On Dec 19, 2017, at 5:16 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> >> wrote: >> >> >> >> On 12/19/2017 08:38 PM, Mark Dilger wrote: >>> >>>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> >>>> wrote: >>>> >>>> Hi, >>>> >>>> Apparently there was some minor breakage due to duplicate OIDs, so here >>>> is the patch series updated to current master. >>>> >>>> regards >>>> >>>> -- >>>> Tomas Vondra http://www.2ndQuadrant.com >>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >>>> <0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz> >>> >>> >>> After applying these four patches to my copy of master, the regression >>> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached. >>> >> >> D'oh! There was an incorrect OID referenced in pg_opclass, which was >> also used by the satisfies_hash_partition() function. Fixed patches >> attached. > > Your use of type ScanKey in src/backend/access/brin/brin.c is a bit > confusing. A > ScanKey is defined elsewhere as a pointer to ScanKeyData. When you define > an array of ScanKeys, you use pointer-to-pointer style: > > + ScanKey **keys, > + **nullkeys; > > But when you allocate space for the array, you don't treat it that way: > > + keys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts); > + nullkeys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts); > > But then again when you use nullkeys, you treat it as a two-dimensional array: > > + nullkeys[keyattno - 1][nnullkeys[keyattno - 1]] = key; > > and likewise when you allocate space within keys: > > +keys[keyattno - 1] = palloc0(sizeof(ScanKey) * scan->numberOfKeys); > > Could you please clarify? I have been awake a bit too long; hopefully, I am > not merely missing the obvious. 
> Yeah, that's wrong - it should be "sizeof(ScanKey *)" instead. It's harmless, though, because ScanKey itself is a pointer, so the size is the same. Attached is an updated version of the patch series, significantly reworking and improving the multi-minmax part (the rest of the patch is mostly as it was before). I've refactored and cleaned it up considerably, and I'm much happier with it - no doubt there's room for further improvement, but overall it's much better. I've also added proper sgml docs for this part, and support for more data types including variable-length ones (all integer types, numeric, float-based types including timestamps, uuid, and a couple of others). At the API level, I needed to add one extra support procedure that measures the distance between two values of the data type. This is needed because we only keep a limited number of intervals for each range, and once in a while we need to decide which of them to "merge" (and we simply merge the closest ones). I've passed the indexes through significant testing and fixed a couple of silly bugs / memory leaks. Let's see if there are more. Performance-wise, CREATE INDEX seems a bit slow - it's about an order of magnitude slower than regular BRIN. Some of that is expected (we simply need to do more stuff to maintain multiple ranges), but perhaps there's room for additional improvements - that's something I'd like to work on next. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz Description: application/gzip 0002-BRIN-bloom-indexes.patch.gz Description: application/gzip 0003-BRIN-multi-range-minmax-indexes.patch.gz Description: application/gzip 0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz Description: application/gzip
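To make the sizeof(ScanKey) vs. sizeof(ScanKey *) point above concrete, here is a tiny standalone sketch, with a dummy ScanKeyData standing in for the real one from access/skey.h. Because ScanKey is itself a pointer type, both sizeof expressions yield the same value — which is why the original palloc0 calls were harmless despite naming the "wrong" element type:

```c
#include <stdlib.h>

/* Dummy stand-in for the real ScanKeyData from access/skey.h. */
typedef struct ScanKeyData
{
    int sk_attno;
} ScanKeyData;

typedef ScanKeyData *ScanKey;   /* ScanKey is a pointer, as in PostgreSQL */

/* Allocate the per-attribute array of ScanKey arrays, using the element
 * type that matches the declaration (ScanKey *), i.e. the corrected form.
 * calloc mirrors palloc0: every per-attribute slot starts out NULL. */
static ScanKey **
alloc_per_attr_keys(int natts)
{
    return calloc(natts, sizeof(ScanKey *));
}
```

With sizeof(ScanKey) the code allocated the same number of bytes, so the bug was cosmetic rather than functional — but the corrected form documents the intended two-dimensional layout.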
Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means
On 01/05/2018 05:25 AM, Stephen Frost wrote: > Tomas, > > * Michael Paquier (michael.paqu...@gmail.com) wrote: >> On Sun, Nov 19, 2017 at 6:40 AM, Tomas Vondra >> <tomas.von...@2ndquadrant.com> wrote: >>> Thanks for looking into this. I agree your patch is more consistent and >>> generally cleaner. >> >> This is classified as a bug fix, and is marked as waiting on author. I >> am moving it to next CF as work continues. > > This is still in waiting-on-author state; it'd be great to get your > thoughts on where this patch is and what the next steps are to move > it forward. If you need additional feedback or there needs to be > more discussion (perhaps with Tom) then maybe this should be in > needs-review instead, but it'd be great if you could review the > discussion and patch again and at least provide an update on it (last > update from you was in November). > Hmmm, I'm not sure, TBH. As I already mentioned, Tom's updated patch is better than what I posted initially, and I agree with his approach to the remaining issues he pointed out. But I somehow assumed that he's already looking into that. Tom, do you plan to look into this patch soon, or should I? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means
On 01/08/2018 08:39 PM, Tom Lane wrote: > Tomas Vondra <tomas.von...@2ndquadrant.com> writes: >> As I already mentioned, Tom's updated patch is better than what I >> posted initially, and I agree with his approach to the remaining >> issues he pointed out. But I somehow assumed that he's already >> looking into that. Tom, do you plan to look into this patch soon, >> or should I? > > No, I thought you were going to run with those ideas. I have a lot > of other stuff on my plate ... > OK, will do. -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PROPOSAL] Shared Ispell dictionaries
Hi Arthur, Sorry for the delay, I somehow missed this thread ... On 12/27/2017 10:20 AM, Arthur Zakirov wrote: > On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote: >> >> Tomas had some workable patches related to this topic >> > > Tomas, have you planned to propose it? > I believe Pavel was referring to this extension: https://github.com/tvondra/shared_ispell I wasn't going to submit that as an in-core solution, but I'm happy you're making improvements in that direction. I'll take a look at your patch shortly. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Built-in connection pooling
Hi Konstantin, On 01/18/2018 03:48 PM, Konstantin Knizhnik wrote: > On 17.01.2018 19:09, Konstantin Knizhnik wrote: >> Hi hackers, >> >> ... > I haven't looked at the code yet, but after reading your message I have a simple question - how is this going to work with SSL? If you're only passing a file descriptor, that does not seem to be sufficient for the backends to do crypto (that requires the SSL stuff from Port). Maybe I'm missing something and it already works, though ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
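For context, passing a file descriptor between processes is typically done with SCM_RIGHTS ancillary data over a Unix-domain socket — presumably what pg_send_sock/pg_recv_sock boil down to, though this is a generic sketch, not the patch's actual implementation. The key point for the SSL question: only the kernel-level descriptor travels; userspace state such as the OpenSSL session hanging off Port stays behind in the sending process.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a Unix-domain socket channel. */
static int
send_fd(int chan, int fd)
{
    struct msghdr   msg = {0};
    char            byte = 0;
    struct iovec    iov = { .iov_base = &byte, .iov_len = 1 };
    char            cmsgbuf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr *cmsg;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf;
    msg.msg_controllen = sizeof(cmsgbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;   /* the kernel duplicates the fd for the peer */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(chan, &msg, 0) == 1) ? 0 : -1;
}

/* Receive a file descriptor sent by send_fd(); returns -1 on failure. */
static int
recv_fd(int chan)
{
    struct msghdr   msg = {0};
    char            byte;
    struct iovec    iov = { .iov_base = &byte, .iov_len = 1 };
    char            cmsgbuf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr *cmsg;
    int             fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf;
    msg.msg_controllen = sizeof(cmsgbuf);

    if (recvmsg(chan, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiving process gets a working descriptor for the same open socket, but it has no way to recover a TLS session that lives in the sender's memory — hence the question about Port above.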
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Attached is v5, fixing a silly bug in part 0006, causing segfault when creating a subscription. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v5.patch.gz Description: application/gzip 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v5.patch.gz Description: application/gzip 0003-Issue-individual-invalidations-with-wal_level-log-v5.patch.gz Description: application/gzip 0004-Extend-the-output-plugin-API-with-stream-methods-v5.patch.gz Description: application/gzip 0005-Implement-streaming-mode-in-ReorderBuffer-v5.patch.gz Description: application/gzip 0006-Add-support-for-streaming-to-built-in-replication-v5.patch.gz Description: application/gzip 0007-Track-statistics-for-streaming-spilling-v5.patch.gz Description: application/gzip 0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v5.patch.gz Description: application/gzip 0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v5.patch.gz Description: application/gzip
Re: Built-in connection pooling
On 01/19/2018 05:53 PM, Konstantin Knizhnik wrote: > > > On 19.01.2018 19:28, Pavel Stehule wrote: >> >> >> When I've been thinking about adding a built-in connection >> pool, my >> rough plan was mostly "bgworker doing something like >> pgbouncer" (that >> is, listening on a separate port and proxying everything to >> regular >> backends). Obviously, that has pros and cons, and probably >> would not >> serve the threading use case well. >> >> >> And we will get the same problem as with pgbouncer: one process >> will not be able to handle all connections... >> Certainly it is possible to start several such scheduling >> bgworkers... But in any case it is more efficient to multiplex >> sessions in the backends themselves. >> >> >> pgbouncer holds the client connection the whole time. When we implement the >> listeners, then all work can be done by worker processes, not by listeners. >> > > Sorry, I do not understand your point. > In my case pgbench establishes a connection to the pgbouncer only once at > the beginning of the test. > And pgbouncer spends all time in context switches (CPU usage is 100% and > it is mostly in kernel space: top of profile are kernel functions). > The same picture will be if instead of pgbouncer you will do such > scheduling in one bgworker. > Even modern systems are not able to perform more than several > hundred thousand context switches per second. > So with a single multiplexing thread or process you cannot get more > than 100k, while on a powerful NUMA system it is possible to achieve > millions of TPS. > It is illustrated by the results I have sent in the previous mail: by > spawning 10 instances of pgbouncer I was able to achieve 7 times higher > speed. > AFAICS making pgbouncer multi-threaded would not be hugely complicated. A simple solution would be a fixed number of worker threads, and client connections randomly assigned to them. But this generally is not a common bottleneck in practical workloads (of course, YMMV). 
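The "fixed number of worker threads, client connections assigned to them" idea boils down to a trivial dispatch step — sketched here as round-robin rather than random, with invented names; each worker would then run its own event loop over the sockets assigned to it:

```c
#include <stdatomic.h>

#define NWORKERS 4              /* fixed pool of worker threads (illustrative value) */

static _Atomic unsigned next_conn;

/* Dispatch a newly accepted connection to a worker.  Round-robin keeps
 * the workers roughly evenly loaded without any shared queue or lock;
 * the atomic counter makes this safe to call from the accepting thread. */
static unsigned
pick_worker(void)
{
    return atomic_fetch_add(&next_conn, 1) % NWORKERS;
}
```

With each worker owning its own set of sockets, the single-process context-switch ceiling discussed above is spread over NWORKERS cores instead of one.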
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Built-in connection pooling
On 01/19/2018 10:52 AM, Konstantin Knizhnik wrote: > > > On 18.01.2018 18:02, Tomas Vondra wrote: >> Hi Konstantin, >> >> On 01/18/2018 03:48 PM, Konstantin Knizhnik wrote: >>> On 17.01.2018 19:09, Konstantin Knizhnik wrote: >>>> Hi hackers, >>>> >>>> ... >> I haven't looked at the code yet, but after reading your message I have >> a simple question - how is this going to work with SSL? If you're only >> passing a file descriptor, that does not seem to be sufficient for the >> backends to do crypto (that requires the SSL stuff from Port). >> >> Maybe I'm missing something and it already works, though ... >> >> >> regards >> > Ooops, I missed this aspect with SSL. Thank you. > New version of the patch which correctly maintains session context is > attached. > Now each session has its own allocator which should be used instead > of TopMemoryAllocator. SSL connections work now. > OK. I've looked at the code, but I have a rather hard time understanding it, because there are pretty much no comments explaining the intent of the added code :-( I strongly suggest improving that, to help reviewers. The questions I'm asking myself are mostly these: 1) When assigning a backend, we first try to get one from a pool, which happens right at the beginning of BackendStartup. If we find a usable backend, we send the info to the backend (pg_send_sock/pg_recv_sock). But AFAICS this only happens at connection time, right? But in your initial message you say "Rescheduling is done at transaction level," which in my understanding means "transaction pooling". So, how does that part work? 2) How does this deal with backends for different databases? I don't see any checks that the requested database matches the backend database (nor any code switching the backend from one db to another - which would be very tricky, I think). 3) Is there any sort of shrinking the pools? 
I mean, if the backend is idle for a certain period of time (or when we need backends for other databases), does it get closed automatically? Furthermore, I'm rather confused about the meaning of session_pool_size. I mean, that GUC determines the number of backends in the pool, it has nothing to do with sessions per se, right? Which would mean it's a bit misleading to name it "session_..." (particularly if the pooling happens at transaction level, not session level - which is question #1). When I've been thinking about adding a built-in connection pool, my rough plan was mostly "bgworker doing something like pgbouncer" (that is, listening on a separate port and proxying everything to regular backends). Obviously, that has pros and cons, and probably would not serve the threading use case well. But it would have some features that I find valuable - for example, it's trivial to decide which connection requests may or may not be served from a pool (by connecting to the main port or the pool port). That is not to say the bgworker approach is better than what you're proposing, but I wonder if that would be possible with your approach. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Built-in connection pooling
On 01/19/2018 06:07 PM, Konstantin Knizhnik wrote: > > ... > >>>> 3) Is there any sort of shrinking the pools? I mean, if the backend is >>>> idle for a certain period of time (or when we need backends for other >>>> databases), does it get closed automatically? >>>> >>> When client is disconnected, client session is closed. But the backend is not >>> terminated even if there are no more sessions at this backend. >>> It was done intentionally, to avoid permanent spawning of new processes >>> when there is one or few clients which frequently connect/disconnect to >>> the database. >>> >> Sure, but it means a short peak will exhaust the backends indefinitely. >> That's acceptable for a PoC, but I think needs to be fixed eventually. >> > Sorry, I do not understand it. > You specify size of backends pool which will serve client sessions. > Size of this pool is chosen to provide the best performance at the > particular system and workload. > So number of backends will never exceed this optimal value even in case > of "short peak". > From my point of view terminating backends when there are no active > sessions is wrong idea in any case, it was not temporary decision just > for PoC. > That is probably true when there is just a single pool (for one database/user). But when there are multiple such pools, it forces you to keep the sum(pool_size) below max_connections. Which seems strange. I do think the ability to evict backends after some timeout, or when there is pressure in other pools (different user/database) is rather useful. >> >> Well, I haven't said it has to be single-threaded like pgbouncer. I >> don't see why the bgworker could not use multiple threads internally (of >> course, it'd need to be careful not to mess with the stuff that is not >> thread-safe). >> > > Certainly architecture with N multiple scheduling bgworkers and M > executors (backends) may be more flexible > than solution when scheduling is done in executor itself. But we will > have to pay extra cost for redirection. 
> > I am not sure that finally it will allow to reach better performance. > More flexible solution in many cases doesn't mean more efficient solution. > Sure, I wasn't really suggesting it's a clear win. I was responding to your argument that pgbouncer in some cases reaches 100% CPU utilization - that can be mitigated to a large extent by adding threads. Of course, the cost of an extra level of indirection is not zero. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Built-in connection pooling
On 01/19/2018 07:35 PM, Claudio Freire wrote: > > > On Fri, Jan 19, 2018 at 2:22 PM, Tomas Vondra > <tomas.von...@2ndquadrant.com <mailto:tomas.von...@2ndquadrant.com>> wrote: > > > > On 01/19/2018 06:13 PM, Claudio Freire wrote: > > > > > > On Fri, Jan 19, 2018 at 2:07 PM, Konstantin Knizhnik > > <k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru> > <mailto:k.knizh...@postgrespro.ru > <mailto:k.knizh...@postgrespro.ru>>> wrote: > > > > > > > > > > Well, I haven't said it has to be single-threaded like > pgbouncer. I > > don't see why the bgworker could not use multiple threads > > internally (of > > course, it'd need to be not to mess the stuff that is not > > thread-safe). > > > > > > Certainly architecture with N multiple scheduling bgworkers and M > > executors (backends) may be more flexible > > than solution when scheduling is done in executor itself. But we > > will have to pay extra cost for redirection. > > I am not sure that finally it will allow to reach better > performance. > > More flexible solution in many cases doesn't mean more efficient > > solution. > > > > > > I think you can take the best of both worlds. > > > > You can take your approach of passing around fds, and build a "load > > balancing protocol" in a bgworker. > > > > The postmaster sends the socket to the bgworker, the bgworker > waits for > > a command as pgbouncer does, but instead of proxying everything, when > > commands arrive, it passes the socket to a backend to handle. > > > > That way, the bgworker can do what pgbouncer does, handle different > > pooling modes, match backends to databases, etc, but it doesn't > have to > > proxy all data, it just delegates handling of a command to a backend, > > and forgets about that socket. > > > > Sounds like it could work. > > > > How could it do all that without actually processing all the data? For > example, how could it determine the statement/transaction boundaries? 
> > > It only needs to determine statement/transaction start. > > After that, it hands off the connection to a backend, and the > backend determines when to give it back. > > So instead of processing all the data, it only processes a tiny part of it. > How exactly would the backend "give back" the connection? The only way for the backend and pgbouncer to communicate is by embedding information in the data stream. Which means pgbouncer still has to parse it. Furthermore, those are not the only bits of information pgbouncer may need. For example, if pgbouncer gets improved to handle prepared statements (which is likely) it'd need to handle PARSE/BIND/EXECUTE. And it already needs to handle SET parameters. And so on. In any case, this discussion is somewhat off topic in this thread, so let's not hijack it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
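On the "pgbouncer still has to parse it" point: after startup, every message in the frontend/backend protocol is framed as a 1-byte type tag plus a 4-byte big-endian length that includes itself but not the tag. A pooler that wants to hand off at statement boundaries must decode at least this framing (e.g. to spot 'Q' for a simple query, or 'P'/'B'/'E' for parse/bind/execute). A minimal sketch of that boundary detection:

```c
#include <stddef.h>
#include <stdint.h>

/* Return the total wire size (tag byte + payload) of the protocol
 * message starting at buf, or -1 if fewer than the 5 header bytes are
 * available.  This is the minimum a proxy must decode to know where
 * one message ends and the next begins. */
static long
pg_frame_size(const uint8_t *buf, size_t avail)
{
    uint32_t len;

    if (avail < 5)
        return -1;              /* need more bytes for tag + length */

    /* 4-byte big-endian length; it counts itself but not the tag byte */
    len = ((uint32_t) buf[1] << 24) | ((uint32_t) buf[2] << 16) |
          ((uint32_t) buf[3] << 8) | buf[4];

    return 1 + (long) len;
}
```

Even a "hand off and forget" design needs this much parsing just to know when the boundary arrives — anything smarter (prepared statements, SET tracking) requires looking inside the payloads as well, which is the point made above.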