Re: Optimizing PostgreSQL with LLVM's PGO+LTO

2023-01-27 Thread Komяpa
Hi,

We implemented LTO in PostGIS's build system a couple of releases ago. It
definitely gives +10% on heavy math. Unfortunately, we did not manage to
get it running under FreeBSD because of default system linker issues, so we
had to hide it behind a --with-lto switch, which we recommend to everyone.

I have not experimented with Postgres itself, but there are definitely traces
of numerous LTO-enabled private builds on the web.

On Fri, Jan 27, 2023 at 8:05 PM João Paulo Labegalini de Carvalho <
jaopaul...@gmail.com> wrote:

> Hi all,
>
> I am investigating the benefits of different profile-guided optimizations
> (PGO) and link-time optimizations (LTO) versus binary optimizers (e.g.
> BOLT) for applications such as PostgreSQL.
>
> I am facing issues when applying LTO to PostgreSQL as the produced binary
> seems broken (the server dies quickly after it has started). This is
> definitely a compiler bug, but I was wondering if anyone here has
> experimented with LTO for PostgreSQL.
>
> Thanks,
>
> --
> João Paulo L. de Carvalho
> Ph.D Computer Science |  IC-UNICAMP | Campinas , SP - Brazil
> Postdoctoral Research Fellow | University of Alberta | Edmonton, AB -
> Canada
> joao.carva...@ic.unicamp.br
> joao.carva...@ualberta.ca
>


Re: [PATCH] reduce page overlap of GiST indexes built using sorted method

2022-01-18 Thread Komяpa
Hello hackers,

On Tue, Jan 18, 2022 at 11:26 PM sergei sh.  wrote:

> Hi,
>
> I've addressed Andrey Borodin's concerns about v2 of this patch by
> Aliaksandr
> Kalenik in attached version.
>

[snip]

This patchset got some attention in the PostGIS development channel, as it
is important for actually enabling the fast GiST build there for end users.
The reviews are positive: it saves build time and performs as well as the
original non-sorting build on the tested workloads.

Test by Giuseppe Broccolo:
https://lists.osgeo.org/pipermail/postgis-devel/2022-January/029330.html

Test by Regina Obe:
https://lists.osgeo.org/pipermail/postgis-devel/2022-January/029335.html

I believe this patch is an important addition to Postgres 15 and will make
a lot of GIS users happier.

Thanks,
Darafei.


Re: BUG #17302: gist index prevents insertion of some data

2021-12-01 Thread Komяpa
On Thu, Dec 2, 2021 at 1:14 AM Tom Lane  wrote:

> Alexander Korotkov  writes:
> > I think losing precision in the gist penalty is generally OK.  Thus,
> > it shouldn't be a problem to round a very small value as zero.
>
> Check.
>
> > Probably, we could even tolerate overflow in the gist penalty.
>
> As long as overflow -> infinity, yeah I think so.  Seems like it
> was a mistake to insert the overflow-testing functions in this code
> at all, and we should simplify it down to plain C addition/subtraction/
> multiplication.
>

The underflow should never throw an interrupting exception, even in
plain SQL-level calculations.

The code implementing this was added in error through a series of misunderstandings
and gets in the way of simple things too often. I dug into the history here:


https://www.postgresql.org/message-id/CAC8Q8t%2BXJH68WB%2BsKN0BV0uGc3ZjA2DtbQuoJ5EhB4JAcS0C%2BQ%40mail.gmail.com





>
> regards, tom lane
>
>
>


Re: Couldn't we mark enum_in() as immutable?

2021-09-28 Thread Komяpa
PostGIS has a very similar thing: ST_Transform is marked as immutable but
does depend on the contents of the spatial_ref_sys table. Although that table
is shipped with the extension and almost never changes incompatibly, there are
scenarios where it breaks: dump/restore plus an index or generated column can
fail the import if data gets fed into the immutable function before the
contents of spatial_ref_sys are restored. I'd love this issue to be addressed
at the core level, as the benefits of having it immutable outweigh even this
unfortunate issue.
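
For illustration, a sketch of the failing restore scenario (table and column
names are made up for this example):

create table places (
    geom geometry(Point, 4326),
    geom_3857 geometry(Point, 3857)
        generated always as (ST_Transform(geom, 3857)) stored  -- needs spatial_ref_sys rows
);
-- On restore, COPY into places evaluates the "immutable" ST_Transform()
-- before the spatial_ref_sys contents are loaded, and the import errors out.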



On Tue, Sep 28, 2021 at 6:04 PM Tom Lane  wrote:

> Andrew Dunstan  writes:
> > On 9/27/21 5:54 PM, Tom Lane wrote:
> >> Currently enum_in() is marked as stable, on the reasonable grounds
> >> that it depends on system catalog contents.  However, after the
> >> discussion at [1] I'm wondering why it wouldn't be perfectly safe,
> >> and useful, to mark it as immutable.
>
> > The value returned depends on the label values in pg_enum, so if someone
> > decided to rename a label that would affect it, no? Same for enum_out.
>
> Hm.  I'd thought about this to the extent of considering that if we
> rename label A to B, then stored values of "A" would now print as "B",
> and const-folding "A" earlier would track that which seems OK.
> But you're right that then introducing a new definition of "A"
> (via ADD or RENAME) would make things messy.
>
> >> Moreover, if it's *not* good enough, then our existing practice of
> >> folding enum literals to OID constants on-sight must be unsafe too.
>
> I'm still a little troubled by this angle.  However, we've gotten away
> with far worse instability for datetime literals, so maybe it's not a
> problem in practice.
>
> regards, tom lane
>
>
>


Re: Questions about support function and abbreviate

2021-06-12 Thread Komяpa
Hello,

the abbrev_converter is applied whenever it is defined. The values are
first sorted using the abbreviated comparator on the shortened version,
and if there is a tie the system asks the real, full comparator to resolve
it.

This article seems to be rather comprehensive:
https://brandur.org/sortsupport

On Sat, Jun 12, 2021 at 9:51 AM Han Wang  wrote:

> Hi all,
>
> I am trying to implement a sort support function for geometry data types
> in PostGIS with the new feature `SortSupport`. However, I have a question
> about this.
>
> I think it is hardly possible to apply a sort support function to a complex
> data type without the `abbrev_converter` to simplify the data structure into
> a single `Datum`. However, I do not know how the system determines when to
> apply the converter.
>
> I appreciate any answers or suggestions. I am looking forward to hearing
> from you.
>
> Best regards,
> Han
>


-- 
Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/


Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-06 Thread Komяpa
I have similar problems, and what is really needed is a way to get a file
from the client side into a server-side object that can be dealt with later.
The most popular way is COPY, and it is built into the psql tool. In general
it supports the \copy wrapper, and there is COPY FROM STDIN. However, it is not
available for files that do not follow a CSV-like structure. I had
to use it for XML and huge JSON files before, and it's always `sed` before
the import and a replace() after.

pg_read_file does not help on cloud and managed installs of Postgres here.

What I would prefer is some new COPY mode like RAW that would just push
whatever it gets on stdin/input into a cell on the server side. This
way it could be proxied by psql, utilize the existing infrastructure for
passing streams, and be used in shell scripting.
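
To make the idea concrete, a hypothetical sketch (the RAW format below does
not exist; pg_read_file() does, but requires server-side file access):

create table raw_import (doc text);
-- imagined client-side syntax, streamed through psql just like \copy:
-- \copy raw_import (doc) from 'huge.json' (format raw)
-- today's server-side workaround, where file access is available:
insert into raw_import select pg_read_file('/tmp/huge.json');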



On Thu, May 6, 2021 at 9:14 AM Joel Jacobson  wrote:

> On Wed, May 5, 2021, at 20:45, Tom Lane wrote:
>
> "Joel Jacobson"  writes:
> > I think you misunderstood the problem.
> > I don't want the entire file to be considered a single value.
> > I want each line to become its own row, just a row with a single column.
>
> > So I actually think COPY seems like a perfect match for the job,
> > since it does precisely that, except there is no delimiter in this case.
>
> Well, there's more to it than just the column delimiter.
>
> * What about \N being converted to NULL?
> * What about \. being treated as EOF?
> * Do you want to turn off the special behavior of backslash (ESCAPE)
>   altogether?
> * What about newline conversions (\r\n being seen as just \n, etc)?
>
> I'm inclined to think that "use pg_read_file and then split at newlines"
> might be a saner answer than delving into all these fine points.
> Not least because people yell when you add cycles to the COPY
> inner loops.
>
>
> Thanks for providing strong arguments why the COPY approach is a dead-end,
> I agree.
>
> However, as demonstrated in my previous email, using
>
>string_to_table(pg_read_file( filename ), E'\n')
>
> has its performance as well as max size issues.
>
> Maybe these two problems could be solved by combining the two functions
> into one?
>
>file_to_table ( filename text, delimiter text [, null_string text ] ) →
> setof text
>
> I'm thinking thanks to returning "setof text", such a function could read
> a stream,
> and return a line as soon as a delimiter is encountered, not having to keep
> the entire file in memory at any time.
>
> /Joel
>


-- 
Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/


Re: REINDEX backend filtering

2021-02-24 Thread Komяpa
Hello,

The PostGIS project has needed this from time to time. It would be great if
reindexing by opclass could be made possible.

We changed the semantics of the btree opclass at least twice (in 2.4 and 3.0),
fixed some ND mixed-dimension index semantics in 3.0, and fixed the hash index
on 32-bit architectures in 3.0.
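
For example, today the options after such an on-disk semantics change are
all-or-nothing, while the wished-for variant would filter by opclass (the last
line is hypothetical syntax, mirroring the COLLATION filter proposed below):

reindex index my_table_geom_idx;   -- per index, if you know which ones are affected
reindex database mydb;             -- everything, very expensive
-- wished for, does not exist:
-- reindex (opclass 'gist_geometry_ops_nd') database mydb;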

On Thu, Dec 3, 2020 at 12:32 PM Julien Rouhaud  wrote:

> Hello,
>
> Now that we have the infrastructure to track indexes that might be
> corrupted
> due to changes in collation libraries, I think it would be a good idea to
> offer
> an easy way for users to reindex all indexes that might be corrupted.
>
> I'm attaching a POC patch as a discussion basis.  It implements a new
> "COLLATION" option to reindex, with "not_current" being the only accepted
> value.  Note that I didn't spent too much efforts on the grammar part yet.
>
> So for instance you can do:
>
> REINDEX (COLLATION 'not_current') DATABASE mydb;
>
> The filter is also implemented so that you could cumulate multiple
> filters, so
> it could be easy to add more filtering, for instance:
>
> REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb;
>
> to only rebuild indexes depending on outdated libc collations, or
>
> REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb;
>
> to only rebuild indexes depending on a specific version of libc.
>


-- 
Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/


Re: Tightening up allowed custom GUC names

2021-02-11 Thread Komяpa
On Thu, Feb 11, 2021 at 21:33 Tom Lane  wrote:

> Noah Misch  writes:
> > On Tue, Feb 09, 2021 at 05:34:37PM -0500, Tom Lane wrote:
> >> * A case could be made for tightening things up a lot more, and not
> >> allowing anything that doesn't look like an identifier.  I'm not
> >> pushing for that, as it seems more likely to break existing
> >> applications than the narrow restriction proposed here.  But I could
> >> live with it if people prefer that way.
>
> > I'd prefer that.  Characters like backslash, space, and double quote have
> > significant potential to reveal bugs, while having negligible application
> > beyond revealing bugs.
>
> Any other opinions here?  I'm hesitant to make such a change on the
> basis of just one vote.
>

+1 for the change. I have not seen usage of = and - in the wild in GUC
names, but I can see the harm of misinterpreting them.




> regards, tom lane
>
>
>


Re: Freeze the inserted tuples during CTAS?

2021-01-27 Thread Komяpa
Hi,

I confirm that my analytic workflows often do a CTAS and then VACUUM the
relation right after, before index creation, to mark everything as
all-visible so that index-only scans work. Freezing and marking as visible will help.
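
A sketch of that manual workflow, with an illustrative table:

create table report as
    select g as id, g % 10 as bucket from generate_series(1, 1000000) g;
vacuum (freeze) report;        -- set all-visible/all-frozen while the pages are still hot
create index on report (id);
-- index-only scans on report(id) can now skip heap visibility checks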

On Wed, Jan 27, 2021 at 12:29 PM Paul Guo  wrote:

> Here is the simple patch,
>
> diff --git a/src/backend/commands/createas.c
> b/src/backend/commands/createas.c
> index dce882012e..0391699423 100644
> --- a/src/backend/commands/createas.c
> +++ b/src/backend/commands/createas.c
> @@ -552,7 +552,7 @@ intorel_startup(DestReceiver *self, int operation,
> TupleDesc typeinfo)
> myState->rel = intoRelationDesc;
> myState->reladdr = intoRelationAddr;
> myState->output_cid = GetCurrentCommandId(true);
> -   myState->ti_options = TABLE_INSERT_SKIP_FSM;
> +   myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
>
> MatView code already does this and COPY does this if specified. I’m not
> sure how
> does the community think about this. Actually personally I expect more
> about the
> all-visible setting due to TABLE_INSERT_FROZEN since I could easier use
> index only scan
> if we create an index and table use CTAS, else people have to use index
> only scan
> after vacuum. If people do not expect freeze could we at least introduce a
> option to
> specify the visibility during inserting?
>
> Regards,
> Paul



-- 
Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/


Re: CTAS command tags

2021-01-22 Thread Komяpa
Having the row count right away is very useful for CTAS in analytical and GIS
usage scenarios.

On Fri, Jan 22, 2021 at 16:14 Vik Fearing  wrote:

> I was recently surprised by the following inconsistencies in returned
> command tags for CTAS:
>
>
> postgres=# create table a as select 123;
> SELECT 1
>
> postgres=# create table b as select 123 with data;
> SELECT 1
>
> postgres=# create table c as select 123 with no data;
> CREATE TABLE AS
>
>
> Shouldn't we make the first two tags (which are likely the same code
> path; I haven't looked) the same as the third?  I can look into writing
> the patch if desired.
> --
> Vik Fearing
>
>
>


Re: Yet another fast GiST build

2020-09-09 Thread Komяpa
On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:

> On 09/09/2020 13:28, Andrey M. Borodin wrote:
> > Thanks Darafei!
> >
> >> On 9 Sep 2020, at 12:05, Darafei Komяpa Praliaskouski
> >> wrote:
> >>
> >>> How does the 'sortsupport' routine interact with
> >>> 'compress'/'decompress'? Which representation is passed to the
> >>> comparator routine: the original value from the table, the
> >>> compressed representation, or the decompressed representation? Do
> >>> the comparetup_index_btree() and readtup_index() routines agree
> >>> with that?
> >>
> 
>
> Come to think of it, the point z-order comparator could benefit a lot
> from key abbreviation, too. You could do the point -> zorder conversion
> in the abbreviation routine.
>

That's how it works in PostGIS, except that we moved to the more
efficient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171


Re: Yet another fast GiST build

2020-09-09 Thread Komяpa
Hi,


On Wed, Sep 9, 2020 at 9:43 AM Andrey M. Borodin 
wrote:

>
>
> > On 9 Sep 2020, at 00:05, Heikki Linnakangas 
> > wrote:
> >
> > I've been reviewing the patch today. The biggest changes I've made have
> been in restructuring the code in gistbuild.c for readability, but there
> are a bunch of smaller changes throughout. Attached is what I've got so
> far, squashed into one patch.
> Thanks!
>
> > I'm continuing to review it, but a couple of questions so far:
> >
> > In the gistBuildCallback(), you're skipping the tuple if 'tupleIsAlive
> == false'. That seems fishy, surely we need to index recently-dead tuples,
> too. The normal index build path isn't skipping them either.
> That's an oversight.
> >
> > How does the 'sortsupport' routine interact with
> 'compress'/'decompress'? Which representation is passed to the comparator
> routine: the original value from the table, the compressed representation,
> or the decompressed representation? Do the comparetup_index_btree() and
> readtup_index() routines agree with that?
>
> Currently we pass compressed values, which seems not very good.
> But there was a request from PostGIS maintainers to pass values before
> decompression.
> Darafei, please, correct me if I'm wrong. Also can you please provide link
> on PostGIS B-tree sorting functions?
>

We were expecting to reuse the btree opclass for this. That way the
btree_gist extension would become a lot thinner. :)

The core routine for the current sorting implementation is a Hilbert curve,
which is based on the 2D center of a box and used for the abbreviated sort:
https://github.com/postgis/postgis/blob/2a7ebd0111b02aed3aa24752aad0ba89aef5d431/liblwgeom/gbox.c#L893


All the btree functions are wrappers around gserialized_cmp, which just adds
a bunch of tiebreakers that don't matter in practice:
https://github.com/postgis/postgis/blob/2a7ebd0111b02aed3aa24752aad0ba89aef5d431/liblwgeom/gserialized.c#L313

The base representation for the index-compressed datatype is GIDX, which is
also a box. We can make it work on top of that instead of the original
representation.
Unfortunately, there is no such thing as a "decompressed representation",
as the compression is lossy.


Re: CUBE_MAX_DIM

2020-06-25 Thread Komяpa
Hello,

The problem with higher-dimension cubes is that, starting with a
dimensionality of ~52, the "distance" metrics in a 64-bit float have less than
a single bit per dimension in the mantissa, making cubes indistinguishable.
Developers of facial recognition software had a chat about that on the Russian
Postgres Telegram group https://t.me/pgsql. Their problem was that they had
128-dimensional points and recompiled Postgres - the distances weren't helpful,
and GiST KNN severely degraded to almost full scans. They had to reduce the
number of facial features in order to make KNN search work.
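
As a rough back-of-the-envelope illustration of that mantissa budget (my
sketch, not cube code): an IEEE 754 double carries 52 explicit mantissa bits,
so a distance that has to discriminate contributions from d dimensions has
roughly 52/d bits per dimension:

select d, round(52.0 / d, 2) as mantissa_bits_per_dimension
from (values (8), (52), (128)) v(d);
-- at d = 128 that is ~0.41 bits per dimension, so distances stop
-- discriminating and KNN degrades towards full scans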

Floating-point overflow isn't that much of a risk per se; in the worst
case it becomes Infinity or 0, which are usually acceptable in
those contexts.

While mathematically possible, there are implementation issues with
higher-dimension cubes. I'm OK with raising the limit if such nuances get a
mention in the docs.

On Thu, Jun 25, 2020 at 1:01 PM Devrim Gündüz  wrote:

>
> Hi,
>
> Someone contacted me about increasing CUBE_MAX_DIM
> in contrib/cube/cubedata.h (in the community RPMs). The current value
> is 100 with the following comment:
>
> * This limit is pretty arbitrary, but don't make it so large that you
> * risk overflow in sizing calculations.
>
>
> They said they use 500, and never had a problem. I never added such
> patches to the RPMS, and will not -- but wanted to ask if we can safely
> increase it in upstream?
>
> Regards,
>
> --
> Devrim Gündüz
> Open Source Solution Architect, Red Hat Certified Engineer
> Twitter: @DevrimGunduz , @DevrimGunduzTR
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: exp() versus the POSIX standard

2020-06-11 Thread Komяpa
Hi,

On Fri, Jun 12, 2020 at 4:25 AM Tom Lane  wrote:

> =?UTF-8?Q?Darafei_=22Kom=D1=8Fpa=22_Praliaskouski?= 
> writes:
> > I've had the same issue with multiplying two tiny numbers. Select
> > 2e-300::float * 2e-300::float gives an underflow, and it is not a wanted
> > thing. This looks like handmade implementation of IEEE754's underflow
> > exception that should be an optional return flag in addition to well
> > defined number, but became a stop-the-world exception instead.
>
> Solving that problem is very far outside the scope of what I'm interested
> in here.


They're essentially the same issue.

Generally, it has existed since the very beginning of the git history and seems
to be the result of a series of misunderstandings.

Initially, somewhere around 1996, someone thought that a double only goes
from DBL_MIN to DBL_MAX, just like INT_MIN and INT_MAX, while that isn't
exactly what those constants mean:
https://github.com/postgres/postgres/blame/8fecd4febf8357f3cc20383ed29ced484877d5ac/src/backend/utils/adt/float.c#L525

That logic seems sane in the float4 case (where the computation is done in
64 bits and then checked to fit into 32 bits without overflow).
It feels like the float8 case got there just by copy-paste, but maybe it
was also there to avoid handling NaNs - it's not in the comparison functions yet.

Later, in 2007, Bruce Momjian removed the limitation on infinities but kept
the general structure - now subnormals are accepted, as DBL_MIN is no
longer used, but there is still a check for whether underflow occurred.
https://github.com/postgres/postgres/commit/f9ac414c35ea084ff70c564ab2c32adb06d5296f#diff-7068290137a01263be83308699042f1fR58



> I think that we'd probably regret it if we try to support IEEE
> subnormals, for example --- I know that all modern hardware is probably
> good with those, but I'd bet against different platforms' libc functions
> all behaving the same.


You don't need to support them. You just have them already.


> I don't see a sane way to offer user control over
> whether we throw underflow errors or not, either.


IEEE 754 talks about CPU design. An "exception" there is not a Postgres
exception; it's an exceptional case in a computation that may raise a flag.
For all those exceptional cases there is a well-defined description of what
value should be returned.
https://en.wikipedia.org/wiki/IEEE_754#Exception_handling

The current code looks like a misreading of what an IEEE 754 exception is, but
on closer look it is a mutation of a misunderstanding of what FLT_MIN is
for (FLT_TRUE_MIN, which would fit there, unfortunately only appeared in C11).


> (Do you really want "+" to stop being immutable?)


No, no kind of GUC switch is needed. Just drop the underflow/overflow checks.
You'll get 0 or Infinity in the expected places, and infinities have been okay
since 2007.
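
Concretely, a sketch of what dropping the checks would mean (the current
errors are as observed in this thread; the right-hand results follow IEEE 754):

select 2e-300::float8 * 2e-300::float8;  -- today: ERROR: value out of range: underflow; IEEE 754: 0
select exp('-inf'::float8);              -- today: the same underflow error; POSIX/IEEE: 0
select 1e300::float8 * 1e300::float8;    -- today: overflow error; IEEE 754: Infinity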

> The darker corners of IEEE754, like inexactness
> exceptions, are even less likely to be implemented consistently
> everywhere.
>
> regards, tom lane
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: exp() versus the POSIX standard

2020-06-11 Thread Komяpa
On Fri, Jun 12, 2020 at 02:57 Tom Lane  wrote:

> I wrote:
> > The POSIX standard says this about the exp(3) function:
> >   If x is -Inf, +0 shall be returned.
> > At least on my Linux box, our version does no such thing:
> > regression=# select exp('-inf'::float8);
> > ERROR:  value out of range: underflow
>
> Now that I look, power() has similar issues:
>
> regression=# select power('1.1'::float8, '-inf');
> ERROR:  value out of range: underflow
> regression=# select power('0.1'::float8, 'inf');
> ERROR:  value out of range: underflow
> regression=# select power('-inf'::float8, '-3');
> ERROR:  value out of range: underflow
> regression=# select power('-inf'::float8, '-4');
> ERROR:  value out of range: underflow
>
> contradicting POSIX which says
>
> For |x| > 1, if y is -Inf, +0 shall be returned.
>
> For |x| < 1, if y is +Inf, +0 shall be returned.
>
> For y an odd integer < 0, if x is -Inf, -0 shall be returned.
>
> For y < 0 and not an odd integer, if x is -Inf, +0 shall be returned.
>


I've had the same issue with multiplying two tiny numbers. Select
2e-300::float * 2e-300::float gives an underflow, and that is not a wanted
thing. This looks like a handmade implementation of IEEE 754's underflow
exception, which should be an optional return flag in addition to a
well-defined number, but became a stop-the-world exception instead. In the
past I had to build a custom Postgres with that logic ripped out to be able to
multiply numbers. I will be happy if that "underflow" (and overflow) thing is
removed.

If in doubt whether this exception should be removed: to follow the spec
fully in this way, you would also have to raise an exception on any inexact
result of operations on floats.







> regards, tom lane
>
>
>


Re: feature idea: use index when checking for NULLs before SET NOT NULL

2020-05-29 Thread Komяpa
>
>
>
> John, I think it's worth pointing out that Postgres most likely does a
> full table scan to validate a constraint by design and not in optimization
> oversight.  Think of what's gonna happen if the index used for checking is
> corrupted?
>
>
This can't be true: a corrupted index is a failure mode, and failure modes
are not expected in the normal flow. By that logic we must never use an
index scan, because if the index is corrupted the results are going to be
disastrous, so we should always do Seq Scans.

It's OK to assume the index is not corrupted.



-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Parallel GiST build on Cube

2020-04-27 Thread Komяpa
Hello,

These are the GiST efforts I know of that can help:
 - Fast sorting GiST build, commitfest entry by Andrey Borodin; not parallel
but faster -
https://www.postgresql.org/message-id/flat/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru

 - Fast sorting GiST build by Nikita Glukhov; reuses btree code, so also
parallel -
https://github.com/postgres/postgres/compare/master...glukhovn:gist_btree_build

 - A "choose subtree" routine is needed, as the current "penalty" is very
inefficient -
https://www.postgresql.org/message-id/flat/CAPpHfdssv2i7CXTBfiyR6%3D_A3tp19%2BiLo-pkkfD6Guzs2-tvEw%40mail.gmail.com#eaa98342462a4713c0d3a94be636e259

These are very much wanted for PostGIS, which also indexes everything by
2-4 dimensional cubes, and they require improvements in the core
infrastructure and the opclass.




On Mon, Apr 27, 2020 at 8:57 PM Shyam Saladi  wrote:

> Hello --
>
> I regularly build GiST indexes on large databases. In recent times, as the
> size of the database has ballooned (30 million rows) along with the build
> time on a column of points in 3- and 8-dimensional space (0-volume cube).
>
> Is anyone working on (or already implemented) a parallel GiST build on
> Cube? If not, I'd like to try contributing this--any pointers from folks
> familiar with the backend of Cube? Any input would be great.
>
> Thanks,
> Shyam
>
> --
> Shyam Saladi 
> NSF Graduate Research Fellow - Clemons Lab
> Biochemistry and Molecular Biophysics
> California Institute of Technology
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Yet another fast GiST build

2020-04-05 Thread Komяpa
Hello Yuri,

PDEP is indeed the first thing that comes up when you start googling
z-curve and bit interleaving :)
We had code generating the PDEP instruction for the z-curve in PostGIS,
and have dropped it since. In sorting, we now utilize sort support / prefix
search, with the key generated as a Hilbert curve, fine-tuned for
different projections' geometric properties.

From this patch, the most valuable thing for us is the sorting build
infrastructure itself. Maybe, to make it a bit more understandable for
people not deep in geometry, it makes sense to first expose the
btree_gist datatypes to this? So that a btree_gist index on an
integer is built exactly the same way a btree index on an integer
is built. That would also give everyone a reference point on the
bottlenecks and optimality of the patch.

On Fri, Apr 3, 2020 at 10:56 AM Yuri Astrakhan  wrote:
>
> Awesome addition!  Would it make sense to use x86's BMI2's PDEP instruction, 
> or is the interleave computation too small of a percentage to introduce 
> not-so-easy-to-port code?  Also, I think it needs a bit more documentation to 
> explain the logic, i.e. a link to 
> https://stackoverflow.com/questions/39490345/interleave-bits-efficiently ?  
> Thx for making it faster :)



-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-19 Thread Komяpa
> > According to my reckoning, that is the remaining objection to the patch
> > as it is (with ordinary freezing behavior).
> >
> > How about a scale_factor od 0.005?  That will be high enough for large
> > tables, which seem to be the main concern here.
>
> Seems low on a first blush. On a large-ish table with 1 billion tuples,
> we'd vacuum every 5 million inserts. For many ETL workloads this will
> result in a vacuum after every bulk operation. Potentially with an index
> scan associated (even if there's no errors, a lot of bulk loads use ON
> CONFLICT INSERT leading to the occasional update).

This is a good and wanted thing. Upthread it was already suggested
that "everyone knows to vacuum after bulk operations". This will vacuum
the data while it's still hot and in the caches, not afterwards by reading
from disk.


> > I am not sure about b).  In my mind, the objective is not to prevent
> > anti-wraparound vacuums, but to see that they have less work to do,
> > because previous autovacuum runs already have frozen anything older than
> > vacuum_freeze_min_age.  So, assuming linear growth, the number of tuples
> > to freeze during any run would be at most one fourth of today's number
> > when we hit autovacuum_freeze_max_age.
>
> This whole chain of arguments seems like it actually has little to do
> with vacuuming insert only/mostly tables. The same problem exists for
> tables that aren't insert only/mostly. Instead it IMO is an argument for
> a general change in logic about when to freeze.
>
> What exactly is it that you want to achieve by having anti-wrap vacuums
> be quicker? If the goal is to reduce the window in which autovacuums
> aren't automatically cancelled when there's a conflicting lock request,
> or in which autovacuum just schedules based on xid age, then you can't
> have wraparound vacuums needing to do substantial amount of work.

The problem that hit Mandrill is simple: in modern cloud environments
it's sometimes simply impossible to read all the data on disk because
of different kinds of throttling.
At some point your production database just shuts down and asks you to
VACUUM in single-user mode for 40 days.

You want vacuum to happen long before that, preferably while the data
is still in RAM, or at least while it fits your cloud provider's disk burst
performance budget, where the performance of the block device resembles that
of an SSD and not of a floppy disk.

Some more reading on how that works:
https://aws.amazon.com/ru/blogs/database/understanding-burst-vs-baseline-performance-with-amazon-rds-and-gp2/

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-13 Thread Komяpa
On Fri, Mar 13, 2020 at 3:19 AM Laurenz Albe  wrote:
>
> On Fri, 2020-03-13 at 09:10 +1300, David Rowley wrote:
> > So you're suggesting we drive the insert-vacuums from existing
> > scale_factor and threshold?  What about the 1 billion row table
> > example above?
>
> I am still not 100% certain if that is really realistic.
> Transactions that insert only a single row are probably the
> exception in large insert-only tables.
>
> But I think that we probably always can find a case where any given
> parameter setting is not so great, so in order to get ahead
> let's decide on something that is not right out stupid.
> Changing the defaults later is always an option.
>
> So the three options are:
>
> 1. introduce no new parameters and trigger autovacuum if the number
>of inserts exceeds the regular vacuum threshold.
>
> 2. introduce the new parameters with high base threshold and zero scale 
> factor.

Both of these look good to me. 1 is the approach from my initial patch
sketch, 2 is the approach taken by Laurenz.
The values I think in when considering vacuum are "how many megabytes of
the table aren't frozen/visible" (since that's what translates into
processing time, knowing the IO limits of the storage) and "how many pages
aren't vacuumed yet".

The threshold in Laurenz's patch was good enough for my taste - it's
basically "vacuum after every gigabyte", and that's exactly what we
implemented when working around this issue manually. There's a good
chance the latest gigabyte is in RAM and vacuum will be super fast on
it; reading a gigabyte of data is not a showstopper for most
contemporary physical and cloud environments I can think of. If
reading a gigabyte is already a problem, then wraparound is a
guaranteed disaster.

Regarding index only scans, this threshold seems good enough too. There's a
good chance the last gigabyte is already in RAM, and the previous data was
processed by the previous vacuum. In any case - with this patch Index Only
Scan starts actually working :)

I'd vote for 2, with a note to "rip it out altogether later and redesign the
scale factor and threshold system into something more easily
graspable". Whoever needs to cancel the new behavior for some reason
will have a knob then, and the patch is laid out already.

> 3. introduce the new parameters with low base threshold and high scale factor.

This looks bad to me. "The bigger the table, the longer we wait" does
not look good for something designed as a measure to prevent
issues with big tables.

> I think all three are viable.
> If nobody else wants to weigh in, throw a coin.
>
> Yours,
> Laurenz Albe
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-05 Thread Komяpa
Hi,

Thanks, Laurenz, for taking action on this and writing a better patch
than my initial one.
This will help avoid both Mandrill-like downtimes and make Index Only
Scan just work on large telemetry databases like the one I was
responsible for back when I was at Juno.

On Thu, Mar 5, 2020 at 9:40 AM David Rowley  wrote:
>
> On Thu, 5 Mar 2020 at 04:15, Laurenz Albe  wrote:
> > I just realized that the exercise is pointless unless that
> > autovacuum also runs with FREEZE on.

> 8. Should we care when setting the insert counter back to 0 if
> auto-vacuum has skipped pages?

I believe it would be enough just to leave a comment about this in code.

> 10. I'm slightly worried about the case where we don't quite trigger a
> normal vacuum but trigger a vacuum due to INSERTs then skip cleaning
> up the indexes but proceed to leave dead index entries causing indexes
> to become bloated.  It does not seem impossible that given the right
> balance of INSERTs and UPDATE/DELETEs that this could happen every
> time and the indexes would just become larger and larger.

Could we simply not reset the statistics about dead tuples after an
index-skipping vacuum, since we didn't really take care of them?

> 11. We probably do also need to debate if we want this on or off by
> default.   I'd have leaned towards enabling by default if I'd not
> personally witnessed the fact that people rarely* increase auto-vacuum
> to run faster than the standard cost settings. I've seen hundreds of
> servers over the years with all workers busy for days on something
> they'll never finish quickly enough.  We increased those settings 10x
> in PG12, so there will be fewer people around suffering from that now,
> but even after having reduced the vacuum_cost_delay x10 over the PG11
> settings, it's by no means fast enough for everyone.  I've mixed
> feelings about giving auto-vacuum more work to do for those people, so
> perhaps the best option is to keep this off by default so as not to
> affect the people who don't tune auto-vacuum.  They'll just suffer the
> pain all at once when they hit max freeze age instead of more
> gradually with the additional load on the workers.   At least adding
> this feature gives the people who do tune auto-vacuum some ability to
> handle read-only tables in some sane way.

That's exactly the situation we're trying to avoid with this patch.
Suffering all at once takes large production deployments down for
weeks, and that gets into the news.
In current cloud setups it's plainly impossible to read the whole
database at all, let alone rewrite it, within the IO budgets.
I say we should enable this setting by default.
If my calculations are correct, that will freeze the table once for each
new gigabyte of data written, which usually fits into burst read
thresholds.
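
For reference, a sketch of how the per-table knob discussed in this thread
ended up looking (these storage parameters shipped later, in PostgreSQL 13;
the table name and values are illustrative):

alter table telemetry set (
    autovacuum_vacuum_insert_threshold = 10000000,
    autovacuum_vacuum_insert_scale_factor = 0
);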




-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa




Re: Yet another fast GiST build

2020-03-01 Thread Komяpa
Hello,

Thanks for the patch and for working on the GiST infrastructure; it's really
valuable for PostGIS use cases, and I'm waiting to see this improvement in
PG13.

On Sat, Feb 29, 2020 at 3:13 PM Andrey M. Borodin  wrote:

> Thomas, I've used your wording almost exactly with explanation how
> point_zorder_internal() works. It has more explanation power than my attempts
> to compose good comment.

PostGIS uses this trick to ensure locality. In PostGIS 3 we enhanced
the trick to use a Hilbert curve instead of a Z-order curve.

For a visual representation have a look at these links:
 - http://blog.cleverelephant.ca/2019/08/postgis-3-sorting.html - as
it's implemented in the PostGIS btree sorting opclass
 - https://observablehq.com/@mourner/hilbert-curve-packing - to
explore the general approach.

Indeed, if it feels insecure to work with bit magic, that implementation
can be left to extensions.

> There is one design decision that worries me most:
> should we use opclass function or index option to provide this sorting 
> information?
> It is needed only during index creation, actually. And having extra i-class 
> only for fast build
> seems excessive.
> I think we can provide both ways and let opclass developers decide?

The reloption variant looks dirty. It won't cover an index on (id uuid,
geom geometry) where id is duplicated (say, a tracked car identifier)
but present in every query - there is no way to pass such a thing as a
reloption.
I'm also concerned about the security of passing a sortsupport function
manually during index creation (what if it's not from the same
extension, or even something (wrong-)user-defined).

We know for sure it's a good idea for all btree_gist types and for
geometry, and I don't see a case where a user would want to disable it.
Just make it part of the operator class; that would also allow fast
creation of multi-column indexes.

Thanks.

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa




Re: [postgis-devel] About EXTENSION from UNPACKAGED on PostgreSQL 13

2020-02-25 Thread Komяpa
Hi,

PostGIS 2.5 had raster and vector blended together in a single extension.
In PostGIS 3, they were split out into the postgis and postgis_raster
extensions.
To upgrade, there is now a postgis_extensions_upgrade() function that
unpackages the raster part out of the postgis extension, upgrades it, and
packages the raster functions back into postgis_raster by utilizing FROM
UNPACKAGED.
Removal of FROM UNPACKAGED breaks the PostGIS 2.5 -> 3.0 upgrade path, and
we haven't yet found a proper replacement, since such a removal wasn't
something we were expecting.

On Tue, Feb 25, 2020 at 11:37 PM Stephen Frost  wrote:
>
> Greetings,
>
> * Darafei "Komяpa" Praliaskouski (m...@komzpa.net) wrote:
> > can it be raised on pgsql-hackers as a thing impacting PostGIS upgrade path?
>
> Why is it impacting the PostGIS upgrade path?  The FROM UNPACKAGED was
> never intended to be used as an upgrade path..
>
> Thanks,
>
> Stephen



-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa




Re: BRIN cost estimate breaks geometric indexes

2020-02-14 Thread Komяpa
Hi,

The patch may be as simple as this one:
https://patch-diff.githubusercontent.com/raw/postgres/postgres/pull/49.diff

A previous mention on -hackers is available at
https://postgrespro.com/list/id/cakjs1f9n-wapop5xz1dtgdpdqmzegqqk4sv2mk-zzugfc14...@mail.gmail.com
-
it seems everyone overlooked back then that the patch there breaks geometric
indexing.

On Tue, Jan 21, 2020 at 2:07 AM Egor Rogov  wrote:

> On 21.01.2020 0:00, Darafei "Komяpa" Praliaskouski wrote:
> > Hi,
> >
> > Found out today that BRIN indexes don't really work for PostGIS and
> > box datatypes.
> >
> > Since
> >
> https://github.com/postgres/postgres/commit/7e534adcdc70866e7be74d626b0ed067c890a251
>  Postgres
>
> > requires datatype to provide correlation statistics. Such statistics
> > wasn't provided by PostGIS and box types.
> >
> > Today I tried to replace a 200gb gist index with 8mb brin index and
> > queries didn't work as expected - it was never used. set
> > enable_seqscan=off helped for a bit but that's not a permanent solution.
> > Plans for context:
> > https://gist.github.com/Komzpa/2cd396ec9b65e2c93341e9934d974826
> >
> > Debugging session on #postgis IRC channel leads to this ticket to
> > create a (not that meaningful) correlation statistics for geometry
> > datatype: https://trac.osgeo.org/postgis/ticket/4625#ticket
> >
> > Postgres Professional mentioned symptoms of the issue in their
> > in-depth manual:
> > https://habr.com/ru/company/postgrespro/blog/346460/ - box datatype
> > showed same unusable BRIN symptoms for them.
>
>
> (Translated to English:
> https://habr.com/en/company/postgrespro/blog/452900/)
>
>
> > A reasonable course of action on Postgres side seems to be to not
> > assume selectivity of 1 in absence of correlation statistics, but
> > something that would prefer such an index to a parallel seq scan, but
> > higher than similar GIST.
> >
> > Any other ideas?
>
>
> As far as I understand, correlation is computed only for sortable types,
> which means that the current concept of correlation works as intended
> only for B-tree indexes.
>
> Ideally, correlation should be computed for (attribute, index) pair,
> taking into account order of values returned by the index scan. Less
> ideal but more easier approach can be to ignore the computed correlation
> for any index access except B-tree, and just assume some predefined
> constant.
>
>
>
>
>

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Marking some contrib modules as trusted extensions

2020-01-29 Thread Komяpa
Hello,


> btree_gin
> btree_gist


I would even ask for btree_gin and btree_gist to be moved to core.

btree_gist ships opclasses for built-in types to be used in GiST
indexes. The btree_* part of the name is confusing, pretending there's some
magic happening that links btree and GiST.

GiST is the most popular way to get geometric indexes, and these often need
to be combined with some class identifier that is used in lookups together.
CREATE INDEX on geom_table using gist (zoom_level, geom); fails for no
reason without btree_gist - the types are shipped in core and
GiST itself is not an extension, but letting one core mechanism be used with
another in an obvious way is for some reason split out.
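
A minimal sketch of that failure and its fix (illustrative table):

create table geom_table (zoom_level int, geom geometry);
create index on geom_table using gist (zoom_level, geom);
-- fails with roughly: data type integer has no default operator class for access method "gist"
create extension btree_gist;
create index on geom_table using gist (zoom_level, geom);  -- now it works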


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


BRIN cost estimate breaks geometric indexes

2020-01-20 Thread Komяpa
Hi,

Found out today that BRIN indexes don't really work for PostGIS and box
datatypes.

Since
https://github.com/postgres/postgres/commit/7e534adcdc70866e7be74d626b0ed067c890a251
Postgres requires the datatype to provide correlation statistics. Such
statistics weren't provided by PostGIS or the box types.

Today I tried to replace a 200 GB GiST index with an 8 MB BRIN index, and
queries didn't work as expected - the index was never used. set
enable_seqscan=off helped for a bit, but that's not a permanent solution.
Plans for context:
https://gist.github.com/Komzpa/2cd396ec9b65e2c93341e9934d974826
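
The swap itself is trivial, and forcing the planner was the only way to see
the BRIN index used at all (a sketch with illustrative table and query):

create index on roads using brin (geom);
set enable_seqscan = off;  -- the index finally gets chosen, but this is not a real fix
explain select * from roads
where geom && ST_MakeEnvelope(27.4, 53.8, 27.7, 54.0, 4326);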

A debugging session on the #postgis IRC channel led to this ticket to create
(not that meaningful) correlation statistics for the geometry datatype:
https://trac.osgeo.org/postgis/ticket/4625#ticket

Postgres Professional mentioned symptoms of the issue in their in-depth
manual: https://habr.com/ru/company/postgrespro/blog/346460/ - the box datatype
showed the same unusable BRIN symptoms for them.

A reasonable course of action on the Postgres side seems to be to not assume
a selectivity of 1 in the absence of correlation statistics, but something
that would make such an index preferable to a parallel seq scan while still
costed higher than a similar GiST.

Any other ideas?

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: cost based vacuum (parallel)

2019-11-03 Thread Komяpa
>
>
> This is somewhat similar to a memory usage problem with a
> parallel query where each worker is allowed to use up to work_mem of
> memory.  We can say that the users using parallel operation can expect
> more system resources to be used as they want to get the operation
> done faster, so we are fine with this.  However, I am not sure if that
> is the right thing, so we should try to come up with some solution for
> it and if the solution is too complex, then probably we can think of
> documenting such behavior.
>

In cloud environments (Amazon + gp2) there's a budget on input/output
operations. If you exceed it for a long time, everything starts to feel like
you are working with a floppy disk.

For ease of configuration, I would need a "max_vacuum_disk_iops" that
would limit the number of input/output operations by all of the vacuums in the
system. If I set it to less than the budget refill rate, I can be sure
that no vacuum runs fast enough to impact any sibling query.

There's also value in a non-throttled VACUUM for smaller tables. On gp2 such
things will be consumed out of the surge budget, and its size is known to the
sysadmin. Let's call it "max_vacuum_disk_surge_iops": if a relation has
fewer blocks than this value and it's a blocking situation in any way
(antiwraparound, interactive console, ...) - go on and run without
throttling.
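
Spelled out as configuration, the two knobs proposed above might look like
this (purely hypothetical GUCs - neither exists):

alter system set max_vacuum_disk_iops = 2000;        -- global ceiling shared by all vacuums
alter system set max_vacuum_disk_surge_iops = 6000;  -- small or blocking vacuums may burst to this
select pg_reload_conf();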

For how to balance the cost: if we know the number of vacuum processes that
were running in the previous second, we can just divide the slot for this
iteration by that previous number.

To correct for overshoots, we can subtract the previous second's overshoot
from the next one's. That would also allow accounting for surge budget usage
and letting it refill, pausing all autovacuum after a manual one for some time.

Accounting precision finer than limiting the count of operations once a
second isn't beneficial for this use case.

Please don't forget that processing one page can become several IOPS (read,
write, WAL).

Does this make sense? :)


Re: Unwanted expression simplification in PG12b2

2019-09-22 Thread Komяpa
Hi,

On Fri, Sep 20, 2019 at 11:14 PM Robert Haas  wrote:

> On Wed, Jul 17, 2019 at 5:20 PM Darafei "Komяpa" Praliaskouski
>  wrote:
> > Indeed, it seems I failed to minimize my example.
> >
> > Here is the actual one, on 90GB table with 16M rows:
> > https://gist.github.com/Komzpa/8d5b9008ad60f9ccc62423c256e78b4c
> >
> > I can share the table on request if needed, but hope that plan may be
> enough.
>
> What I taught the planner to do here had to do with making the costing
> more accurate for cases like this. It now figures out that if it's
> going to stick a Gather in at that point, computing the expressions
> below the Gather rather than above the Gather makes a difference to
> the cost of that plan vs. other plans. However, it still doesn't
> consider any more paths than it did before; it just costs them more
> accurately. In your first example, I believe that the planner should
> be able to consider both GroupAggregate -> Gather Merge -> Sort ->
> Parallel Seq Scan and GroupAggregate -> Sort -> Gather -> Parallel Seq
> Scan, but I think it's got a fixed idea about which fields should be
> fed into the Sort. In particular, I believe it thinks that sorting
> more data is so undesirable that it doesn't want to carry any
> unnecessary baggage through the Sort for any reason. To solve this
> problem, I think it would need to cost the second plan with projection
> done both before the Sort and after the Sort and decide which one was
> cheaper.
>
> This class of problem is somewhat annoying in that the extra planner
> cycles and complexity to deal with getting this right would be useless
> for many queries, but at the same time, there are a few cases where it
> can win big. I don't know what to do about that.
>

A heuristic that I believe should help my case (and I can hardly imagine how it
could break others) is that, in the presence of a Gather, all the function
calls that are parallel safe should be pushed below it.
In a perfect future this query shouldn't even need the subquery that I
extracted for the sake of the OFFSET 0 demo - probably a single pass that, in
the presence of a Gather, tries to push down the whole inner part of the
nested function calls that is Parallel Safe.
If we go as far as starting more workers, it really makes sense to load
them with actual work and not only wait for the master process.


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Yet another fast GiST build

2019-08-26 Thread Komяpa
Hello,

This is very interesting. In my pipeline, the GiST index rebuild is currently
the biggest time-consuming step.

I believe introducing an optional concept of order into the GiST opclass will
be beneficial not only for fast builds, but for other tasks later:
 - CLUSTER can order the table using that notion, in a parallel way.
 - btree_gist can get even closer to btree by keeping the tuples sorted
inside a page.
 - the tree descent on insertion could in the future traverse the list in a
more opportunistic way, calculating the penalty for siblings-by-order first.

I believe that everywhere the idea of ordering is needed, it's provided by
giving a btree opclass.

How about putting a link to a btree opclass inside a gist opclass?


On Mon, Aug 26, 2019 at 10:59 AM Andrey Borodin 
wrote:

> Hi!
>
> In many cases GiST index can be build fast using z-order sorting.
>
> I've looked into proof of concept by Nikita Glukhov [0] and it looks very
> interesting.
> So, I've implemented yet another version of B-tree-like GiST build.
> It's main use case and benefits can be summarized with small example:
>
> postgres=# create table x as select point (random(),random()) from
> generate_series(1,300,1);
> SELECT 300
> Time: 5061,967 ms (00:05,062)
> postgres=# create index ON x using gist (point ) with
> (fast_build_sort_function=gist_point_sortsupport);
> CREATE INDEX
> Time: 6140,227 ms (00:06,140)
> postgres=# create index ON x using gist (point );
> CREATE INDEX
> Time: 32061,200 ms (00:32,061)
>
> As you can see, Z-order build is on order of magnitude faster. Select
> performance is roughly the same. Also, index is significantly smaller.
>
> Nikita's PoC is faster because it uses parallel build, but it intervenes
> into B-tree code a lot (for reuse). This patchset is GiST-isolated.
> My biggest concern is that passing function to relation option seems a bit
> hacky. You can pass there any function matching sort support signature.
> Embedding this function into opclass makes no sense: it does not affect
> scan anyhow.
>
> In current version, docs and tests are not implemented. I want to discuss
> overall design. Do we really want yet another GiST build, if it is 3-10
> times faster?
>
> Thanks!
>
> Best regards, Andrey Borodin.
>
> [0]
> https://github.com/postgres/postgres/compare/master...glukhovn:gist_btree_build
>
>

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: max_parallel_workers can't actually be set?

2019-08-17 Thread Komяpa
Hi,

On my PG11 I set it to 64 at setup time; it propagated to
postgresql.auto.conf and is set after a restart. I've upgraded to PG12 since
then, and the parameter is read from postgresql.auto.conf correctly and is
displayed via SHOW (just checked on 12beta3).

I also spent some time trying to get a plan that would give me 32 workers.
The largest I ever got without taking a hammer to it was 16, which is half of
the available cores, or all the non-HT ones. I still haven't found a way to
set costs and limits so that a single query loads the whole system.

ALTER TABLE ... SET (parallel_workers=32); is currently my favorite
hammer. I set max_worker_processes to 512 and let the OS scheduler sort out
the hours when four queries run 128 CPU-bound processes on a 32-core machine;
it's not as good as if the limits were adjusted dynamically after the query
starts, but it's much better than running the second query with just 1 worker
even after the first one finishes.
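
Spelled out, the combination described above looks roughly like this (the
per-gather limit is my assumption - it also has to be raised for 32 workers
per query to be usable):

alter system set max_worker_processes = 512;            -- needs a restart
alter system set max_parallel_workers = 64;
alter system set max_parallel_workers_per_gather = 32;  -- assumption, not mentioned above
alter table big_table set (parallel_workers = 32);      -- the per-table hammer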

On Sat, Aug 17, 2019 at 8:41 PM Tom Lane  wrote:

> Try this:
> alter system set max_parallel_workers = 20;
> and restart the system.
>
> max_parallel_workers is still 8, according to both SHOW and
> pg_controldata, nor can you launch more than 8 workers.
>
> Even odder, if you just do
>
> regression=# set max_parallel_workers = 200;
> SET
> regression=# show max_parallel_workers;
>  max_parallel_workers
> --
>  200
> (1 row)
>
> which should certainly not happen for a PGC_POSTMASTER parameter.
>
> We seem to have an awful lot of mechanism that's concerned with
> adjustments of max_parallel_workers, for something that apparently
> might as well be a compile-time #define ... so I assume it's supposed
> to be changeable at restart and somebody broke it.  But it's not
> working as I'd expect in any branch from 10 onwards.
>
> regards, tom lane
>
>
>

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Rethinking opclass member checks and dependency strength

2019-08-09 Thread Komяpa
>
> But none of our contrib modules do it like that, and I'd lay long odds
> against any third party code doing it either.


Thoughts?
>


PostGIS has some rarely used box operations as part of the GiST opclass, like
"overabove".

These are a source of misunderstanding, as they hinge on the fact that a
non-rectangular geometry will be coerced into a box on the call, which is not
obvious when you call them on something like diagonal linestrings.
It may happen that we decide to remove them. In such circumstances, I
expect that ALTER OPERATOR CLASS DROP OPERATOR will work.
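
Something along these lines (the SQL-level spelling is actually
ALTER OPERATOR FAMILY ... DROP OPERATOR; the opfamily name and strategy number
below are illustrative, not the real PostGIS values):

alter operator family gist_geometry_ops_2d using gist
    drop operator 11 (geometry, geometry);  -- e.g. the rarely used "overabove" |&>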

The other thing I would expect is that DROP FUNCTION ... CASCADE will
remove the operator and unregister it from the operator class without
dropping the operator class itself.


Re: Unwanted expression simplification in PG12b2

2019-07-17 Thread Komяpa
Hi,

On Wed, Jul 17, 2019 at 11:58 PM Tom Lane  wrote:

> =?UTF-8?Q?Darafei_=22Kom=D1=8Fpa=22_Praliaskouski?= 
> writes:
> > Many thanks for the parallel improvements in Postgres 12. Here is one of
> > cases where a costy function gets moved from a parallel worker into main
> > one, rendering spatial processing single core once again on some queries.
> > Perhaps an assumption "expressions should be mashed together as much as
> > possible" should be reviewed and something along "biggest part of
> > expression should be pushed down into parallel worker"?
>
> I don't see anything in your test case that proves what you think it does.
> The expensive calculation *is* being done in the worker in the first
> example.  It's not real clear to me why the first example is only choosing
> to use one worker rather than 3, but probably with a larger test case
> (ie bigger table) that decision would change.
>

Indeed, it seems I failed to minimize my example.

Here is the actual one, on a 90 GB table with 16M rows:
https://gist.github.com/Komzpa/8d5b9008ad60f9ccc62423c256e78b4c

I can share the table on request if needed, but I hope the plan may be
enough.

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Unwanted expression simplification in PG12b2

2019-07-17 Thread Komяpa
Hi,

Many thanks for the parallel improvements in Postgres 12. Here is one of the
cases where a costly function gets moved from a parallel worker into the main
one, rendering spatial processing single-core once again on some queries.
Perhaps the assumption "expressions should be mashed together as much as
possible" should be reviewed, in favor of something like "the biggest part of
the expression should be pushed down into the parallel worker"?

PostgreSQL 12beta2 (Ubuntu 12~beta2-1.pgdg19.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 8.3.0-6ubuntu1) 8.3.0, 64-bit

Here is a reproducer:


-- setup
create extension postgis;
create table postgis_test_table (a geometry, b geometry, id int);
set force_parallel_mode to on;
insert into postgis_test_table (select 'POINT EMPTY', 'POINT EMPTY',
generate_series(0,1000) );


-- unwanted inlining moves difference and unary union calculation into
master worker
21:43:06 [gis] > explain verbose select ST_Collect(geom), id from
(select ST_Difference(a,ST_UnaryUnion(b)) as geom, id from
postgis_test_table) z group by id;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Gather  (cost=159.86..42668.93 rows=200 width=36)
   Output: (st_collect(st_difference(postgis_test_table.a, st_unaryunion(postgis_test_table.b)))), postgis_test_table.id
   Workers Planned: 1
   Single Copy: true
   ->  GroupAggregate  (cost=59.86..42568.73 rows=200 width=36)
         Output: st_collect(st_difference(postgis_test_table.a, st_unaryunion(postgis_test_table.b))), postgis_test_table.id
         Group Key: postgis_test_table.id
         ->  Sort  (cost=59.86..61.98 rows=850 width=68)
               Output: postgis_test_table.id, postgis_test_table.a, postgis_test_table.b
               Sort Key: postgis_test_table.id
               ->  Seq Scan on public.postgis_test_table  (cost=0.00..18.50 rows=850 width=68)
                     Output: postgis_test_table.id, postgis_test_table.a, postgis_test_table.b
(12 rows)

-- when constrained by OFFSET 0, the costly calculation is kept in parallel workers
21:43:12 [gis] > explain verbose select ST_Collect(geom), id from
(select ST_Difference(a,ST_UnaryUnion(b)) as geom, id from
postgis_test_table offset 0) z group by id;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=13863.45..13872.33 rows=200 width=36)
   Output: st_collect(z.geom), z.id
   Group Key: z.id
   ->  Sort  (cost=13863.45..13865.58 rows=850 width=36)
         Output: z.id, z.geom
         Sort Key: z.id
         ->  Subquery Scan on z  (cost=100.00..13822.09 rows=850 width=36)
               Output: z.id, z.geom
               ->  Gather  (cost=100.00..13813.59 rows=850 width=36)
                     Output: (st_difference(postgis_test_table.a, st_unaryunion(postgis_test_table.b))), postgis_test_table.id
                     Workers Planned: 3
                     ->  Parallel Seq Scan on public.postgis_test_table  (cost=0.00..13712.74 rows=274 width=36)
                           Output: st_difference(postgis_test_table.a, st_unaryunion(postgis_test_table.b)),

Re: GiST "choose subtree" support function to inline penalty

2019-06-27 Thread Komяpa
On Thu, Jun 27, 2019 at 6:00 AM Tom Lane  wrote:

> =?UTF-8?Q?Darafei_=22Kom=D1=8Fpa=22_Praliaskouski?= 
> writes:
> > I'm looking at PostGIS geometry GiST index build times and try to
> optimize
> > withing the current GiST framework. The function that shows a lot on my
> > flame graphs is penalty.
>
> > I spent weekend rewriting PostGIS penalty to be as fast as possible.
> > (FYI https://github.com/postgis/postgis/pull/425/files)
>
> > However I cannot get any meaningfully faster build time. Even when I
> strip
> > it to "just return edge extension" index build time is the same.
>
> TBH this makes me wonder whether the real problem isn't so much "penalty
> function is too slow" as "penalty function is resulting in really bad
> index splits".
>

As an extension writer I don't have much control over how Postgres calls the
penalty function. The PostGIS box uses floats instead of doubles, so it is
half the size of the Postgres built-in box, meaning the penalty function is
called even more often on the better-packed pages.

I can get index construction to be much faster if I break the penalty so that
it actually results in horrible splits: index size grows by 50%, construction
is 30% faster.


>
> It might be that giving the opclass higher-level control over the split
> decision can help with both aspects.


Please note the question is not about the split. Korotkov's split is working
fine. The issue is with the penalty and the computations required for choosing
the subtree before the split happens.

Andrey Borodin proposed off-list that we could provide our own index type
that is a copy of GiST but with the penalty inlined into the "choose subtree"
code path, as that seems to be the only way to do it in PG12. Is there a more
humane option than forking GiST?



> But never start micro-optimizing
> an algorithm until you're sure it's the right algorithm.
>

That's exactly the reason I wrote the original letter. I don't see any option
for further optimization within the existing GiST framework, but this
optimization is needed: waiting 10 hours for a GiST index to build after an
hour of ingesting the dataset is frustrating, especially when you see a nearby
b-tree built in an hour.


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


GiST "choose subtree" support function to inline penalty

2019-06-24 Thread Komяpa
Hi,

I'm looking at PostGIS geometry GiST index build times and trying to optimize
within the current GiST framework. The function that shows up a lot on my
flame graphs is the penalty function.

I spent weekend rewriting PostGIS penalty to be as fast as possible.
(FYI https://github.com/postgis/postgis/pull/425/files)

However I cannot get any meaningfully faster build time. Even when I strip it
down to "just return the edge extension", index build time is the same.

Is there a way to inline the penalty into the "choose subtree" loop somehow?
That would also let us stop bit-fiddling floats to simulate a more complex
choosing scheme.

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: How do we support FULL JOIN on PostGIS types?

2019-06-03 Thread Komяpa
Hi,

Thanks a lot RhodiumToad on IRC for suggestion of setting HASHES, MERGES on
OPERATOR =.
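
For reference, this is roughly what the flag declaration looks like at
operator creation time (a sketch - the support function and the remaining
options are simplified here, not the exact PostGIS definition):

CREATE OPERATOR = (
    LEFTARG    = geometry,
    RIGHTARG   = geometry,
    PROCEDURE  = geometry_eq,
    COMMUTATOR = =,
    RESTRICT   = eqsel,
    JOIN       = eqjoinsel,
    HASHES,
    MERGES
);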

Now we have another problem: how do we set these flags on upgrade to a new
version of the extension? Dropping OPERATOR = would drop all indexes and views
depending on it, so that isn't really an option.

Also, if someone can sneak the "ERROR: FULL JOIN is only supported with
merge-joinable or hash-joinable join conditions" keywords into
https://www.postgresql.org/docs/current/xoper-optimization.html#id-1.8.3.17.8
it would greatly help future extension writers - it's currently not possible
to find that page by googling the error message.

On Thu, May 16, 2019 at 7:05 PM Darafei "Komяpa" Praliaskouski <
m...@komzpa.net> wrote:

> Hi!
>
> Greetings from OSGeo Code sprint in Minneapolis :)
>
> We're trying to make FULL JOIN on equality of geometry and can't figure
> out why it doesn't work.
>
> Here's reproducer, it works on bytea but not on PostGIS geometry throwing
> out
>
> ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable
> join conditions
>
> https://trac.osgeo.org/postgis/ticket/4394
>
> We already have a btree opclass with equality:
>
> https://github.com/postgis/postgis/blob/svn-trunk/postgis/postgis.sql.in#L420
>
>
> We also have hash equality opclass:
>
> https://github.com/postgis/postgis/blob/svn-trunk/postgis/postgis.sql.in#L440
>
>
> Reading through Postgres documentation I can't figure out what else shall
> we do for this join to work. How do we make it work?
>
> --
> Darafei Praliaskouski
> Support me: http://patreon.com/komzpa
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


How do we support FULL JOIN on PostGIS types?

2019-05-16 Thread Komяpa
Hi!

Greetings from OSGeo Code sprint in Minneapolis :)

We're trying to make FULL JOIN work on equality of geometry and can't figure
out why it doesn't.

Here's a reproducer; it works on bytea but not on PostGIS geometry, throwing:

ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable
join conditions

https://trac.osgeo.org/postgis/ticket/4394

We already have a btree opclass with equality:
https://github.com/postgis/postgis/blob/svn-trunk/postgis/postgis.sql.in#L420


We also have hash equality opclass:
https://github.com/postgis/postgis/blob/svn-trunk/postgis/postgis.sql.in#L440


Reading through the Postgres documentation I can't figure out what else we
should do for this join to work. How do we make it work?

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-04-14 Thread Komяpa
>
>
> >I don't think it's helpful to force emergency vacuuming more
> >frequently;
> >quite the contrary, it's likely to cause even more issues.  We should
> >tweak autovacuum to perform freezing more preemtively instead.
>
> I still think the fundamental issue with making vacuum less painful is
> that the all indexes have to be read entirely. Even if there's not much
> work (say millions of rows frozen, hundreds removed). Without that issue we
> could vacuum much more frequently. And do it properly in insert only
> workloads.
>

Deletion of hundreds of rows on default settings will cause the same behavior
now. If there were 0 updates, the index cleanup is currently skipped.

https://commitfest.postgresql.org/22/1817/ got merged. This means autovacuum
can have two separate thresholds: the current one, on dead tuples, triggering
VACUUM the same way it does now, and a new one, on inserted tuples only,
triggering VACUUM (INDEX_CLEANUP FALSE)?
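
Roughly, that insert-driven pass would be equivalent to running something like
this by hand (PostgreSQL 12 syntax; the table name is made up):

VACUUM (INDEX_CLEANUP FALSE, FREEZE) big_append_only_table;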

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-04-14 Thread Komяpa
On Wed, Apr 10, 2019 at 6:13 PM Alvaro Herrera 
wrote:

> On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:
>
> > Alternative point of "if your database is super large and actively
> written,
> > you may want to set autovacuum_freeze_max_age to even smaller values so
> > that autovacuum load is more evenly spread over time" may be needed.
>
> I don't think it's helpful to force emergency vacuuming more frequently;
> quite the contrary, it's likely to cause even more issues.  We should
> tweak autovacuum to perform freezing more preemtively instead.
>

Okay. What would be your recommendation for the case of Mandrill running the
current Postgres 11? Which parameters should they tune, and to which values?



>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-04-06 Thread Komяpa
>
> The invoking autovacuum on table based on inserts, not only deletes
> and updates, seems good idea to me. But in this case, I think that we
> can not only freeze tuples but also update visibility map even when
> setting all-visible. Roughly speaking  I think vacuum does the
> following operations.
>
> 1. heap vacuum
> 2. HOT pruning
> 3. freezing tuples
> 4. updating visibility map (all-visible and all-frozen)
> 5. index vacuum/cleanup
> 6. truncation
>
> With the proposed patch[1] we can control to do 5 or not. In addition
> to that, another proposed patch[2] allows us to control 6.
>

[1] is committed, and [2] is nearing commit. It seems we now have all the
infrastructure to teach autovacuum to run itself based on inserts without
hurting anybody?

...

> [1] https://commitfest.postgresql.org/22/1817/
> [2] https://commitfest.postgresql.org/22/1981/
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-04-04 Thread Komяpa
On Fri, Apr 5, 2019 at 6:58 AM Tom Lane  wrote:

> Andres Freund  writes:
> > I think the right approach would be to do all of this in heap_insert and
> > heap_multi_insert. Whenever starting to work on a page, if INSERT_FROZEN
> > is specified, remember whether it is either currently empty, or is
> > already marked as all-visible. If previously empty, mark it as all
> > visible at the end. If already all visible, there's no need to change
> > that. If not yet all-visible, no need to do anything, since it can't
> > have been inserted with COPY FREEZE.  Do you see any problem doing it
> > that way?
>
> Do we want to add overhead to these hot-spot routines for this purpose?
>

Sizing the overhead: workflows right now don't end with COPY FREEZE - you
need another VACUUM to set the visibility map. Anything that lets you skip
that VACUUM (and is faster than that VACUUM itself) helps. You specifically
asked for it to be skippable with FREEZE anyway.
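
For illustration, the workflow being discussed looks roughly like this
(hypothetical table and file names):

BEGIN;
TRUNCATE measurements;  -- or CREATE TABLE in the same transaction
COPY measurements FROM '/tmp/measurements.csv' (FORMAT csv, FREEZE true);
COMMIT;
VACUUM measurements;    -- today still needed just to set visibility map bits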


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Compressed TOAST Slicing

2019-04-02 Thread Komяpa
Hi!


> I'll plan to push this tomorrow with the above change (and a few
> additional comments to explain what all is going on..).


Is everything OK? Can it be pushed?

I'm looking here, haven't found it pushed, and am worried about it.
https://github.com/postgres/postgres/commits/master



-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-31 Thread Komяpa
>
>
> > Idea: look not on dead tuples, but on changes, just like ANALYZE does.
> > It's my first patch on Postgres, it's probably all wrong but I hope it
> > helps you get the idea.
>
> This was suggested and rejected years ago:
>
> https://www.postgresql.org/message-id/b970f20f-f096-2d3a-6c6d-ee887bd30...@2ndquadrant.fr


Thank you for sharing the link. I've read through the thread and see you
posted two patches, the first being similar to but different from mine, and
the second being about a different matter.

I don't see "rejected" there, just a common distraction of "you should also
consider this" and time-out leading to "returned with feedback" at the end.

The thing is, we now have large dead production systems and post-mortems
because your patch wasn't pushed back in 2016, so the situation is different.
Let's push at least the first of your two patches, or mine.

Which one is better and why?

I believe mine, as it just follows a pattern already established and proven in
autoanalyze. If vacuum comes and is unable to harvest some dead tuples, it
will keep coming back in your case, while in mine it will just sleep until it
gets new changes, which looks better to me - there's no busy loop in case some
dead tuples are stuck forever.
If someone thinks yours is better, we may also consider it for autoanalyze?


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-31 Thread Komяpa
>
> If it's months, we probably want limit vacuum to working at a pretty
> slow rate, say 1% of the table size per hour or something.  If it's in
> hours, we need to be a lot more aggressive.  Right now we have no
> information to tell us which of those things is the case, so we'd just
> be shooting in the dark.


The thing is, you don't need to spread out your vacuum in time if the rate of
vacuuming matches the rate of table growth. Can we mark tuples/pages as
all-visible and all-frozen, say, the moment they're pushed out of
shared_buffers?


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-31 Thread Komяpa
>
> By the way, the Routine Vacuuming chapter of the documentation says:
>
> "The sole disadvantage of increasing autovacuum_freeze_max_age (and
> vacuum_freeze_table_age along with it) is that the pg_xact and
> pg_commit_ts subdirectories of the database cluster will take more space
>
> [...]
>
> If [pg_xact and pg_commit_ts taking 0.5 and 20 GB, respectively]
> is trivial compared to your total database size, setting
> autovacuum_freeze_max_age to its maximum allowed value is recommended."
>
> Maybe this should be qualified with "unless you have trouble with your
> autovacuum keeping up" or so; or generally reworded?


This recommendation is written in the mindset of "wraparound never happens".
If your database is large, you have more chances to hit it painfully, and if
it's append-only, even more so.

An alternative note along the lines of "if your database is super large and
actively written, you may want to set autovacuum_freeze_max_age to even
smaller values so that autovacuum load is spread more evenly over time" may be
needed.




-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-31 Thread Komяpa
On Thu, Mar 28, 2019 at 6:43 PM Masahiko Sawada 
wrote:

> >> 1. heap vacuum
> >>
> >> 2. HOT pruning
> >
> > Is it worth skipping it if we're writing a page anyway for the sake of
> hint bits and new xids? This will all be no-op anyway on append-only tables
> and happen only when we actually need something?
> >
>
> Yeah, these operations are required only when the table has actual
> garbage. IOW, append-only tables never require them.
>
> >>
> >> 3. freezing tuples
> >> 4. updating visibility map (all-visible and all-frozen)
> >
> > These two are needed, and current autovacuum launch process does not
> take into account that this is also needed for non-dead tuples.
> >
> >>
> >> 5. index vacuum/cleanup
> >
> > There is a separate patch for that. But, since
> https://commitfest.postgresql.org/16/952/ for almost a year already
> Postgres skips index cleanup on tables without new dead tuples, so this
> case is taken care of already?
>
> I think that's not enough. The feature "GUC for cleanup index
> threshold" allows us to skip only index cleanup when there are less
> insertion than the fraction of the total number of heap tuples since
> last index cleanup. Therefore it helps only append-only tables (and
> supporting only btree index for now). We still have to do index
> vacuuming even if the table has just a few dead tuple. The proposed
> patch[1] helps this situation; vacuum can run while skipping index
> vacuuming and index cleanup.
>

So, the patch I posted can be technically applied after
https://commitfest.postgresql.org/22/1817/ gets merged?

The change with my patch is that a table with 49 insertions and one delete:
 - previously would wait for 49 more deletes by default (ignoring insertions),
and only then clean up both the table and the indexes;
 - with the patch will freeze/update the VM for the insertions, and scan the
index.

In my experience only the btree index requires a slow full index scan; that's
why it was the only one covered by the "GUC for cleanup index threshold"
patch. Is that wrong, and do more index types do a full index scan on vacuum
after the deletion of a single tuple?



> >> 6. truncation
> >
> > This shouldn't be a heavy operation?
> >
>
> I don't think so. This could take AccessExclusiveLock on the table and
> take a long time with large shared buffer as per reported on that
> thread[2].
>

While this can be a useful optimization, I believe it is out of scope for this
patch. I want to fix vacuum never coming to append-only tables, without
breaking other behaviors. Truncation likely implies enough dead tuples to
trigger a vacuum via the currently existing mechanisms.


> >>
> >>
> >> With the proposed patch[1] we can control to do 5 or not. In addition
> >> to that, another proposed patch[2] allows us to control 6.
> >>
> >> For append-only tables (and similar tables), what we periodically want
> >> to do would be 3 and 4 (possibly we can do 2 as well). So maybe we
> >> need to have both an option of (auto)vacuum to control whether to do 1
> >> and something like a new autovacuum threshold (or an option) to invoke
> >> the vacuum that disables 1, 5 and 6. The vacuum that does only 2, 3
> >> and 4 would be much cheaper than today's vacuum and anti-wraparound
> >> vacuum would be able to skip almost pages.
> >
> >
> > Why will we want to get rid of 1? It's a noop from write perspective and
> saves a scan to do it if it's not noop.
> >
>
> Because that's for tables that have many inserts but have some
> updates/deletes. I think that this strategy would help not only
> append-only tables but also such tables.
>

How much do we save by skipping a heap vacuum on an almost-append-only table,
where the number of updates is below 50, the current threshold?


>
> > Why make it faster in emergency situations when situation can be made
> non-emergency from the very beginning instead?
> >
>
> I don't understand the meaning of "situation can be made non-emergency
> from the very beginning". Could you please elaborate on that?
>

Let's imagine a simple append-only workflow on a Postgres with current default
settings. You create a table and start inserting tuples, one per transaction.
Let's imagine a page fits 50 tuples (my case for taxi movement data), and
Amazon gp2 storage which caps you, say, at 1000 IOPS in non-burst mode.
The anti-wraparound autovacuum (we need a drawing of a misreading of this term
with a crossed-out car bent in space) will be triggered after
autovacuum_freeze_max_age inserts, 200000000 by default. That converts into
4000000 pages, or around 32 GB. It will be the first vacuum ever on that
table, since no other mechanism triggers it, and if it steals all the
available IOPS, it will finish in 200000000 / 50 / 1000 = 4000 seconds,
killing prod for over an hour.

Telemetry workloads can easily generate 32 GB of data a day (I've seen
more, but let's stick to that number). Production going down for an hour a
day isn't good and I consider it an emergency.

Now, two ways to fix it that reading documentation leads you while you're a
sleepy 

Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-28 Thread Komяpa
Hi,

> > Why not select a table that has inserts, updates and deletes for
autovacuum just like we do for autoanalyze, not only deletes and updates
like we do now?

> >
> > Sounds like a good idea, although I do agree with Alvaro when he
> > mentions that it would be good to only invoke a worker that was only
> > going to freeze tuples and not look at the indexes.
>
> The invoking autovacuum on table based on inserts, not only deletes
> and updates, seems good idea to me. But in this case, I think that we
> can not only freeze tuples but also update visibility map even when
> setting all-visible. Roughly speaking  I think vacuum does the
> following operations.
>
> 1. heap vacuum
> 2. HOT pruning
>
Is it worth skipping it if we're writing a page anyway for the sake of hint
bits and new xids? This will all be a no-op anyway on append-only tables and
will happen only when we actually need something?


> 3. freezing tuples
> 4. updating visibility map (all-visible and all-frozen)
>
These two are needed, and the current autovacuum launch process does not take
into account that they are also needed for non-dead tuples.


> 5. index vacuum/cleanup
>
There is a separate patch for that. But since
https://commitfest.postgresql.org/16/952/, Postgres has already been skipping
index cleanup on tables without new dead tuples for almost a year, so this
case is taken care of already?
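
(For reference, that threshold is the vacuum_cleanup_index_scale_factor
setting added in PG11; raising it, e.g.

SET vacuum_cleanup_index_scale_factor = 10;

makes the btree cleanup passes on an insert-only table even rarer. The value
here is only illustrative.)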


> 6. truncation
>
This shouldn't be a heavy operation?


>
> With the proposed patch[1] we can control to do 5 or not. In addition
> to that, another proposed patch[2] allows us to control 6.
>
> For append-only tables (and similar tables), what we periodically want
> to do would be 3 and 4 (possibly we can do 2 as well). So maybe we
> need to have both an option of (auto)vacuum to control whether to do 1
> and something like a new autovacuum threshold (or an option) to invoke
> the vacuum that disables 1, 5 and 6. The vacuum that does only 2, 3
> and 4 would be much cheaper than today's vacuum and anti-wraparound
> vacuum would be able to skip almost pages.
>

Why would we want to get rid of 1? It's a no-op from the write perspective,
and doing it here saves a separate scan when it's not a no-op.

Why make things faster in emergency situations when the situation can be made
non-emergency from the very beginning instead?


>
> [1] https://commitfest.postgresql.org/22/1817/
> [2] https://commitfest.postgresql.org/22/1981/
>
> Regards,
>
> --
> Masahiko Sawada
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-28 Thread Komяpa
On Thu, Mar 28, 2019 at 12:32 PM David Rowley 
wrote:

> On Thu, 28 Mar 2019 at 22:04, Darafei "Komяpa" Praliaskouski
>  wrote:
> >
> > On Thu, Mar 28, 2019 at 2:36 AM David Rowley <
> david.row...@2ndquadrant.com> wrote:
> >> I thought recently that it would be good to have some sort of
> >> pro-active auto-vacuum mode that made use of idle workers.
> >
> > Problem with "idle" is that it never happens on system that are going to
> wraparound on their lifetime. This has to be a part of normal database
> functioning.
>
> I'd say auto-vacuum is configured to run too slowly if you never have
> an idle worker. The chances that it happens to be running at exactly
> the right speed to keep up with demand must be about close to nil.
>
> > Why not select a table that has inserts, updates and deletes for
> autovacuum just like we do for autoanalyze, not only deletes and updates
> like we do now?
>
> Sounds like a good idea, although I do agree with Alvaro when he
> mentions that it would be good to only invoke a worker that was only
> going to freeze tuples and not look at the indexes.


This is already the current behavior of VACUUM on tables without dead tuples.
The issue is that nothing triggers this VACUUM apart from the user performing
VACUUM manually, or the super-late anti-wraparound vacuum.

> Any patch not in the current CF is already PG13 or beyond. Having at
> least a freeze only vacuum mode may ease some pain, even if it still
> needs to be done manually for anyone finding themselves in a similar
> situation as mailchimp.
>

If you're in a wraparound halt with a super-large table on Amazon gp2, nothing
will help you - the issue is, there's no way to "rewrite all of it quickly".
The burst limit lets you experience the shared drive as if it were an SSD for
most of your load, but reading and re-writing all the storage gets throttled,
and there's no option to escape this quickly.

The process that freezes and marks pages all-visible has to run in parallel
with, and at the speed of, your backend pushing pages to disk, maybe lagging
behind a bit - but not to the point of "we need to rescan the whole table".


>
> The idea I was mentioning was more targeted to ease the sudden rush of
> auto-vacuum activity when suddenly a bunch of large tables require an
> anti-wraparound vacuum all at once.
>
> [1] https://commitfest.postgresql.org/22/1817/
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-28 Thread Komяpa
On Thu, Mar 28, 2019 at 2:36 AM David Rowley 
wrote:

> On Thu, 28 Mar 2019 at 11:01, Alvaro Herrera 
> wrote:
> >
> > On 2019-Mar-28, Darafei "Komяpa" Praliaskouski wrote:
> > > "Nearing wraparound" is too late already. In Amazon, reading table
> from gp2
> > > after you exhausted your IOPS burst budget is like reading a floppy
> drive,
> > > you have to freeze a lot earlier than you hit several terabytes of
> unfrozen
> > > data, or you're dead like Mandrill's Search and Url tables from the
> link I
> > > shared.
> >
> > OK, then start freezing tuples in the cheap mode (skip index updates)
> > earlier than that.  I suppose a good question is when to start.
>
> I thought recently that it would be good to have some sort of
> pro-active auto-vacuum mode that made use of idle workers.


Problem with "idle" is that it never happens on system that are going to
wraparound on their lifetime. This has to be a part of normal database
functioning.

Why not select tables that have inserts, updates and deletes for autovacuum,
just like we do for autoanalyze, instead of only deletes and updates like we
do now?


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-27 Thread Komяpa
Hi,

чт, 28 мар. 2019 г. в 00:32, Alvaro Herrera :

> On 2019-Mar-27, Darafei "Komяpa" Praliaskouski wrote:
>
> > Attached is sketch of small patch that fixes several edge cases with
> > autovacuum. Long story short autovacuum never comes to append only
> tables,
> > killing large productions.
>
> Yeah, autovac is not coping with these scenarios (and probably others).
> However, rather than taking your patch's idea verbatim, I think we
> should have autovacuum use separate actions for those two (wildly
> different) scenarios.  For example:
>
> * certain tables would have some sort of partial scan that sets the
>   visibility map.  There's no reason to invoke the whole vacuuming
>   machinery.  I don't think this is limited to append-only tables, but
>   rather those are just the ones that are affected the most.
>

What other machinery runs on a VACUUM invocation that is not wanted there?
Since Postgres 11, index cleanup is already skipped on append-only tables.


> * tables nearing wraparound danger should use the (yet to be committed)
>   option to skip index cleaning, which makes the cleanup action faster.
>   Again, no need for complete vacuuming.
>

"Nearing wraparound" is too late already. In Amazon, reading table from gp2
after you have exhausted your IOPS burst budget is like reading a floppy
drive; you have to freeze a lot earlier than when you hit several terabytes of
unfrozen data, or you're dead like Mandrill's Search and Url tables from the
link I shared.


>
>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Berserk Autovacuum (let's save next Mandrill)

2019-03-27 Thread Komяpa
Hi hackers,

Attached is a sketch of a small patch that fixes several edge cases with
autovacuum. Long story short: autovacuum never comes to append-only tables,
killing large production systems.

First case, mine.

https://www.postgresql.org/message-id/CAC8Q8tLBeAxR%2BBXWuKK%2BHP5m8tEVYn270CVrDvKXt%3D0PkJTY9g%40mail.gmail.com

We had a table we were appending to and wanted Index Only Scan to work. For it
to work, you need to call VACUUM manually, since VACUUM is the only way to
mark pages all-visible, and autovacuum never comes to append-only tables. We
were clever enough to invent a workflow without dead tuples, and it painfully
bit us.

Second case, just read in the news.
https://mailchimp.com/what-we-learned-from-the-recent-mandrill-outage/

Mandrill has a 6 TB append-only table that autovacuum probably never vacuumed.
Then the anti-wraparound vacuum came and production went down. If autovacuum
had done its job before that last moment, it would probably have been okay.
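
A quick way to spot tables in this situation (a sketch): append-mostly tables
show a large n_tup_ins while last_autovacuum stays NULL.

SELECT relname, n_tup_ins, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_tup_ins DESC;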

Idea: look not at dead tuples, but at changes, just like ANALYZE does.
It's my first patch for Postgres; it's probably all wrong, but I hope it helps
you get the idea.
-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index fa875db816..e297fc8c4b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3055,7 +3055,7 @@ relation_needs_vacanalyze(Oid relid,
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
 		reltuples = classForm->reltuples;
-		vactuples = tabentry->n_dead_tuples;
+		vactuples = tabentry->changes_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
 
 		vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2a8472b91a..cd4611d64f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4687,6 +4687,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 		result->n_live_tuples = 0;
 		result->n_dead_tuples = 0;
 		result->changes_since_analyze = 0;
+		result->changes_since_vacuum = 0;
 		result->blocks_fetched = 0;
 		result->blocks_hit = 0;
 		result->vacuum_timestamp = 0;
@@ -5817,6 +5818,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_vacuum = tabmsg->t_counts.t_changed_tuples;
 			tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
 
@@ -5850,6 +5852,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_vacuum += tabmsg->t_counts.t_changed_tuples;
 			tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
 		}
@@ -6083,6 +6086,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
 
 	tabentry->n_live_tuples = msg->m_live_tuples;
 	tabentry->n_dead_tuples = msg->m_dead_tuples;
+	tabentry->changes_since_vacuum = 0;
 
 	if (msg->m_autovacuum)
 	{
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c080fa6388..65e38edd8d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -643,6 +643,7 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter n_live_tuples;
 	PgStat_Counter n_dead_tuples;
 	PgStat_Counter changes_since_analyze;
+	PgStat_Counter changes_since_vacuum;
 
 	PgStat_Counter blocks_fetched;
 	PgStat_Counter blocks_hit;


Re: PostgreSQL pollutes the file system

2019-03-27 Thread Komяpa
Hello,

At the very least, my Ubuntu Cosmic has createdb, createuser and createlang
on the user's PATH, and I have had at least two cases where people were trying
to use createuser to create a new OS user.

I would prefer them to have a pg_ prefix, to reduce confusion.

On Wed, Mar 27, 2019 at 4:51 PM Tomas Vondra 
wrote:

> On Wed, Mar 27, 2019 at 02:31:14PM +0100, Fred .Flintstone wrote:
> >Many of these are gone in the modern PostgreSQL, a few remain.
> >https://packages.ubuntu.com/disco/amd64/postgresql-client-11/filelist
> >
> >/usr/lib/postgresql/11/bin/clusterdb
> >/usr/lib/postgresql/11/bin/createdb
> >/usr/lib/postgresql/11/bin/createuser
> >/usr/lib/postgresql/11/bin/dropdb
> >/usr/lib/postgresql/11/bin/dropuser
> >/usr/lib/postgresql/11/bin/pg_basebackup
> >/usr/lib/postgresql/11/bin/pg_dump
> >/usr/lib/postgresql/11/bin/pg_dumpall
> >/usr/lib/postgresql/11/bin/pg_isready
> >/usr/lib/postgresql/11/bin/pg_receivewal
> >/usr/lib/postgresql/11/bin/pg_recvlogical
> >/usr/lib/postgresql/11/bin/pg_restore
> >/usr/lib/postgresql/11/bin/psql
> >/usr/lib/postgresql/11/bin/reindexdb
> >/usr/lib/postgresql/11/bin/vacuumdb
> >
> >Can we rename clusterdb, reindexdb and vacuumdb to carry the pg_ prefix?
> >
>
> I think the consensus in this thread (and the previous ancient ones) is
> that it's not worth it. It's one thing to introduce new commands with the
> pg_ prefix, and it's a completely different thing to rename existing ones.
> That has inherent costs, and as Tom pointed out the burden would fall on
> people using PostgreSQL (and that's rather undesirable).
>
> I personally don't see why having commands without pg_ prefix would be
> an issue. Especially when placed in a separate directory, which eliminates
> the possibility of conflict with other commands.
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
>
>

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: [Patch] Log10 and hyperbolic functions for SQL:2016 compliance

2019-03-18 Thread Komяpa
I really appreciate the addition of tanh into core Postgres.

If someone doubts it is useful: it is used as part of the math in geographical
calculations.

Say you have your cars in a planar Mercator projection and want to move them
"1 second forward by this heading with this speed". That is sin/cos and a
distance along X/Y, but the distance must be scaled properly - and that
scaling coefficient is cosd(latitude), which you don't have directly - you
only have the position in projected meters. If you don't want to fire up
full-featured PostGIS on this hot path, you inline all the formulas together;
the result is nice and small - but it has tanh in it, which I was surprised to
find only in Oracle compatibility extensions. A pure SQL tanh was good enough,
but it kept bugging me :)

Here's the code:
https://github.com/gojuno/lostgis/blob/master/sql/functions/coslat.sql#L21
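
With the new functions in core, the scale factor can be written directly as a
hyperbolic identity (a sketch assuming a spherical Web Mercator with radius
R = 6378137 m: at projected ordinate y, cos(latitude) = 1/cosh(y/R) and
sin(latitude) = tanh(y/R)):

CREATE FUNCTION coslat_sketch(y float8) RETURNS float8 AS $$
  SELECT 1.0 / cosh(y / 6378137.0)
$$ LANGUAGE sql IMMUTABLE;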

On Wed, Mar 13, 2019 at 5:34 PM Lætitia Avrot 
wrote:

> Thanks,  Tom !
>
> Thank you everyone for your help and patience.
>
> Cheers,
>
> Lætitia
>
> Le mar. 12 mars 2019 à 20:57, Tom Lane  a écrit :
>
>> =?UTF-8?Q?L=C3=A6titia_Avrot?=  writes:
>> > So, as you're asking that too, maybe my reasons weren't good enough.
>> You'll
>> > find enclosed a new version of the patch
>> > with asinh, acosh and atanh (v5).
>>
>> Pushed with some minor adjustments (mainly cleanup of the error handling).
>>
>> > Then I tried for several days to implement the 6 last hyperbolic
>> functions,
>> > but I wasn't satisfied with the result, so I just dropped it.
>>
>> Yeah, I agree that sech() and so on are not worth the trouble.  If they
>> were commonly used, they'd be in POSIX ...
>>
>> regards, tom lane
>>
>

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Should we increase the default vacuum_cost_limit?

2019-02-25 Thread Komяpa
I support rising the default.

From the standpoint of a no-clue database admin, it's easier to give more
resources to Postgres and google what a process called "autovacuum" does than
to learn why reads are slow.

It's also tricky that index-only scans depend on a working autovacuum, and
autovacuum never comes to those tables. Since PG11 it's safe to call VACUUM on
a table with indexes, since the index is no longer scanned in its entirety. I
would also propose including "tuples inserted" in the formula for the
autovacuum threshold, the same way it is done for the autoanalyze threshold.
That would fix the situation where you delete 50 rows in a 100-gigabyte table
and autovacuum suddenly goes to rewrite and re-read hint bits on all of it,
since it never touched it before.
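
For anyone following along, this is the kind of change being discussed, here
applied to a running cluster (the value is only an example, not a
recommendation):

ALTER SYSTEM SET vacuum_cost_limit = 2000;
SELECT pg_reload_conf();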

On Mon, Feb 25, 2019 at 8:42 AM David Rowley 
wrote:

> Hi,
>
> I've had to do quite a bit of performance investigation work this year
> and it seems that I only too often discover that the same problem is
> repeating itself... A vacuum_cost_limit that is still set to the 200
> default value along with all 3 auto-vacuum workers being flat chat
> trying and failing to keep up with the demand.
>
> I understand we often keep the default config aimed at low-end
> servers, but I don't believe we should categorise this option the same
> way as we do with shared_buffers and work_mem. What's to say that
> having an auto-vacuum that runs too slowly is better than one that
> runs too quickly?
>
> I have in mind that performance problems arising from having
> auto-vacuum run too quickly might be easier to diagnose and fix than
> the ones that arise from it running too slowly. Certainly, the
> aftermath cleanup involved with it running too slowly is quite a bit
> more tricky to solve.
>
> Ideally, we'd have something smarter than the cost limits we have
> today, something that perhaps is adaptive and can make more use of an
> idle server than we do now, but that sounds like a pretty large
> project to consider having it working this late in the cycle.
>
> In the meantime, should we consider not having vacuum_cost_limit set
> so low by default?
>
> I have in mind something in the ballpark of a 5x to 10x increase. It
> seems the standard settings only allow for a maximum of ~3.9MB/s dirty
> rate and ~7.8MB/s shared buffer miss rate.  That seems pretty slow
> even for the micro SD card that's in my 4-year-old phone.  I think we
> should be aiming for setting this to something good for the slightly
> better than average case of modern hardware.
>
> The current default vacuum_cost_limit of 200 seems to be 15 years old
> and was added in f425b605f4e.
>
> Any supporters for raising the default?
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>
>

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Allowing extensions to find out the OIDs of their member objects

2019-01-22 Thread Komяpa
>
>
> Thoughts?


I have a feeling this is over-engineering in a slightly different direction,
paving the way for a hack to work instead of solving the original problem.

What's currently happening in PostGIS is that there are functions that need
to perform index-based lookups.

Postgres is unable to plan this for functions, only for operators.

Operators have only two sides, while some PostGIS functions have more
arguments - you can't squeeze these into an operator. Well, you can squeeze
two of your parameters into one, but that is ugly too - you end up inventing
some "query" argument type and an alternative grammar instead of SQL (see
tsquery).

ST_DWithin itself is another way of working around the planner limitation by
squeezing something into both sides of an operator, since you don't know which
side of your query is going to have an index. It's not perfect either.

A perfect solution would be a way to perform a clean index scan on
ST_Distance(a.geom, b.geom) < 10, which is what ST_DWithin is trying to
express within the limited logic of "you only have two sides of an operator".

If you need example from another world: imagine jsonb key-value lookup.
It's currently done via

select ... where tags @> '{"highway":"residential"}';

 - which is hard: you have to remember which side the rose should lean
towards, which {} or [] to use, how to quote around the json and inside it,
and more.

A more intuitive way for many programmers to write this is similar to this:

select ... where (tags->>'highway') = 'residential';

 - but this does not end up with an index lookup.

I'd be happy if we could deprecate ST_DWithin in PostGIS and just allow
ST_Distance(a.geom, b.geom) < 10.

ST_Distance is defined in the standard as a function; however, there is an
equivalent operator <-> that exists for the sole purpose of KNN lookups. So,
when you write:

... order by ST_Distance(geom, 'your_position')
 - you're not getting an index scan, and when you write

... order by geom <-> 'your_position'

 - you're getting an index scan, but not by writing what you would intuitively
write knowing that ST_Distance is the standard-defined way to measure the
distance between two spatial objects.

Maybe this can direct you to some other thoughts?


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: [HACKERS] COPY FREEZE and PD_ALL_VISIBLE

2019-01-15 Thread Komяpa
Hello,

Today I bumped into the need to limit the time of the first VACUUM after a
data import. I'm using a utility called osmium together with COPY FREEZE to
import OpenStreetMap data into the database.

osmium export -c osmium.config -f pg belarus-latest.osm.pbf  -v --progress
| psql -1 -c 'create table byosm(geom geometry, osm_type text, osm_id
bigint, tags jsonb);copy byosm from stdin freeze;'

However, the first pass of VACUUM rewrites the whole table. Here are two logs
of VACUUM VERBOSE in a row:

https://gist.github.com/Komzpa/e765c1c5e04623d83a6263d4833cf3a5

In the Russian Postgres Telegram group I was pointed to this thread.
Can the patch be revived? What is needed to get it in for 12?

On Sun, Aug 14, 2016 at 10:37 PM Jeff Janes  wrote:

> On Tue, Nov 3, 2015 at 6:37 AM, Simon Riggs  wrote:
> > On 3 November 2015 at 15:23, Amit Kapila 
> wrote:
> >>
> >> On Fri, Oct 23, 2015 at 6:29 AM, Simon Riggs 
> >> wrote:
> >>>
> >>> On 21 October 2015 at 13:31, Jeff Janes  wrote:
> >>>
>  Index-only scans will visit the heap for each tuple until the first
>  VACUUM is done.
> 
>  The first vacuum will read the entire table, but not need to write it
>  anymore.  And will create the _vm file.
> 
>  I think we really want to create _vm file as well as set
> PD_ALL_VISIBLE,
>  but I don't know the best way to do that.  Set a flag somewhere and
> then
>  create it in bulk at the end of the transaction?  Set it bit by bit
> as the
>  pages are extended and initialized?
> >>>
> >>>
> >>> Easy enough to do it at the end of the COPY FREEZE in one step.
> >>
> >>
> >> Here, we might want to consider that setting bit in visibility map
> >> will generate WAL log whereas Copy Freeze otherwise skip WAL
> >> when wal_level is less than archive.  This can lead to extra disk
> >> writes which can slow down Copy Freeze, but OTOH that might
> >> be acceptable.
> >
> >
> > I'm building the map as I go, in the latest version of this patch I'm
> > working on.
>
> Hi Simon,
>
> Is this still on your radar?  If you would like someone else to pick
> it up, can you post the WIP patch you have?
>
> Thanks,
>
> Jeff
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hack...@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Joins on TID

2018-12-23 Thread Komяpa
Hi,

Writing as someone who used TID joins and group by's in the past.

One use case is having a chance to peek into what a DELETE will do.
A lot of GIS tables don't have any notion of an ID, and dirty datasets tend to
have many duplicates you need to cross-reference with something else. So you
write your query in the form of

CREATE TABLE ttt as (SELECT distinct on (a.ctid) a.ctid as ct, a.field1,
a.field2, b.field3, ... from table a join othertable b on ST_Whatever(a.geom,
b.geom));



DELETE FROM table a USING ttt b where a.ctid = b.ct;
DROP TABLE ttt;

Here:
 - distinct on ctid is used (hash?)
 - a.ctid = b.ct (hash join candidate?)

I know it's all better with proper IDs, but sometimes it works like that,
usually just once per dataset.


сб, 22 дек. 2018 г. в 19:31, Tom Lane :

> Simon Riggs  writes:
> > On Sat, 22 Dec 2018 at 04:31, Tom Lane  wrote:
> >> BTW, if we're to start taking joins on TID seriously, we should also
> >> add the missing hash opclass for TID, so that you can do hash joins
> >> when dealing with a lot of rows.
>
> > I don't think we are trying to do TID joins more seriously, just fix a
> > special case.
> > The case cited requires the batches of work to be small, so nested loops
> > works fine.
> > Looks to me that Edmund is trying to solve the same problem. If so, this
> is
> > the best solution.
>
> No, I think what Edmund is on about is unrelated, except that it touches
> some of the same code.  He's interested in problems like "find the last
> few tuples in this table".  You can solve that today, with e.g.
> "SELECT ... WHERE ctid >= '(n,1)'", but you get a stupidly inefficient
> plan.  If we think that's a use-case worth supporting then it'd be
> reasonable to provide less inefficient implementation(s).
>
> What I'm thinking about in this thread is joins on TID, which we have only
> very weak support for today --- you'll basically always wind up with a
> mergejoin, which requires full-table scan and sort of its inputs.  Still,
> that's better than a naive nestloop, and for years we've been figuring
> that that was good enough.  Several people in the other thread that
> I cited felt that that isn't good enough.  But if we think it's worth
> taking seriously, then IMO we need to add both parameterized scans (for
> nestloop-with-inner-fetch-by-tid) and hash join, because each of those
> can dominate depending on how many tuples you're joining.
>
> regards, tom lane
>
> --
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: What to name the current heap after pluggable storage / what to rename?

2018-12-19 Thread Komяpa
Call it "pile" and "hoard":

https://www.thesaurus.com/browse/heap
https://www.thesaurus.com/browse/pile
https://www.thesaurus.com/browse/hoard

ср, 19 дек. 2018 г. в 07:17, Andres Freund :

> Hi,
>
> The current pluggable table storage patchset [1] introduces the ability
> to specify the access method of a table (CREATE TABLE ... USING
> "ident"). The patchset currently names the current storage method
> (i.e. heapam.c et al) "heap" (whereas e.g. zheap's in named, drumroll,
> zheap).
>
> I'm concerned that naming it heap, and the corresponding functions
> fitting with that name, will be confusing. There's a lot of functions
> that have a heap_ prefix (or similar) that aren't really dependent on
> how the storage works, but much more general infrastructure (consider
> e.g. heap_create_with_catalog()).
>
> One solution is to just live with the confusion, add a few header
> comments to files like src/backend/catalog/heap.c, explaining that
> they're more general than heapam.c (and in the patch heapam_handler.c,
> which mostly dispatches to heapam.c).
>
> Another approach would be to come up with a different, potentially
> witty, name for the current postgres table access method. Then either
> rename a few of the heapam.c et functions, or live with the slightly
> more confusing names.
>
> Another would be to be aggressive in renaming, and deconflict by
> renaming functions like heap_create[_with_catalog] etc to sound more
> accurate. I think that has some appeal, because a lot of those names
> aren't describing their tasks particularly well.
>
> Greetings,
>
> Andres Freund
>
> [1]
> https://postgr.es/m/20180703070645.wchpu5muyto5n647%40alap3.anarazel.de
>
> --
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: zheap: a new storage format for PostgreSQL

2018-11-19 Thread Komяpa
>
> > In PostGIS workloads, UPDATE table SET geom = ST_CostyFunction(geom,
> magicnumber); is one of biggest time-eaters that happen upon initial load
> and clean up of your data. It is commonly followed by CLUSTER table using
> table_geom_idx; to make sure you're back at full speed and no VACUUM is
> needed, and your table (usually static after that) is more-or-less
> spatially ordered. I see that zheap can remove the need for VACUUM, which
> is a big win already. If you can do something that will allow reorder of
> tuples according to index happen during an UPDATE that rewrites most of
> table, that would be a game changer :)
> >
>
> If the tuples are already in the order of the index, then we would
> retain the order, otherwise, we might not want to anything special for
> ordering w.r.t index.  I think this is important as we are not sure of
> the user's intention and I guess it won't be easy to do such
> rearrangement during Update statement.
>

The user's clustering intention is recorded in the existence of a CLUSTER
index on the table. That's not used by anything other than the CLUSTER command
now, though.
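
That intention lives in the catalog and could be read from there; a sketch,
with a made-up table name:

SELECT indexrelid::regclass
FROM pg_index
WHERE indrelid = 'mytable'::regclass AND indisclustered;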

When I was looking into the current heap implementation, it seemed possible to
hook in a lookup of a couple of blocks holding values adjacent to the new
value, and prefer them to the FSM lookup and the "current page", for a
clustered table. Due to dead tuples, free space runs out very quickly in the
usual heap, so it probably doesn't make sense there - you're consuming space
with the old tuple in the old page and the new one in the new page.

If I understand correctly, in zheap an update would not result in a dead tuple
in the old page, so space is not going to run out immediately, and this may
unblock the path for such further developments. That is, if there is a spot in
the code where such or similar logic could be plugged in :)

I've described the business case in [1].

1:
https://www.postgresql.org/message-id/flat/CAC8Q8tLBeAxR%2BBXWuKK%2BHP5m8tEVYn270CVrDvKXt%3D0PkJTY9g%40mail.gmail.com

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: zheap: a new storage format for PostgreSQL

2018-11-18 Thread Komяpa
On Sat, Nov 17, 2018 at 8:51 AM Adam Brusselback 
wrote:

> >  I don't know how much what I write on this thread is read by others or
> how useful this is for others who are following this work
>
> I've been following this thread and many others like it, silently soaking
> it up, because I don't feel like i'd have anything useful to add in most
> cases. It is very interesting seeing the development take place though, so
> just know it's appreciated at least from my perspective.
>


I'm also following the development and have high hopes for it going forward.
There are not many low-level details I can comment on, though :)

In PostGIS workloads, UPDATE table SET geom = ST_CostyFunction(geom,
magicnumber); is one of the biggest time-eaters during the initial load and
clean-up of your data. It is commonly followed by CLUSTER table USING
table_geom_idx; to make sure you're back at full speed and no VACUUM is
needed, and your table (usually static after that) is more-or-less spatially
ordered. I see that zheap can remove the need for VACUUM, which is a big win
already. If you can do something that allows a reordering of tuples according
to an index to happen during an UPDATE that rewrites most of the table, that
would be a game changer :)

Another story is the Visibility Map and Index-Only Scans. Right now there is a
huge gap between the insertion of rows and the moment they are available for
an index-only scan, as a VACUUM is required. Do I understand correctly that
for zheap this can all be inverted, and UNDO can become an "invisibility map"
that may be quite small and discarded quickly?




-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Changing SQL Inlining Behaviour (or...?)

2018-11-14 Thread Komяpa
>
>
>> At pgconf.eu, I canvassed this problem and some potential solutions:
>>
>

I wonder if there is a middle ground between #2 and #3. A proper mechanism for
deduplicating entries might be hard, but at the inlining stage we already know
they're going to get duplicated. Can we make a subplan/lateral join/whatever
for the arguments and deduplicate just them, as we already know about that
duplication from the structure?

Another thing I see here is PostGIS using indexes. Can we just always inline
if we see an index-accelerated operator (or just an operator) at the top level
AND inside the inlined function?
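
To make the duplication concrete, the PostGIS-style wrapper in question looks
roughly like this (a simplified sketch, not the exact definition - the real
one also carries a COST and schema-qualified operators):

CREATE FUNCTION st_dwithin_sketch(g1 geometry, g2 geometry, d float8)
RETURNS boolean AS $$
  SELECT g1 && ST_Expand(g2, d)
     AND g2 && ST_Expand(g1, d)
     AND _ST_DWithin(g1, g2, d)
$$ LANGUAGE sql IMMUTABLE;

Each argument is referenced several times in the body, so naive inlining
duplicates whatever expression was passed in, which is exactly what solution
#3 would avoid.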



> * Solution #1 - Quick and dirty and visible: Add an 'INLINE' function
>> decorator, which tells PostgreSQL to just ignore costs and inline the
>> function regardless. Pros: it's not too hard to implement and I'm happy to
>> contribute this. Cons: it adds very specific single-purpose syntax to
>> CREATE FUNCTION.
>>
>> * Solution #2 - Quick and dirty and invisible. Tom suggested a hack that
>> achieves the aims of #1 but without adding syntax to CREATE FUNCTION: have
>> the inlining logic look at the cost of the wrapper and the cost of
>> parameters, and if the cost of the wrapper "greatly exceeded" the cost of
>> the parameters, then inline. So the PostGIS project would just set the cost
>> of our wrappers very high, and we'd get the behaviour we want, while other
>> users who want to use wrappers to force caching of calculations would have
>> zero coded wrapper functions.  Pros: Solves the problem and easy to
>> implement, I'm happy to contribute. Cons: it's so clearly a hack involving
>> hidden (from users) magic.
>>
>> * Solution #3 - Correct and globally helpful. When first presented with
>> this problem last year, both Andres and Tom said [1] "but the right fix is
>> to avoid the double-calculation of identical entries in the target list"
>> because then it would be safe to inline functions with duplicate expensive
>> parameters. This would not only address the proximate PostGIS problem but
>> make a whole class of queries faster. There was some discussion of this
>> approach last week [2]. Pros: The right thing! Improves a whole pile of
>> other performance cases. Cons: Hard! Only experienced PgSQL developers need
>> apply.
>>
>> Naturally, I would love to see #3 implemented, but there's only so much
>> experienced developer time to go around, and it's beyond my current skill
>> set. I would like to be able to start to improve PostGIS parallelism with
>> PgSQL 12, so in order to make that not impossible, I'd like to implement
>> either #1 or #2 in case #3 doesn't happen for PgSQL 12.
>>
>> So my question to hackers is: which is less worse, #1 or #2, to implement
>> and submit to commitfest, in case #3 does not materialize in time for PgSQL
>> 12?
>>
>
> Absent any preferences, I would be inclined to go with #2, having a high
> personal tolerance for ugly hacks... :)
>
> P
>
>
>
>>
>> [1]
>> https://www.postgresql.org/message-id/20171116182208.kcvf75nfaldv36uh%40alap3.anarazel.de
>> [2]
>> https://www.postgresql.org/message-id/10355.1540926295%40sss.pgh.pa.us
>>
>> --
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Parallel threads in query

2018-11-01 Thread Komяpa
>
>
> Because you said "faster than reasonable IPC" - which to me implies that
> you don't do full blown IPC. Which using threads in a bgworker is very
> strongly implying. What you're proposing strongly implies multiple
> context switches just to process a few results. Even before, but
> especially after, spectre that's an expensive proposition.
>
>
To give some idea of what it could look like:

a)
PostGIS has the ST_ClusterKMeans window function. It collects all geometries
passed to it into memory, re-packs them into a more compact buffer and starts
a loop that goes over it several (let's say 10..100) times. Then it spits out
the assigned cluster number for each of the input rows.

It's all great when you need to calculate KMeans of 200-5 rows, but for
a million input rows even a hundred passes on a single core are painful.

b)
PostGIS has the ST_Subdivide function. It takes a single row of geometry
(usually super-large, like a continent or the whole of Russia) and splits it
into many rows that have a simpler shape, by performing horizontal or vertical
splits recursively. Since it's a tree traversal, it can be parallelized
efficiently, with one task following the right subpart of the geometry and
another following the left part.

Both seem to be a standard thing for OpenMP, which has compiler support in
GCC, clang and MSVC. For an overview of how it works, have a look here:
https://web.archive.org/web/20180828151435/https://bisqwit.iki.fi/story/howto/openmp/


So, do I understand correctly that I need to start a parallel worker that
does nothing for each thread I launch to consume the parallel worker limit?
-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Parallel threads in query

2018-11-01 Thread Komяpa
>
> In theory, simulating such global limit should be possible using a bit
> of shared memory for the current total, per-process counter and probably
> some simple abort handling (say, just like contrib/openssl does using
> ResourceOwner).
>

I would expect that this limit is already available, and it's the parallel
worker limit. Basically, when starting a new thread I would like to somehow
consume a part of the parallel worker limit - a thread is a kind of parallel
worker, from the user's perspective. If I have 4 cores and Postgres has
already started 4 parallel workers, I don't really want to start 4 threads for
each of them, or 4 for one of them and 1 for each of the rest, which is what
could happen if I managed that separately from the parallel worker limit.

IPC and co. - that's another question and out of scope for this one. Since
OpenMP allows writing multithreaded code by just adding #pragma annotations
around loops, I don't want to reinvent that part of the infrastructure.
-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Parallel threads in query

2018-10-31 Thread Komяpa
Hi,

I've tried porting some of the PostGIS algorithms to utilize multiple cores
via OpenMP so that they return faster.

The question is, what's the best policy for allocating cores so we can play
nice with the rest of Postgres?

What I'd like to see is some function that I can call to get the number of
threads I'm allowed to run, that will also advise the rest of Postgres not to
use them, and a function to return the cores back (or do it automatically at
the end of the query). Is there infrastructure for that?
-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: INSTALL file

2018-10-29 Thread Komяpa
>
> > That is not the first file people looking at. Especially not people
> looking
> > at the GitHub copy:
> >
> > https://github.com/postgres/postgres
> >
> > I understand that there is documentation, but for the casual developer
> > looking at this, it seems broken.
>
> FWIW, I think that people depend too much on github and what github
> thinks projects should do to be more presentable, like adding a
> markdown-style README or such.
>

If you push a README.md file formatted in Markdown, GitHub/GitLab will show
it instead of the plain README.

That's the way it is done in PostGIS, with badges and such shown for GitHub
only:
https://github.com/postgis/postgis
-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: JIT breaks PostGIS

2018-07-23 Thread Komяpa
Hello,

Mon, Jul 23, 2018 at 8:13, Andres Freund :

> Hi,
>
> On 2018-07-21 23:14:47 +0300, Darafei "Komяpa" Praliaskouski wrote:
> >
> > I suspect that a fix would require to bisect llvm/clang version which
> stops
> > showing this behavior and making it a new minimum for JIT, if this is
> not a
> > symptom of bigger (memory management?) problem.
>
> It looks to me like it's a LLVM issue, specifically
> https://bugs.llvm.org/show_bug.cgi?id=34424
> fixed in LLVM 5+.
>

Thank you for your investigation.


> It'll only be an issue for extensions that throw c++ style exceptions. I
> don't think that rises to the level of disallowing any LLVM version <
> 5.0. I suggest postgis adds an error check to its buildprocess that
> refuses to run if jit is enabled and a too old version is used?
>

Unfortunately that's not an option.

Debian Stretch amd64 is quite a popular platform, and it is supported by
Postgres 10 / PostGIS 2.4.

If Christoph decides to keep LLVM enabled for 11 on that platform, as Tom
recommended upthread, that would mean PostGIS can't be packaged there at all,
and all the people who need it will have to stay on Postgres 10.

If PostGIS decides not to implement the check, and instead tweaks its test
runner to execute `set jit to off;` before tickets.sql, then Postgres 11 on
that platform will have a known way to be segfaulted, even without superuser
rights, as the jit GUC is tweakable by anyone.

I think a good way to deal with it would be to bump the minimum required LLVM
version to 5 for Postgres builds, with a notice in the docs saying "you can
build with a lower version, but that will expose you to this and that bug".

It may also happen that a patch for LLVM can be applied to the LLVM 4 build
in Debian and shipped as an update, but such a plan should not be the default
one.
-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: JIT breaks PostGIS

2018-07-21 Thread Komяpa
Hi,

Here's a somewhat minimized example.

https://gist.github.com/Komzpa/cc3762175328ff5d11de4b972352003d

You can put this file at regress/jitbug.sql in the PostGIS code tree and run
it after building PostGIS:

perl regress/run_test.pl regress/jitbug.sql --expect
perl regress/run_test.pl regress/jitbug.sql


Sat, Jul 21, 2018 at 23:39, Andres Freund :

> Hi,
>
> On 2018-07-21 22:36:44 +0200, Christoph Berg wrote:
> > Re: Andres Freund 2018-07-21 <
> 20180721202543.ri5jyfclj6kb6...@alap3.anarazel.de>
> > > Could you attempt to come up with a smaller reproducer?
> >
> > The original instructions in
> > https://trac.osgeo.org/postgis/ticket/4125 are fairly easy to follow;
> > there's also a postgresql-11-postgis-2.5{,-scripts} package (not
> > mentioned in there) that exhibits the bug.
>
> Sure, but a more minimal example (than a 1kloc regression script, wiht
> possible inter statement dependencies) still makes the debugging
> easier...
>
> Greetings,
>
> Andres Freund
>
-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


JIT breaks PostGIS

2018-07-21 Thread Komяpa
Hi,

Today I spent some time closing PostGIS tickets in preparation for Monday's
release.

One of the blockers, https://trac.osgeo.org/postgis/ticket/4125, was filed by
Postgres APT repository maintainer Christoph Berg, who noticed a test suite
failure on Debian Stretch with Postgres 11.

Upon investigation we found:
 - A build for Ubuntu Bionic installed on Debian Stretch passes the suite,
requiring llvm6;
 - A build for Debian Stretch fails the suite on a call to the external
library GEOS, showing no traces of JIT in the stacktrace;
 - Setting jit=off lets the suite pass;
 - The query, called by itself in a clean session, does not crash Postgres;
the queries above it are required to reproduce the crash;
 - The crash affects not only Stretch, but also a custom-built Postgres 11 /
clang 3.9 on Travis CI running Ubuntu Trusty:
https://github.com/postgis/postgis/pull/262.

I suspect that a fix would require bisecting for the llvm/clang version at
which this behavior stops and making it the new minimum for JIT, if this is
not a symptom of a bigger (memory management?) problem.

Thank you!
-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: [HACKERS] GUC for cleanup indexes threshold.

2018-06-26 Thread Komяpa
Tue, Jun 26, 2018 at 15:42, Alexander Korotkov :

> On Tue, Jun 26, 2018 at 1:46 PM Masahiko Sawada 
> wrote:
> > On Fri, Jun 22, 2018 at 6:55 PM, Alexander Korotkov
> >  wrote:
> > > So, I propose to just
> > > increase maximum value for both GUC and reloption.  See the attached
> > > patch.  It also changes calculations _bt_vacuum_needs_cleanup() for
> > > better handling of large values (just some kind of overflow paranoia).
> >
> > The patch looks good to me.
>
> Pushed, thanks!
>

Thank you for the enhancement. Now Index Only Scans over append-only tables
in Postgres can be made to work, even if that requires manually kicking off
VACUUM on a large table, and that's a great improvement for moving-object
databases. :)
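In practice the "manual kicking" is just a periodic plain VACUUM on the
append-only table (table name hypothetical), which sets the visibility map
bits so index-only scans stop visiting the heap:

vacuum (verbose) positions;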

My eye caught another thing: the error message in the tests is:

DETAIL:  Valid values are between "0.00" and
"179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00".

a) do we really need to print all the digits of DBL_MAX? "Valid values are
double precision, non-negative", perhaps?
b) double precision binary-to-decimal noise usually starts around the 16th
digit. Why does it stop at the decimal point here, so that we get a precise
".00"? Does that bite the conversion somewhere else too?


Re: [HACKERS] GUC for cleanup indexes threshold.

2018-06-16 Thread Komяpa
Hi!

It is cool to see this in Postgres 11. However:


> 4) vacuum_cleanup_index_scale_factor can be set either by GUC or
> reloption.
> Default value is 0.1.  So, by default cleanup scan is triggered after
> increasing of
> table size by 10%.
>

vacuum_cleanup_index_scale_factor can only be set to a maximum of 100.
I imagine that on a large append-only table, on storage with an IOPS budget,
I might want to never perform a full scan of the index.
Roughly, with the parameter set to 100, if we vacuum the table for the first
time when it has 1 tuple and 130-byte-wide rows, we'll trigger a full scan at
130 bytes, 12 kbytes, 1.2 MB, 123 MB, 12 GB, 1.2 TB.

If we happen to perform the first vacuum when there are 4 tuples in the
table, that becomes 52 kB, 5 MB, 495 MB, 48 GB - and both the 12 GB and 48 GB
scans will exhaust any storage spike IOPS budget, slowing everything down
rather suddenly.

Can the upper limit for this GUC be lifted, or can it get a value that means
"never"?
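For concreteness, here is a rough sketch of the per-index knob I have in
mind, using the reloption form mentioned above (the index name is made up);
today anything beyond 100 is rejected:

-- allowed today: a cleanup-only scan after the table grows ~100x
alter index positions_id_ts_idx
    set (vacuum_cleanup_index_scale_factor = 100);

-- what I'd like to be able to say: "practically never"
alter index positions_id_ts_idx
    set (vacuum_cleanup_index_scale_factor = 1000000);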


Re: late binding of shared libs for C functions

2018-06-12 Thread Komяpa
>
>  >> The real question is why check_function_bodies doesn't cover this;
>  >> there's a comment in fmgr_c_validator that this is deliberate, but it's
>  >> rather unclear what the advantage is supposed to be:
>
>  Tom> Error detection, ie did you spell the C symbol name correctly.
>
> Right, but surely restoring a dump is not the place to be doing that
> error check?
>


A similar check also happens during pg_upgrade in link mode. It would be
more useful for it to attempt to upgrade the extension when it hits a
missing-function or missing-library error.


Re: late binding of shared libs for C functions

2018-06-12 Thread Komяpa
This thing also bites PostGIS upgrades.

When the distro's packaging system decides to upgrade PostGIS, or both
Postgres and PostGIS at the same time, you can often end up in a situation
where you only have one version of the PostGIS .so installed, and it's not
the one referenced in the catalog. There are workarounds that tell you to
symlink the newer PostGIS .so into the spot where the older one is being
looked for, and then run ALTER EXTENSION ... UPDATE to get out of this
inconsistent state.

This also means PostGIS has to ship stubs for all the functions that should
have been deleted, but may still be needed during such a hacky upgrade.
For example:
https://github.com/postgis/postgis/blob/16270b9352e84bc989b9b946d279f16e0de5c2b9/postgis/lwgeom_accum.c#L55


Tue, Jun 12, 2018 at 13:48, Geoff Winkless :

> Hi All
>
> Is it possible to use CREATE FUNCTION to link a shared library that
> doesn't yet exist? I don't think it is, but I might be missing
> something.
>
> If not, would it be something that people would be open to a patch
> for? I'm thinking of e.g.
>
> CREATE [ OR REPLACE ] FUNCTION
> name ( [ [ argmode ] [ argname ] argtype [ { DEFAULT | = }
> default_expr ] [, ...] ] )
> [ RETURNS rettype
>   | RETURNS TABLE ( column_name column_type [, ...] ) ]
>   { LANGUAGE lang_name
> | TRANSFORM { FOR TYPE type_name } [, ... ]
> | WINDOW
> | IMMUTABLE | STABLE | VOLATILE | [ NOT ] LEAKPROOF
> | CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT | STRICT
> | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
> | PARALLEL { UNSAFE | RESTRICTED | SAFE }
> | COST execution_cost
> | ROWS result_rows
> | SET configuration_parameter { TO value | = value | FROM CURRENT }
> | AS 'definition'
> -| AS 'obj_file', 'link_symbol'
> +| AS 'obj_file', 'link_symbol' [UNBOUNDED]
>  } ...
> [ WITH ( attribute [, ...] ) ]
>
>
> (I know UNBOUNDED isn't quite the word - BINDLATE would be better -
> but I figured I should try to use an existing reserved keyword...)
>
> We run our SQL scripts before we install binaries (because our
> binaries are started by the installer, so having the database in place
> is a Good Thing). The binary installer includes the .so. We're now
> stuck in a catch-22 where I can't run the SQL script because it
> requires the .so to be in place, but I can't run the binary installer
> because if I do the SQL won't be updated...
>
> This specific problem is obviously workaround-able, but it occurred to
> me that since the libraries are bound late anyway it seems like this
> wouldn't cause any serious problems.
>
> Of course chances are I've missed something...
>
> Geoff
>
>


Re: psql leaks memory on query cancellation

2018-04-23 Thread Komяpa
>
> Therefore, I propose the attached patch, which simply sees to it that
> we discard any partial query result at the start of error message
> collection not the end.  This makes the behavior very much better,
> at least on Linux.
>

I have tested a build of Postgres with the attached patch and confirm that I
no longer see this problematic behavior, even with the default allocator.
Thank you!

I think this is a back-patchable bug fix; certainly so at least back
> to 9.6 where \errverbose was added.  Versions before that do not show
> the persistent memory bloat the OP is complaining of, so that what
> we have here is arguably a performance regression.  Comments?
>

This should not cause a regression, since the memory is freed anyway, but it
is valuable because it puts less pressure on client memory for manual data
inspection workflows.

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: Is a modern build system acceptable for older platforms

2018-04-19 Thread Komяpa
>
> The above is all about getting the build system to work at all. If that
> isn't a showstopper there's a subsequent discussion to be had about older
> platforms where one could get the build system to work but convenient
> packages are missing. For example not even RHEL7 has any Python3 packages
> in the base system (it does in Software Collections though) which means
> some extra hoops on getting meson running there. And RHEL5 is in an even
> worse spot as it has no Software Collections, who knows if Python 3 builds
> on it out of the box etc.
>

I would expect that a new version of a piece of software should not target
platform versions that are past the end of full support. Per
https://access.redhat.com/support/policy/updates/errata currently only RHEL7
is in Full Support, and RHEL5 is already past Product Retirement. I would say
it's fine to keep supporting these on already-released branches, while
limiting new major versions to platforms still under full support.

PostGIS has several forks that move it towards CMake (five-year-old ticket
https://trac.osgeo.org/postgis/ticket/2362, forks
https://github.com/nextgis-borsch/postgis,
https://github.com/mloskot/postgis/tree/cmake-build) - none of these are
upstream, mostly because there's an expectation to match the Postgres build
system. If Postgres moved to CMake (there are already CMake-enabled forks
available for people who need them), then I expect PostGIS to quickly catch
up.

A lot of the libraries PostGIS depends on are already built using CMake, so
if a platform has a recent PostGIS, it has CMake available somehow.

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: Covering GiST indexes

2018-04-12 Thread Komяpa
> Another thing that could be done for PostGIS geometries is just another
> opclass which
> stores geometries "as is" in leafs.  As I know, geometries contain MBRs
> inside their
> own, so there is no need to store extra MBR.  I think the reason why
> PostGIS
> doesn't have such opclass yet is that geometries are frequently large and
> easily
> can exceed maximum size of index tuple.
>

The geometry datatype layout was designed with TOASTing in mind: the most
useful data is stored in the header, including the type, SRID, bounding box
and some other flags, so fetching just the first several bytes already tells
you a lot.

PostGIS datasets often have mixed binary lengths: among buildings, for
example, it is quite common to have a lot of four-corner houses, and just one
mapped as a circle that the digitizing software decided to represent as a
720-point polygon. Since reading that from TOAST for an index would require a
seek of some kind, it may be just as efficient to look it up in the table.

This way a lossy decompress function can help with index-only scans: up to
some binary length, try to store the original geometry in the index. Beyond
that, store a shape that has fewer points but covers a slightly larger area,
plus a flag saying it's not precise. There are a lot of ways to generate a
covering shape with fewer points: the obvious one is a box; less obvious ones
are a non-axis-aligned box, a collection of boxes for a multipart shape, an
outer ring for an area with lots of holes, a convex hull, or a smallest
enclosing k-gon.

In GIS there is the classic problem of the border of Russia: the country
crosses the 180th meridian and has a complex border shape. If you take its
bounding box, it spans from -180 to 180. If you query any spot in the US or
in Europe, you'll have it intersecting with your area, requiring a full
recheck, a complete detoast and decompression, and then "no, it's not a thing
we need, skip". Allowing anything better than a box would help. And if we're
allowing a complex shape - we've already got the container for it, geometry.

If we're storing the geometry in the index and the original is small, why not
allow a complete Index Only Scan on it, and let it skip the recheck? :)

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: psql leaks memory on query cancellation

2018-04-12 Thread Komяpa
>
>
> > Is it expected behavior (so I can have a look at something server
> returned
> > somehow and it's kept there for me), or a plain leak?
>
> This is totally normal behaviour for any C program.
>

Thanks Konstantin and Craig for the help.

To mitigate the issue I've changed the allocator on my laptop to jemalloc.
For a single psql run on my Ubuntu system:

sudo apt install libjemalloc1
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 psql

A global replacement, by putting /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
into /etc/ld.so.preload, also had a positive effect, fixing this behavior in
all applications and reducing the after-boot memory footprint from 7 GB to
3 GB.

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


psql leaks memory on query cancellation

2018-04-12 Thread Komяpa
Hi,

psql (PostgreSQL) 10.3

Here are the steps to reproduce a leak:

1. connect to a 10.3 server and run a query similar to:

select 'message' || generate_series(1,10);

2. while monitoring psql memory usage in htop or a similar tool, press Ctrl+C
at a point where you can clearly distinguish a psql with a big allocated
buffer from a psql without it.

3. observe that the query is cancelled, but psql memory usage stays the same.

This is especially painful when the query you're debugging has a runaway join
condition, and you only realize it after the query fails to return within
seconds as you expected.

Is this expected behavior (i.e. can I somehow look at whatever the server
returned that is being kept for me), or a plain leak?

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: [HACKERS] [PATCH] Incremental sort

2018-03-21 Thread Komяpa
Hi,

On a PostGIS system tuned to prefer parallelism heavily
(min_parallel_table_scan_size=10kB) we hit issues with the QGIS table
discovery query when this patch is applied.

The failing query is:
[local] gis@gis=# SELECT
l.f_table_name,l.f_table_schema,l.f_geometry_column,upper(l.type),l.srid,l.coord_dimension,c.relkind,obj_description(c.oid)
FROM geometry_columns l,pg_class c,pg_namespace n WHERE
c.relname=l.f_table_name AND l.f_table_schema=n.
nspname AND n.oid=c.relnamespace AND
has_schema_privilege(n.nspname,'usage') AND
has_table_privilege('"'||n.nspname||'"."'||c.relname||'"','select') AND
l.f_table_schema='public' ORDER BY n.nspname,c.relname,l.f_geometry_column;

ERROR:  XX000: badly formatted node string "INCREMENTALSORT :startup_cost
37"...
CONTEXT:  parallel worker
LOCATION:  parseNodeString, readfuncs.c:2693
Time: 42,052 ms


Query plan:


QUERY PLAN


──
Sort  (cost=38717.21..38717.22 rows=1 width=393)
  Sort Key: c_1.relname, a.attname
  ->  Nested Loop  (cost=36059.35..38717.20 rows=1 width=393)
->  Index Scan using pg_namespace_nspname_index on pg_namespace n
(cost=0.28..2.30 rows=1 width=68)
  Index Cond: (nspname = 'public'::name)
  Filter: has_schema_privilege((nspname)::text, 'usage'::text)
->  Nested Loop  (cost=36059.08..38714.59 rows=1 width=407)
  ->  Nested Loop Left Join  (cost=36058.65..38712.12 rows=1
width=334)
Join Filter: ((s_2.connamespace = n_1.oid) AND
(a.attnum = ANY (s_2.conkey)))
->  Nested Loop Left Join  (cost=36058.51..38711.94
rows=1 width=298)
  Join Filter: ((s_1.connamespace = n_1.oid) AND
(a.attnum = ANY (s_1.conkey)))
  ->  Nested Loop  (cost=36058.38..38711.75 rows=1
width=252)
Join Filter: (a.atttypid = t.oid)
->  Gather Merge  (cost=36057.95..38702.65
rows=444 width=256)
  Workers Planned: 10
  ->  Merge Left Join
(cost=35057.76..37689.01 rows=44 width=256)
Merge Cond: ((n_1.oid =
s.connamespace) AND (c_1.oid = s.conrelid))
Join Filter: (a.attnum = ANY
(s.conkey))
->  Incremental Sort
(cost=37687.19..37687.30 rows=44 width=210)
  Sort Key: n_1.oid, c_1.oid
  Presorted Key: n_1.oid
  ->  Nested Loop
(cost=34837.25..37685.99 rows=44 width=210)
->  Merge Join
(cost=34836.82..34865.99 rows=9 width=136)
  Merge Cond:
(c_1.relnamespace = n_1.oid)
  ->  Sort
(cost=34834.52..34849.05 rows=5814 width=72)
Sort
Key: c_1.relnamespace
->
Parallel Seq Scan on pg_class c_1  (cost=0.00..34470.99 rows=5814 width=72)

Filter: ((relname <> 'raster_columns'::name) AND (NOT
pg_is_other_temp_schema(relnamespace)) AND has_table_privilege(oid,
'SELECT'::text) AND (relkind = ANY ('{r,v,m,f,p}'::"char"[])))
  ->  Sort
(cost=2.30..2.31 rows=1 width=68)
Sort
Key: n_1.oid
->
Index Scan using pg_namespace_nspname_index on pg_namespace n_1
(cost=0.28..2.29 rows=1 width=68)

Index Cond: (nspname = 'public'::name)
->  Index Scan
using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.43..200.52
rows=11281 width=78)
  Index Cond:
(attrelid = c_1.oid)
  Filter: (NOT
attisdropped)
->  Sort  (cost=1.35..1.35
rows=1 width=77)
  Sort Key: s.connamespace,
s.conrelid
  ->  Seq Scan on
pg_constraint s  (cost=0.00..1.34 rows=1 width=77)
Filter: (consrc ~~*
'%geometrytype(% = %'::text)
->  Materialize  (cost=0.42..2.45 rows=1
width=4)

Re: Cast jsonb to numeric, int, float, bool

2018-03-12 Thread Komяpa
Hi Tom,


> I hadn't been following this thread particularly, but I happened to notice
> this bit, and I thought I'd better pop up to say No Way.  There will be
> *no* implicit casts from json to any numeric type.  We have learned the
> hard way that implicit cross-category casts are dangerous.
>

I can see how a cast from text could be problematic, especially with the
'unknown' type in the picture.

But what would be the failure scenario if we had an implicit cast from the
jsonb datatype (which has likely already parsed the number internally, or
knows it holds a non-numeric value) to numeric, one that raises an error if
it can't perform the conversion?

The issue, I think, is that jsonb is special because it is in fact a
container. We've got dot-notation to unpack things from a composite type, and
we've got arrow-notation to unpack things from jsonb, but the result of such
unpacking cannot take part in further computations, even though it looks
visually compatible.

What would the other options be, if not an implicit cast?

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: All Taxi Services need Index Clustered Heap Append

2018-03-06 Thread Komяpa
Tue, Mar 6, 2018 at 4:57, Craig Ringer <cr...@2ndquadrant.com>:

> On 3 March 2018 at 00:30, Darafei "Komяpa" Praliaskouski <m...@komzpa.net>
> wrote:
>
>
>> I gave this all some thought and it looks like it all could have not
>> happened if Postgres was able to cluster heap insertions by (id, ts) index.
>> We're ok with synchronuous_commit=off, so amplified write won't immediately
>> hit disk and can get cooled down in progress. Clustering doesn't require
>> perfect sorting: we need to minimize number of pages fetched, it's ok if
>> the pages are not consecutive on disk.
>>
>
> I'm surprised nobody has mentioned BRIN yet.
>
> Ever since BRIN was introduced, I've thought it would be very interesting
> to use it + the freespace map for coarse-grained tuple routing in heap
> inserts. Try to find the best-match range with BRIN and look for free space
> there. I think there was discussion of this early on, so you may want to
> look up the BRIN threads.
>

It can be done using any indexing mechanism that can get you a list of pages
containing sibling tuples for the current one. For btree and GiST I see it as
inserting the tuple into the index first (or looking at the page the
insertion is going to land on) and then looking at the page numbers of the
tuples in the same index page. Something similar can work for SP-GiST. For
BRIN, it's just the way you say.

I'm not sure how it could be implemented for GIN. Probably by choosing the
emptiest page among all the ones referenced by the posting lists the inserted
row would participate in, in a cuckoo-hashing manner?


> The main challenge probably comes when a range is exhausted. Your writes
> would start spilling over into new heap pages and get intermixed again.
> Some support for pre-allocating new ranges would be needed.
>

This can potentially be solved provided there is a mechanism to move tuples
between pages while updating all the indexes. Then autovacuum could be taught
to move tuples between ranges to make the ranges smaller. I can see how we
could survive without such a mechanism for exact indexes like GiST or btree,
but not for BRIN, at least not at the start.


Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: All Taxi Services need Index Clustered Heap Append

2018-03-06 Thread Komяpa
Mon, Mar 5, 2018 at 19:48, Ants Aasma <ants.aa...@eesti.ee>:

> On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski
> <m...@komzpa.net> wrote:
> >> This approach mixes well with hash
> >> partitioning. It would be neat indeed if PostgreSQL do something
> >> equivalent on its own, and pluggable storage work being done could
> >> enable index organized tables that would help. But you probably need
> >> something right now.
> >
> >
> > Fixing glaring issues (no vacuum and thus no Index-Only Scan on
> append-only
> > tables, vacuum processing all of the eternity of btree) by 11 will get
> most
> > of spike-nails out of the microservice code, and we can probably live
> with
> > them until 11 gets to RDS.
> >
> > I also don't see why a pluggable storage is a must for the clustered
> write.
> > Postgres does have a mechanism for selecting the next page to write tuple
> > to, right now it's just looking at FSM - but what if it just peeked at
> > existing index that already has enough the data to route tuple to correct
> > page on write?
>
> The mechanism you outlined would likely work for your use case, but it
> has many issues that prevent it from being universally useful. From
> the top of my head:
>
> * One extra index descent per insertion (I/O for this is necessary
> anyway, but CPU work is duplicated).
>

This part is going to be needed for any version of such a mechanism.
It is only going to be enabled for CLUSTERed tables, so most people won't be
affected.


> * We don't currently track the amount of bloat. A mechanism that does
> this needs to be added.
>

We've got the number of tuples and the number of pages in a relation. We know
we're degraded when the number of tuples approaches the number of pages. That
seems like enough for a good-enough cutoff heuristic.
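A minimal sketch of such a check over the statistics we already have (the
table name is hypothetical):

select relname,
       reltuples,
       relpages,
       reltuples / greatest(relpages, 1) as avg_tuples_per_page
from pg_class
where relname = 'positions';
-- when avg_tuples_per_page drifts down towards 1, pages are mostly
-- empty and the clustered insertion has degraded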


> * If table hits the bloat limit there will be a sudden change in
> behavior. This is pretty nasty from an operations point of view.
>

We can omit the bloat limit for a PoC implementation and see if that is
enough. It may turn out that there are other ways to compute sibling pages
that handle both ascending and descending insertion.


> * With your (id,ts) clustering and data coming in mostly ordered by
> timestamp, after initial warmup, each page will contain rows from a
> single id, but different ids are arbitrarily interleaved. This is
> better than current state, but people might want to have an
> interleaving step bigger than 8kB to better utilize storage hardware.
>

8 kB is what a btree index offers in clustered fashion, and people are okay
with that, aren't they?

When thinking about this I asked myself: "OK, how can we make the current
heap behave like a perfectly-clustered btree, given that we've got a
perfectly-clustered btree nearby but cannot use a split operation, since we
cannot move a tuple between pages?" This was the best I could think of, for
now. Do you have a better idea? :)


> * It seems that with a common (ts) clustering and age of timestamp
> coming from an exponential distribution, this will quickly bloat to
> threshold and then insert data in a rather arbitrary order. This is
> much worse than the default behavior.
>

Can you please explain how that is going to happen? I expect such a process
to quickly start reusing previous pages and never reach the threshold, since
a page that has enough space to hold the newly arriving tuple will be among
the list of pages obtained from the index. The clustering will likely not be
perfect, but it will reorder at least part of the tuples.

Please note that every tuple placed in the heap via the FSM is going to drag
the tuples following it onto that page, until the limiting mechanism kicks in
because that page runs out of space.

Providing an out-of-the-box solution in core PostgreSQL would of
> course be best, but


Let's stop right here. We need an out-of-the-box solution in core; that would
be best, period. I'll happily discuss out-of-Postgres solutions off-list :)

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: All Taxi Services need Index Clustered Heap Append

2018-03-05 Thread Komяpa
>
> This approach mixes well with hash
> partitioning. It would be neat indeed if PostgreSQL do something
> equivalent on its own, and pluggable storage work being done could
> enable index organized tables that would help. But you probably need
> something right now.
>

Fixing the glaring issues (no vacuum and thus no Index-Only Scans on
append-only tables, vacuum walking the whole eternity of the btree) by 11
will get most of the spike-nails out of the microservice code, and we can
probably live with them until 11 gets to RDS.

I also don't see why pluggable storage is a must for clustered writes.
Postgres does have a mechanism for selecting the next page to write a tuple
to; right now it just looks at the FSM - but what if it also peeked at an
existing index that already has enough data to route the tuple to the correct
page on write?



> I guess I don't have to tell you that it looks like your needs have
> outgrown what RDS works well with and you are in for a painful move
> sooner or later.
>

A painful move where to? If we just run a Postgres instance without RDS we'll
get the pain of setting up Postgres, replication, backups and autofailover,
with no visible gain unless we apply some private / unaccepted patches on
top. If we can get these things fixed upstream, why would we want to switch?

Per my colleagues, MySQL offers a clustered index, and MySQL is available on
RDS without the need for a "painful move": such a move would be doable by
writing to two locations for a day and then pointing readers to the new DB.
But if we can instead avoid the move and be sure the issues are fixed
upstream before we hit the limits of the spike-nails we're currently running
on, wouldn't that be better? :)

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


All Taxi Services need Index Clustered Heap Append

2018-03-02 Thread Komяpa
Hi,

I work at a ride-sharing company and we found a simple scenario where
Postgres has a lot of room to improve.
After my talk at pgconf.ru, Alexander Korotkov encouraged me to share my
story and thoughts on -hackers.

Story setting:

 - Amazon RDS (thus vanilla, unpatchable Postgres), synchronous_commit=off
(we're ok with losing a few seconds of data on a crash).

 - Amazon gp2 storage. It's almost like an SSD for small spikes, but it has
an IOPS limit if you're doing a lot of operations.

 - Drivers driving around the city and sending their GPS positions every
second; for simplicity, 50k of them at a time.

 - An append-only table of shape (id uuid, ts timestamp, geom geometry,
heading speed accuracy float, source text).
A btree index on (id, ts).

 - A service that receives the measurements from the network, batches
everything into a 1-second buffer (~N=50k rows) and inserts it via COPY.

After someone orders and completes a trip, we get the driver's id and the
trip's time interval from another service, and want to fetch the trip's route
to calculate the bill. Trips run from seconds up to 4 hours
(N=4*3600=14400 rows; a page typically holds 60 rows).

select * from positions where id = :id and ts between :start_ts and :end_ts;

Data older than 24 hours is not needed in this service, thus 70 GB of storage
should be enough. To be on the safe side we're giving it 500 GB, which on gp2
gives a steady 1500 IOPS.

In development (on synthetic data) the plans used the index and looked great,
so we proceeded with Postgres and not MySQL. :)

When deployed to production, we found out that a single query for an interval
of more than half an hour (1800+ rows) can exhaust all the IOPS.
Data is appended with an increasing time field, which effectively ensures
that no two rows from the same driver ever land in the same heap page. A
4-hour request can degrade the system for 10 seconds. gp2 provides at most
10000 IOPS, and we then need to fetch 14400 pages. We need the biggest
available gp2 offering just to read 2 megabytes of data in 1.5 seconds. The
best io-optimized io1 volumes provide 32000 IOPS, which gets us as low as
500 ms.

If the data were perfectly packed, it would be just 240 8 kB pages,
translating to 120 input operations on gp2's 16 kB blocks.

Our options were:

 - partitioning. Not entirely trivial when your id is a uuid. To get visible
gains, we need to make sure each driver gets their own partition. That would
leave us with 50 000(+) tables, and rumor says that's what is done in some
bigger taxi service, and the relcache then eats up all the RAM and the system
OOMs.

 - CLUSTER table USING index. Works perfectly on a test stand, but isn't
available as an online option.

 - Postgres Pro suggested CREATE INDEX .. INCLUDE (on commitfest
https://commitfest.postgresql.org/15/1350/). We can't use that as it's not
in upstream/amazon Postgres yet.

 - We decided to live with the overhead of unnecessarily sorting by all the
fields and keeping a copy of the heap, and created a btree over all the
fields to utilize Index-Only Scans:

  * Testing went well on a dump of the production database.

  * After we built the indexes in production, we found out that performance
was _worse_ than with the simpler index.

  * EXPLAIN (BUFFERS) revealed that the Visibility Map bits never get set, as
autovacuum ignores an append-only, never-updated, never-deleted table that is
only truncated once a day. There is no way to force autovacuum on such a
table.

  * We created a microservice (it's hard to find a spot for a crontab in a
distributed system!) that periodically and aggressively runs VACUUM on the
table. It indeed helped with queries, but... VACUUM skips the all-visible
pages of the heap, yet always walks over all the pages of the btree, which in
our case is even larger than the heap.
There is a patch on commitfest to repair this behavior
https://commitfest.postgresql.org/16/952/ - but it's not yet in
upstream/Amazon Postgres.

  * We ended up inventing a partitioning scheme that rotates the table every
gigabyte of data, to keep VACUUM run time low. There are hundreds of
partitions with indexes on all the fields. Finally the system is stable.

  * EXPLAIN (BUFFERS) later showed that each reading query visits all the
indexes of the partitioned table and fetches a page from each index just to
learn there are 0 rows there. To prune obviously unneeded partitions, we
decided to add a constraint on the timestamp once a partition is finalized.
Timestamps are stamped on the client side and sanitized because of mobile
network instability, so we don't know the bounds in advance. Unfortunately
that means we need two seq scans to do it: a first one to get min(ts),
max(ts), and a second one for ALTER TABLE ADD CONSTRAINT (sketched right
after this list). This operation also eats up IOPS.
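For illustration, the two passes over a finalized partition currently look
roughly like this (the partition name and bounds are made up):

select min(ts), max(ts) from positions_p0042;

-- second full read of the partition just to validate the constraint
alter table positions_p0042
    add constraint positions_p0042_ts_check
    check (ts >= '2018-02-20 00:00+00' and ts < '2018-02-20 01:10+00');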

We are not a very large company, but we have already bumped into too many
scalability issues on this path. Searching for solutions at every step turns
up other people with tables named things like "gpsdata" and "traces", so
we're not alone with this problem. :)

I gave this all some thought and it looks like it all could have not happened
if Postgres was able to cluster heap insertions by the (id, ts) index.

Re: Cast jsonb to numeric, int, float, bool

2018-03-01 Thread Komяpa
> Attached new version of the patch in which I removed duplicated code
using new subroutine JsonbExtractScalar(). I am not sure what is better to
do when a JSON item has an unexpected type: to throw an  error or to return
SQL NULL. Also JSON nulls could be converted to SQL NULLs.

I would expect it to follow whatever happens in JavaScript.
I'm unsure about the mapping of SQL NULL to undefined/null, though.

> I should note here that expression (jb -> 'key')::datatype can be
rewritten with SQL/JSON function JSON_VALUE: JSON_VALUE(jb, '$.key'
RETURNING datatype ERROR ON ERROR)

I would expect some casts to be implicit, so that chaining with other
functions is possible:

select ST_MakePoint(r->'lon', r->'lat');

select sum(r->'income');
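
Today the same thing has to go through text extraction and an explicit cast,
which is what I'd like to avoid spelling out every time (r is a jsonb column
here):

select ST_MakePoint((r->>'lon')::float8, (r->>'lat')::float8);

select sum((r->>'income')::numeric);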

> But by standard JSON_VALUE tries to cast string JSON items to the
specified datatype too, so JSON_VALUE('{"key": "123"}'::jsonb, '$.key'
RETURNING int ERROR ON ERROR) does not throw an error but returns 123.

In actual JSON implementations the number datatype is usually the one
available in browsers: double precision. For some numbers (I've hit this with
nanoseconds) that leads to the value changing on subsequent serializations
and deserializations, so it's common to wrap them in a string to keep them
intact.
So, I would expect that cast to work, but to give me an exception if the
datatype loses precision when converting a specific value.
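
A quick illustration of that precision loss, using a made-up
nanosecond-scale epoch value:

select 1519900000123456789::float8::bigint;  -- low-order digits get rounded away
select '1519900000123456789'::bigint;        -- intact when carried as a string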


Re: Better testing coverage and unified coding for plpgsql loops

2018-01-02 Thread Komяpa
Hello!

> However, while I was doing that, it seemed like the tests I was adding
> were mighty repetitive, as many of them were just exactly the same thing
> adjusted for a different kind of loop statement.  And so I began to wonder
> why it was that we had five copies of the RC_FOO management logic, no two
> quite alike.  If we only had *one* copy then it would not seem necessary
> to have such duplicative test cases for it.  A bit of hacking later, and
> I had the management logic expressed as a macro, with only one copy for
> all five kinds of loop.  I verified it still passes the previous set of
> tests and then removed the ones that seemed redundant, yielding
> plpgsql-unify-loop-rc-code.patch below.  So what I propose actually
> committing is the combination of these two patches.
>

I have looked into plpgsql-unify-loop-rc-code.patch.
I have two questions:

 - how do the currently existing coverage tools display coverage for such a
large macro?

I expect #define's to be treated as comments.

I've looked at https://coverage.postgresql.org/src/port/qsort.c.gcov.html and
on line 70 I see a similar multi-line define that is yellow in the coverage
report, i.e. not counted at all. I think the "higher coverage" effect you are
seeing is mostly due to code being hidden from the coverage counter, not
actually better testing. Another thing I see is that most defines live in .h
files, and those are also mostly absent from the coverage report.

 - can this macro become a function?