Re: [HACKERS] KNN-GiST with recheck

2015-05-15 Thread Alexander Korotkov
On Fri, May 15, 2015 at 2:49 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 On Fri, May 15, 2015 at 2:48 PM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

 On 05/15/2015 11:31 AM, Alexander Korotkov wrote:

 On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

  On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:

  I think this is now ready for committing, but I'm pretty tired now so
 I'll read through this one more time in the morning, so that I won't
 wake up to a red buildfarm.


 Forgot to attach the latest patch, here you go.



 Looks good to me.


 Ok, pushed after some further minor cleanup.


 Great! Thank you!


BTW, I found that the IndexScan node now lacks copy and output support for
indexorderbyops.
The attached patch fixes that. The copy and output functions assume that
indexorderbyops has the same length as indexorderby. To make this more
evident, I moved the check for best_path->path.pathkeys in create_plan from
an if-statement into an assertion. AFAICS, pathkeys should always be present
when there is indexorderby.

--
With best regards,
Alexander Korotkov.


fix-indexscan-node.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Enhanced ALTER OPERATOR

2015-05-18 Thread Alexander Korotkov
On Mon, May 18, 2015 at 5:44 PM, Uriy Zhuravlev u.zhurav...@postgrespro.ru
wrote:


 Then need to prohibit the use of shells operators (stub operators) to
 create
 self-linkage. Implicitly changing commutators and negators working for a
 long
 time through the shell operators.


I can give another motivation. AFAICS, ALTER OPERATOR should typically
introduce enhancements. For instance, some version of an extension didn't
have a negator for an operator, and the next version introduces such a
negator. Or the same situation with selectivity estimation. If ALTER
OPERATOR introduces only enhancements, then old plans could be suboptimal,
but they won't lead to invalid query answers. From this point of view, cache
invalidation after ALTER OPERATOR is a matter of optimization. We could
include in the patch documentation a statement about possible side effects
with cached query plans.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] KNN-GiST with recheck

2015-05-14 Thread Alexander Korotkov
On Wed, May 13, 2015 at 10:17 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 One quick comment:

 It would be good to avoid the extra comparisons of the distances, when
 the index doesn't return any lossy items. As the patch stands, it adds one
 extra copyDistances() call and a cmp_distances() call for each tuple (in a
 knn-search), even if there are no lossy tuples.


 I will fix it by Friday.


The attached patch is rebased against current master. The extra
copyDistances() call and cmp_distances() call for each tuple are avoided
in the case of no lossy tuples.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-9.patch
Description: Binary data



Re: [HACKERS] KNN-GiST with recheck

2015-04-17 Thread Alexander Korotkov
On Wed, Feb 25, 2015 at 12:15 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 Hi!

 On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra 
 tomas.von...@2ndquadrant.com wrote:

 On 17.2.2015 14:21, Alexander Korotkov wrote:
  On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
  aekorot...@gmail.com mailto:aekorot...@gmail.com wrote:
 
  Revised patch with reordering in GiST is attached
  (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).

 I meant to do a bit of testing on this (assuming it's still needed), but
 the patches need rebasing - Heikki fixed a few issues, so they don't
 apply cleanly.


 Both patches are revised.


Both patches are rebased against current master.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-8.patch
Description: Binary data


knn-gist-recheck-in-gist-3.patch
Description: Binary data



Re: [HACKERS] Why no jsonb_exists_path()?

2015-06-10 Thread Alexander Korotkov
Josh,

On Tue, Jun 9, 2015 at 9:16 PM, Josh Berkus j...@agliodbs.com wrote:

 Dmitry, Alexander:

 I'm noticing a feature gap for JSONB operators; we have no way to do this:

 jsonb_col ? ARRAY['key1','key2','key3']


What documents do you expect this operator to match?
Such syntax can be interpreted with very different semantics: checking keys
only at the top level, at any level, as a sequence starting from the top
level, etc.

'{"key1": "value1", "key2": "value2", "key3": "value3"}'
'{"key1": {"key2": {"key3": "value"}}}'
'{"key1": "value1"}'
'[{"key1": "value1"}, {"key2": "value2"}, {"key3": "value3"}]'

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Enhanced ALTER OPERATOR

2015-05-29 Thread Alexander Korotkov
On Thu, May 28, 2015 at 6:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
  Could we address both this problems by denying changing existing
  commutators and negator? ISTM that setting absent commutator and negator
 is
  quite enough for ALTER OPERATOR. User extensions could need setting of
  commutator and negator because they could add new operators which don't
  exist before. But it's rather uncommon to unset or change commutator or
  negator.

 Note that this functionality is already covered, in that you can specify
 the commutator/negator linkage when you create the second operator.
 I'm not particularly convinced that we need to have it in ALTER OPERATOR.


Yeah, I didn't notice that CREATE OPERATOR sets the commutator and negator on
the opposite side as well. Then we can probably leave ALTER OPERATOR without
support for altering the commutator and negator.

BTW, does DROP OPERATOR work correctly?

# create operator == (procedure = int8eq, leftarg = bigint, rightarg =
bigint);
CREATE OPERATOR
# create operator !== (procedure = int8ne, leftarg = bigint, rightarg =
bigint, negator = ==);
CREATE OPERATOR
# select oid, * from pg_operator where oprnamespace = (select oid from
pg_namespace where nspname = 'public');
  oid  | oprname | oprnamespace | oprowner | oprkind | oprcanmerge | oprcanhash | oprleft | oprright | oprresult | oprcom | oprnegate | oprcode | oprrest | oprjoin
-------+---------+--------------+----------+---------+-------------+------------+---------+----------+-----------+--------+-----------+---------+---------+---------
 46355 | !==     |         2200 |       10 | b       | f           | f          |      20 |       20 |        16 |      0 |     46354 | int8ne  | -       | -
 46354 | ==      |         2200 |       10 | b       | f           | f          |      20 |       20 |        16 |      0 |     46355 | int8eq  | -       | -
(2 rows)
# create table test (id bigint);
CREATE TABLE
# explain verbose select * from test where not id == 10::bigint;
  QUERY PLAN
---
 Seq Scan on public.test  (cost=0.00..38.25 rows=1130 width=8)
   Output: id
   Filter: (test.id !== '10'::bigint)
(3 rows)
# drop operator !== (int8, int8);
DROP OPERATOR
# select oid, * from pg_operator where oprnamespace = (select oid from
pg_namespace where nspname = 'public');
  oid  | oprname | oprnamespace | oprowner | oprkind | oprcanmerge | oprcanhash | oprleft | oprright | oprresult | oprcom | oprnegate | oprcode | oprrest | oprjoin
-------+---------+--------------+----------+---------+-------------+------------+---------+----------+-----------+--------+-----------+---------+---------+---------
 46354 | ==      |         2200 |       10 | b       | f           | f          |      20 |       20 |        16 |      0 |     46355 | int8eq  | -       | -
(1 row)
# explain verbose select * from test where not id == 10::bigint;
ERROR:  cache lookup failed for function 0

DROP OPERATOR leaves a broken reference in oprnegate. Should we fix it?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] PGCon hacker lounge

2015-05-27 Thread Alexander Korotkov
On Wed, May 27, 2015 at 7:00 PM, Dan Langille d...@langille.org wrote:

 Have you been to PGCon before?  Do you remember the hacker lounge?  Do you
 remember going there to work on stuff?  Do you recall anything about it?


I remember trying to visit it in 2012 or 2013. At that time I found an empty
room with nobody there. I didn't try to visit it again after that.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Enhanced ALTER OPERATOR

2015-05-28 Thread Alexander Korotkov
On Wed, May 27, 2015 at 9:28 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, May 18, 2015 at 10:21 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Uriy Zhuravlev u.zhurav...@postgrespro.ru writes:
  I have attached a patch that extends ALTER OPERATOR to support
 COMMUTATOR,
  NEGATOR, RESTRICT and JOIN.
 
  There are fairly significant reasons why we have not done this, based
  on the difficulty of updating existing cached plans that might have
  incidentally depended on those operator properties during creation.
  Perhaps it's all right to simply ignore such concerns, but I would like
  to see a defense of why.

 I don't think there's a direct problem with cached plans, because it
 looks like plancache.c blows away the entire plan cache for any change
 to pg_operator.  OperatorUpd() can already update oprcom and
 oprnegate, which seems to stand for the proposition that it's OK to
 set those from InvalidOid to something else.  But that doesn't prove
 that other kinds of changes are safe.

 A search of other places where oprcom is used in the code led me to
 ComputeIndexAttrs().  If an operator whose commutator is itself were
 changed so that the commutator was anything else, then we'd end up
 with a broken exclusion constraint.  That's a problem.
 targetIsInSortList is run during parse analysis, and that too tests
 whether sortop == get_commutator(scl->sortop).  Those decisions
 wouldn't get reevaluated if the truth of that expression changed after
 the fact, which I suspect is also a problem.


Could we address both of these problems by denying changes to existing
commutators and negators? ISTM that setting an absent commutator or negator is
quite enough for ALTER OPERATOR. User extensions could need to set the
commutator and negator because they can add new operators which didn't
exist before. But it's rather uncommon to unset or change a commutator or
negator.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-24 Thread Alexander Korotkov
On Mon, Aug 24, 2015 at 5:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
  On Mon, Aug 10, 2015 at 7:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Hm.  So one way or the other we're going to end up violating relational
  theory somewhere.  OK, I yield: let's say that pg_am has amname, amkind,
  amhandler, and nothing else.  Then we will need SQL functions to expose
  whatever information we think needs to be available to SQL code.

  There is a second revision of this patch. The changes are as follows:

   * AmRoutine was renamed to IndexAmRoutine, assuming there could be other
  access methods in the future.
   * amhandler now returns the index_am_handler pseudotype.
   * The CHECK_PROCEDUREs are now in the place of the original GET_REL_PROCEDUREs.
   * amstrategies, amsupport, amcanorderbyop, amstorage, and amkeytype are in
  both pg_am and IndexAmRoutine. Consistency between the amhandler answer and
  the pg_am tuple is checked.

 [ scratches head... ]  I thought we'd just agreed we weren't going to keep
 any of those pg_am columns?  If we keep them, we'll have to define what
 they mean for sequence AMs etc.  (Let them be NULL would likely break
 enough stuff that we might as well not have them.)


From the previous discussion I see the following options:
1) Non-index access methods reuse neither pg_class.relam nor pg_am. Fully
relationally compliant, but a complex catalog structure.
2) Non-index access methods reuse pg_class.relam but don't reuse pg_am.
This violates relational theory because a single column references multiple
tables.
3) Non-index access methods reuse both pg_class.relam and pg_am. This
violates relational theory because we store different kinds of objects in the
same table.

I'd say we already have a precedent for #2: pg_depend, which references
objects of arbitrary types.
With #3 we really shouldn't keep anything specific to index AMs in pg_am.

But what do we assume an access method to be in general? For instance, we
have foreign data wrappers, which aren't access methods (but look quite
similar to some degree).

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Microvacuum for gist.

2015-07-30 Thread Alexander Korotkov
Hi!

On Thu, Jul 30, 2015 at 2:51 PM, Anastasia Lubennikova 
lubennikov...@gmail.com wrote:

 I have written microvacuum support for the gist access method.
 Briefly, microvacuum includes two steps:
 1. When a search tells us that a tuple is invisible to all transactions, it
 is marked LP_DEAD and the page is marked as having dead tuples.
 2. Then, when an insert touches a full page which has dead tuples, it calls
 microvacuum instead of splitting the page.
 You can find a kind of review here [1].

 [1]
 http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120

 Patch is in attachements. Please review it.


Nice!

Some notes about this patch.

1) Could you give some test case demonstrating that microvacuum really works
with the patch? Finally, we should see the index grow less with microvacuum.

2) Generating notices for every dead tuple would be too noisy. I suggest
replacing the notice with one of the debug levels.

+  elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d",
offnum, BufferGetBlockNumber(buffer));


3) Please recheck the coding style. For instance, this line needs more
spaces and the open brace should be on the next line.

+ if ((scan->kill_prior_tuple)&&(so->curPageData > 0)&&(so->curPageData ==
so->nPageData)) {

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] 64-bit XIDs again

2015-07-30 Thread Alexander Korotkov
Hackers,

I know there have already been a couple of threads about 64-bit XIDs.
http://www.postgresql.org/message-id/42cccfe9.9040...@intellilink.co.jp
http://www.postgresql.org/message-id/4f6c0e13.3080...@wiesinger.com
I read them carefully, but I didn't find all the arguments for 64-bit XIDs
mentioned. That's why I'd like to raise this subject again.

Hardware capabilities are now much higher than when Postgres was designed.
In modern PostgreSQL scalability tests it's typical to achieve 400,000 -
500,000 tps with pgbench. At such rates it takes only a few minutes to
reach the default autovacuum_freeze_max_age = 200 million.

The notion of wraparound has evolved over time. Initially it was
something that almost never happened. Then it became something that could
happen rarely, and which we should be ready for (freeze tuples in advance).
Now it has become quite a frequent periodic event for a high-load database,
and DB admins have to take its performance impact into account.

A typical scenario that I've faced in real life goes like this. A database is
divided into operative and archive parts. The operative part is small (dozens
of gigabytes) and serves most of the transactions. The archive part is
relatively large (some terabytes) and serves rare selects and bulk inserts.
Autovacuum works very actively on the operative part and very lazily on the
archive part (as expected). The system works well until one day the age of
the archive tables exceeds autovacuum_freeze_max_age. Then all autovacuum
workers start vacuuming to prevent wraparound on the archive tables. Even if
the system I/O survives this, the operative tables get bloated because all
autovacuum workers are busy with the archive tables. In such situations I
typically advise increasing autovacuum_freeze_max_age and running vacuum
freeze manually when the system has enough free resources.

As I mentioned in the CSN thread, it would be nice to replace the XID with
the CSN when setting hint bits for a tuple. In that case, once hint bits are
set we don't need any additional lookups to check visibility.
http://www.postgresql.org/message-id/CAPpHfdv7BMwGv=ofug3s-jgvfkqhi79pr_zk1wsk-13oz+c...@mail.gmail.com
Introducing a 32-bit CSN doesn't seem reasonable to me, because it would
double our troubles with wraparound.

Also, I think it's possible to migrate to 64-bit XIDs without breaking
pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples would be
created with 64-bit XIDs. We can use the free bits in t_infomask2 to
distinguish the old and new formats.

Any thoughts? Do you think 64-bit XIDs worth it?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Alexander Korotkov
On Thu, Jul 30, 2015 at 5:24 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/30/2015 04:26 PM, Alexander Korotkov wrote:

 Also, I think it's possible to migrate to 64-bit XIDs without breaking
 pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples
 would be created with 64-bit XIDs. We can use the free bits in t_infomask2
 to distinguish the old and new formats.


 I think we should move to 64-bit XIDs in in-memory structs snapshots, proc
 array etc. And expand clog to handle 64-bit XIDs. But keep the xmin/xmax
 fields on heap pages at 32-bits, and add an epoch-like field to the page
 header so that logically the xmin/xmax fields on the page are 64 bits wide,
 but physically stored in 32 bits. That's possible as long as no two XIDs on
 the same page are more than 2^31 XIDs apart. So you still need to freeze
 old tuples on the page when that's about to happen, but it would make it
 possible to have more than 2^32 XID transactions in the clog. You'd never
 be forced to do anti-wraparound vacuums, you could just let the clog grow
 arbitrarily large.


Nice idea. Storing the extra epoch would cost an extra 4 bytes per heap tuple
instead of the extra 8 bytes per tuple of storing 64-bit xmin/xmax.
But if the first column is aligned to 8 bytes (e.g. bigserial), would we lose
this 4-byte win to alignment?


There is a big downside to expanding xmin/xmax to 64 bits: it takes space.
 More space means more memory needed for caching, more memory bandwidth,
 more I/O, etc.



--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] 64-bit XIDs again

2015-07-31 Thread Alexander Korotkov
On Fri, Jul 31, 2015 at 12:23 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 But the elephant in the room is on-disk compatibility.  There is
 absolutely no way that we can just change xmin/xmax to 64 bits without a
 disk format break.


That seems problematic. But I'm not yet convinced that there is absolutely
no way to do this.


 However, if we do something like what Heikki is
 suggesting, it's at least conceivable that we could convert incrementally
 (ie, if you find a page with the old header format, assume all tuples in
 it are part of epoch 0; and do not insert new tuples into it unless there
 is room to convert the header to new format ... but I'm not sure what we
 do about tuple deletion if the old page is totally full and we need to
 write an xmax that's past 4G).


If a user upgrades a database cluster with pg_upgrade, he would stop the old
postmaster, run pg_upgrade, and start the new postmaster. That means we start
from a point where there are no running transactions. Thus, among tuples of
the old format there are two kinds: visible to everybody and invisible to
everybody. When we update or delete an old tuple of the first kind, we don't
actually need to store its xmin anymore. We can store a 64-bit xmax in the
place of xmin/xmax.

So, in order to switch to 64-bit xmin/xmax, we have to take both free bits
from t_infomask2 in order to implement it. They should indicate one of 3
possible tuple formats:
1) Old format: both xmin/xmax are 32-bit.
2) Intermediate format: xmax is 64-bit, xmin is frozen.
3) New format: both xmin/xmax are 64-bit.

But we can use the same idea to implement an epoch in the heap page header as
well. If the new page header doesn't fit on the page, then we don't have to
insert anything into that page; we just need to set xmax and flags on the
existing tuples. Then we can use two of the formats listed above, #1 and #2,
and take just one free bit from t_infomask2 for format indication.

Probably I'm missing something, but I think keeping on-disk compatibility
should somehow be possible.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Patch for ginCombineData

2015-08-05 Thread Alexander Korotkov
Hi!

On Wed, Aug 5, 2015 at 1:17 PM, Robert Abraham 
robert.abraha...@googlemail.com wrote:

 We are using GIN indexes on big tables. These tables happen to have
 several billion rows.
 The index creation fails in ginCombineData in
 src/backend/access/gin/ginbulk.c because repalloc is limited to 1 GB.
 This limitation makes no sense in this context (compare the comments in
 src/include/utils/memutils.h).
 To overcome this limitation on tables with lots of rows, repalloc_huge is
 used.
 The test suite still succeeds on my machine.
 Find the patch attached,


Thank you for the notice and for the patch!
You would need maintenance_work_mem > 1GB and some very frequent entry so
that its posting list exceeds 1 GB by itself.
These circumstances shouldn't be very rare on modern systems. I think it
could be backpatched.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Microvacuum for gist.

2015-08-03 Thread Alexander Korotkov
On Mon, Aug 3, 2015 at 12:27 PM, Anastasia Lubennikova 
a.lubennik...@postgrespro.ru wrote:

 1) Test and results are in attachments. Everything seems to work as
 expected.
 2) I dropped these notices. It was done only for debug purposes. Updated
 patch is attached.
 3) fixed


Good! Another couple of notes from me:
1) I think gistvacuumpage() and gistkillitems() need function-level
comments.
2) ItemIdIsDead() can be used in an index scan like it's done
in _bt_checkkeys().

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] 64-bit XIDs again

2015-07-31 Thread Alexander Korotkov
On Fri, Jul 31, 2015 at 1:27 AM, Petr Jelinek p...@2ndquadrant.com wrote:

 On 2015-07-30 23:23, Tom Lane wrote:

 But the elephant in the room is on-disk compatibility.  There is
 absolutely no way that we can just change xmin/xmax to 64 bits without a
 disk format break.  However, if we do something like what Heikki is
 suggesting, it's at least conceivable that we could convert incrementally
 (ie, if you find a page with the old header format, assume all tuples in
 it are part of epoch 0; and do not insert new tuples into it unless there
 is room to convert the header to new format ...


 We could theoretically do similar thing with 64bit xmin/xmax though -
 detect page is in old format and convert all tuples there to 64bit
 xmin/xmax.

 But I agree that we don't want to increase bloat per tuple as it's already
 too big.

 but I'm not sure what we
 do about tuple deletion if the old page is totally full and we need to
 write an xmax that's past 4G).


 If the page is too full we could move some data to different (or new) page.

 For me bigger issue is that we'll still have to refreeze pages because
 if tuples are updated or deleted in different epoch than the one they were
 inserted in, the new version of tuple has to go to different page and the
 old page will have free space that can't be used by new tuples since the
 system is now in different epoch.


It is not so easy to move a heap tuple to a different page. When a table has
indexes, each tuple is referenced by index tuples as (blockNumber; offset),
and we can't remove those references without a vacuum. Thus, we would have to
invent something like multi-page HOT in order to move tuples between pages.
And that seems to be a complicated kludge.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-14 Thread Alexander Korotkov
On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev 
i.kurbangal...@postgrespro.ru wrote:

 On 08/05/2015 09:33 PM, Robert Haas wrote:

 On Wed, Aug 5, 2015 at 1:10 PM, Ildus Kurbangaliev
 i.kurbangal...@postgrespro.ru wrote:

 About `memcpy`, PgBackendStatus struct already have a bunch of multi-byte
 variables,  so it will be
 not consistent anyway if somebody will want to copy it in that way. On
 the
 other hand two bytes in this case
 give less overhead because we can avoid the offset calculations. And as
 I've
 mentioned before the class
 of wait will be useful when monitoring of waits will be extended.

 You're missing the point.  Those multi-byte fields have additional
 synchronization requirements, as I explained in some detail in my
 previous email. You can't just wave that away.

 I see that now. Thank you for the point.

 I've looked deeper and I found PgBackendStatus to be not a suitable
 place for keeping information about low level waits. Really,
 PgBackendStatus
 is used to track high level information about backend. This is why
 auxiliary
 processes don't have PgBackendStatus, because they don't have such
 information
 to expose. But when we come to low-level wait events, auxiliary
 processes are as useful for monitoring as backends are. The WAL writer,
 checkpointer, bgwriter etc. use LWLocks as well. It is certainly unclear
 why they can't be monitored.

 This is why I think we shouldn't place the wait event into PgBackendStatus.
 It could be placed into PGPROC or even a separate data structure with a
 different concurrency model which would be most suitable for monitoring.


+1 for tracking wait events not only for backends.

Ildus, could you do the following?
1) Extract the LWLocks refactoring into a separate patch.
2) Make a patch that stores the current wait event information in PGPROC.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-10 Thread Alexander Korotkov
On Mon, Aug 10, 2015 at 6:36 PM, Petr Jelinek p...@2ndquadrant.com wrote:

 On 2015-08-10 16:58, Alexander Korotkov wrote:

 That should work, thanks! Also we can have SQL-visible functions to get
 amsupport and amstrategies and use them in the regression tests.


 SQL-visible functions would be preferable to storing it in pg_am as
 keeping the params in pg_am would limit the extensibility of pg_am itself.


I actually meant just two particular functions (not per-AM), both taking an
AM oid as an argument. They would call amhandler and return amsupport and
amstrategies respectively.
These functions can be used in regression tests to check opclass and
opfamily correctness.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-10 Thread Alexander Korotkov
On Mon, Aug 10, 2015 at 5:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
  On Mon, Aug 10, 2015 at 1:12 PM, Petr Jelinek p...@2ndquadrant.com
 wrote:
  I don't understand this, there is already AmRoutine in RelationData, why
  the need for additional field for just amsupport?

  We need amsupport in load_relcache_init_file(), which reads
  pg_internal.init. I'm not sure this is the correct place to call
  am_handler.
  It should work in the case of a built-in AM. But if the AM is defined in an
  extension, then we wouldn't be able to do a catalog lookup for am_handler
  at this stage of initialization.

 This is an issue we'll have to face before there's much hope of having
 index AMs as extensions: how would you locate any extension function
 without catalog access?  Storing raw function pointers in pg_internal.init
 is not an answer in an ASLR world.

 I think we can dodge the issue so far as pg_internal.init is concerned by
 decreeing that system catalogs can only have indexes with built-in AMs.
 Calling a built-in function doesn't require catalog access, so there
 should be no problem with re-calling the handler function by OID during
 load_relcache_init_file().


That should work, thanks! Also we can have SQL-visible functions to get
amsupport and amstrategies and use them in the regression tests.


 We could also have problems with WAL replay, though I think the consensus
 there is that extension AMs have to use generic WAL records that don't
 require any index-specific replay code.


Yes, I've already shown one version of generic WAL records. The main
objection to them was that it's hard to ensure that a composed WAL record is
correct.
Now I'm working on a new version which will be much easier and safer to use.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-10 Thread Alexander Korotkov
On Mon, Aug 10, 2015 at 6:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Petr Jelinek p...@2ndquadrant.com writes:
  On 2015-08-10 16:58, Alexander Korotkov wrote:
  That should work, thanks! Also we can have SQL-visible functions to get
  amsupport and amstrategies and use them in the regression tests.

  SQL-visible functions would be preferable to storing it in pg_am as
  keeping the params in pg_am would limit the extensibility of pg_am
 itself.

 I don't see any particularly good reason to remove amsupport and
 amstrategies from pg_am.  Those are closely tied to the other catalog
 infrastructure for indexes (pg_amproc, pg_amop) which I don't think are
 candidates for getting changed by this patch.

 There are a couple of other pg_am columns, such as amstorage and
 amcanorderbyop, which similarly bear on what's legal to appear in
 related catalogs such as pg_opclass.  I'd be sort of inclined to
 leave those in the catalog as well.  I do not see that exposing
 a SQL function is better than exposing a catalog column; either
 way, that property is SQL-visible.


That answers my question, thanks!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-10 Thread Alexander Korotkov
On Mon, Aug 10, 2015 at 1:12 PM, Petr Jelinek p...@2ndquadrant.com wrote:

 On 2015-08-09 23:56, Alexander Korotkov wrote:

  Hackers,

  Some time ago I proposed patches implementing CREATE ACCESS METHOD.

 http://www.postgresql.org/message-id/capphfdsxwzmojm6dx+tjnpyk27kt4o7ri6x_4oswcbyu1rm...@mail.gmail.com
  As I gather from the comments on my patches and also from Tom's comment
  about the AM interface in the tablesampling thread, the AM interface needs
  reworking. And AFAICS the AM interface rework is essential to have a
  CREATE ACCESS METHOD command.
 http://www.postgresql.org/message-id/5438.1436740...@sss.pgh.pa.us


 Cool, I was planning to take a stab at this myself when I have time, so I
 am glad you posted this.

  This is why I'd like to show a WIP patch implementing the AM interface
  rework. The patch is quite dirty yet, but I think it illustrates the idea
  quite clearly. An AM now has a single parameter – a handler function. This
  handler returns a pointer to AmRoutine, which has the same data as pg_am
  had before. Thus, the API is organized very much like the FDW API.


 I wonder if it would make sense to call it IndexAmRoutine instead in case
 we add other AMs (like the sequence am, or maybe column store if that's
 done as AM) in the future.


Good point!


 The patch should probably add the am_handler type which should be return
 type of the am handler function (see fdw_handler and tsm_handler types).


Sounds reasonable!


 I also think that the CHECK_PROCEDUREs should be in the place of the
 original GET_REL_PROCEDUREs  (just after relation check). If the interface
 must exist we may as well check for it at the beginning and not after we
 did some other work which is useless without the interface.


Ok, good point too.

However, this patch appears to take more work than I expected. It has to
 make many changes spread across many files.


 Yeah you need to change the definition and I/O handling of every AM
 function, but that's to be expected.


Yes, this is why I decided to get some feedback on this stage of work.


 Also, it turns out not to be so easy to hide amsupport inside AmRoutine,
 because it's needed by the relcache. As a temporary solution it's
 duplicated in RelationData.


 I don't understand this, there is already AmRoutine in RelationData, why
 the need for additional field for just amsupport?


We need amsupport in load_relcache_init_file(), which reads
pg_internal.init. I'm not sure that is the correct place to call am_handler.
It should work in the case of a built-in AM, but if the AM is defined in an
extension then we wouldn't be able to do a catalog lookup for am_handler at
that stage of initialization.

Another point is that amsupport and amstrategies are used for regression
tests of opclasses and opfamilies. Thus, we can probably keep them in pg_am.
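For illustration, the FDW-like shape discussed upthread could look roughly
like this. This is only a sketch: the struct name, fields and handler name
here are made up for the example and are not the actual patch.

```c
#include <stdint.h>

/* Sketch of the proposed IndexAmRoutine: one struct carrying what the
 * pg_am columns used to carry, plus the AM entry points.  All names and
 * fields here are illustrative, not the actual patch. */
typedef struct IndexAmRoutine
{
    uint16_t    amstrategies;       /* formerly pg_am.amstrategies */
    uint16_t    amsupport;          /* formerly pg_am.amsupport */
    int         (*ambuild) (void);  /* stand-in for the real entry points */
} IndexAmRoutine;

static int
btbuild_stub(void)
{
    return 0;
}

/* The single SQL-visible function an AM would register: the handler.
 * The core would call it once and cache the returned routine table. */
static const IndexAmRoutine *
bthandler_stub(void)
{
    static const IndexAmRoutine routine = {5, 3, btbuild_stub};

    return &routine;
}
```

The point of the pattern is that only the handler needs a catalog entry;
everything else travels in the returned struct, exactly as for FDWs.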

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Proposal for CSN based snapshots

2015-07-27 Thread Alexander Korotkov
On Sat, Jul 25, 2015 at 11:39 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 24 July 2015 at 19:21, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Jul 24, 2015 at 1:00 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
  It depends on the exact design we use to get that. Certainly we do not
 want
  them if they cause a significant performance regression.

 Yeah.  I think the performance worries expressed so far are:

 - Currently, if you see an XID that is between the XMIN and XMAX of
 the snapshot, you hit CLOG only on first access.  After that, the
 tuple is hinted.  With this approach, the hint bit doesn't avoid
 needing to hit CLOG anymore, because it's not enough to know whether
 or not the tuple committed; you have to know the CSN at which it
 committed, which means you have to look that up in CLOG (or whatever
 SLRU stores this data).  Heikki mentioned adding some caching to
 ameliorate this problem, but it sounds like he was worried that the
 impact might still be significant.


 This seems like the heart of the problem. Changing a snapshot from a list
 of xids into one number is easy. Making XidInMVCCSnapshot() work is the
 hard part because there needs to be a translation/lookup from CSN to
 determine if it contains the xid.

 That turns CSN into a reference to a cached snapshot, or a reference by
 which a snapshot can be derived on demand.


I see the problem. Currently, once we set hint bits we don't have to visit
the CLOG anymore. With CSN snapshots that is not so: we still have to
translate the XID into a CSN in order to compare it with the snapshot's CSN.
In the version of the CSN patch in this thread we still have XMIN and XMAX
in the snapshot. AFAICS, with CSN snapshots XMIN and XMAX are not strictly
required to express a snapshot; they were kept as an optimization. That
restricts usage of the XID => CSN map to the given range of XIDs. However,
with long-running transactions the [XMIN; XMAX] range could be very wide and
we could end up using the XID => CSN map heavily across a wide range of XIDs.

As I can see from the Huawei PGCon talk, a Dense Map in shared memory is
proposed for the XID => CSN transformation. Having a large enough Dense Map,
we can do most XID => CSN transformations with a single shared memory
access. The PGCon talk gives us pgbench results. However, pgbench doesn't
run any long transactions. With a long-running transaction we could run out
of the Dense Map for a significant fraction of XID => CSN transformations.
Dilip, did you test your patch with long transactions?

I'm also thinking about a different option for optimizing this. When we set
hint bits we could also replace XMIN/XMAX with the CSN. In this case we
wouldn't need to do the XID => CSN transformation for this tuple anymore.
However, we have 32-bit XIDs for now. We could also have 32-bit CSNs, but
that would double our troubles with wraparound: we would have two counters
that could wrap around. That could bring us back to thoughts about 64-bit
XIDs.
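To make the trade-off concrete, here is a minimal sketch of what the
visibility check becomes under CSN snapshots. The names and the toy array
standing in for the XID => CSN SLRU are illustrative only, not the patch's
actual code.

```c
#include <stdint.h>

typedef uint64_t CSN;
#define INVALID_CSN ((CSN) 0)   /* transaction not committed yet */

/* Toy stand-in for the XID => CSN map (a CLOG-like SLRU in the patch). */
static CSN xid_csn_map[32];

static CSN
TransactionIdGetCSN(uint32_t xid)
{
    return xid_csn_map[xid];
}

/* Under CSN snapshots, "xid is still in progress for this snapshot"
 * reduces to one comparison -- but it needs the commit CSN, so even a
 * hinted-committed tuple requires a map lookup, unlike today's
 * hint-bit fast path that avoids the CLOG entirely. */
static int
XidInMVCCSnapshot(uint32_t xid, CSN snapshot_csn)
{
    CSN commit_csn = TransactionIdGetCSN(xid);

    return commit_csn == INVALID_CSN || commit_csn >= snapshot_csn;
}
```

This is why the width of the [XMIN; XMAX] range matters: every XID inside
it pays for that lookup, however cheap the comparison itself is.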

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Selectivity estimation for intarray with @@

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 6:49 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/21/2015 03:44 PM, Alexander Korotkov wrote:

 While Uriy is on vacation, I've revised this patch a bit.


 I whacked this around quite a bit, and I think it's in a committable state
 now. But if you could run whatever tests you were using before on this, to
 make sure it still produces the same estimates, that would be great. I
 didn't change the estimates it should produce, only the code structure.

 One thing that bothers me slightly with this patch is the way it peeks
 into the Most-Common-Elements arrays, which is produced by the built-in
 type analyze function. If we ever change what statistics are collected for
 arrays, or the way they are stored, this might break. In matchsel, why
 don't we just call the built-in estimator function for each element that we
 need to probe, and not look into the statistics ourselves at all? I
 actually experimented with that, and it did slash much of the code, and it
 would be more future-proof. However, it was also a lot slower for queries
 that contain multiple values. That's understandable: the built-in estimator
 will fetch the statistics tuple, parse the arrays, etc. separately for each
 value in the query_int, while this patch will do it only once for the whole
 query, and perform a simple binary search for each value. So overall, I
 think this is OK as it is. But if we find that we need to use the MCE list
 in this fashion in more places in the future, it might be worthwhile to add
 some support code for this in the backend to allow extracting the stats
 once, and doing multiple lightweight estimations using the extracted
 stats.


Yeah, I see. We could end up with something like this. But probably we
would need something more general for extensions which want to play with
statistics.
For instance, pg_trgm could estimate selectivity for the text % text
operator, but in order to provide that it needs trigram statistics. Today
that could be implemented by defining a separate datatype, but it's a
kludge. Probably, we would end up with custom additional statistics for
datatypes.
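The once-per-query MCE probe Heikki describes can be sketched like this
(the function and parameter names are assumptions for the example; the real
code works with pg_statistic's values/numbers arrays):

```c
#include <stddef.h>

/* One binary search per probed element over the sorted
 * most-common-elements list; elements absent from the list fall back to
 * a default frequency.  This cheap inner step is what replaces a full
 * per-value call into the built-in estimator. */
static double
mce_element_selec(const int *mce_values, const double *mce_freqs,
                  size_t nmce, int elem, double default_freq)
{
    size_t      lo = 0;
    size_t      hi = nmce;

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (mce_values[mid] < elem)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < nmce && mce_values[lo] == elem)
        return mce_freqs[lo];
    return default_freq;
}
```

Fetching and parsing the statistics tuple once and then running this per
value is what makes the multi-value case fast.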


 Some things I fixed/changed:

 * I didn't like that transformOperator() function, which looked up the
 function's name. I replaced it with separate wrapper functions for each
 operator, so that the built-in operator's OID can be hardcoded into each.

 * I refactored the matchsel function heavily. I think it's more readable
 now.

 * I got rid of the Int4Freq array. It didn't seem significantly easier to
 work with than the separate values/numbers arrays, so I just used those
 directly.

 * Also use the matchsel estimator for ~~ (the commutator of @@)


In this version of the patch it's not checked whether the variable is
actually an int[] and not a query_int. See the following test case.

# create table test2 as (select '1'::query_int val from
generate_series(1,100));
# analyze test2;

# explain select * from test2 where '{1}'::int[] @@ val;
ERROR:  unrecognized int query item type: 0

I've added this check.

* Also use the estimators for the obsolete @ and ~ operators. Not that I
 care much about those as they are obsolete, but seems strange not to, as
 it's a trivial matter of setting the right estimator function.

 * I added an ANALYZE in the regression test. It still won't systematically
 test all the cost estimation code, and there's nothing to check that the
 estimates make sense, but at least more of the code will now run.


You also forgot to include intarray--1.0--1.1.sql in the patch. I've added
it as well.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


intarray-sel-4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-07-22 Thread Alexander Korotkov
On Wed, Jul 22, 2015 at 8:48 AM, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Mon, Jul 20, 2015 at 9:18 PM, Alexander Korotkov
 a.korot...@postgrespro.ru wrote:
  attached patch allows pg_rewind to work when target timeline was
 switched.
  Actually, this patch fixes TODO from pg_rewind comments.
 
/*
 * Trace the history backwards, until we hit the target timeline.
 *
 * TODO: This assumes that there are no timeline switches on the target
 * cluster after the fork.
 */
 
  This patch allows pg_rewind to handle data directory synchronization in
  a much more general way. For instance, a user can return a promoted
  standby to the old master.

 Yes. That would be useful.

  In this patch the target timeline history is exposed as a global
  variable. An index into the target timeline history is used in function
  interfaces instead of specifying the TLI directly. Thus,
  SimpleXLogPageRead() can easily start reading XLOGs from the next
  timeline when the current timeline ends.

 The implementation looks rather neat by having a first look at it, but
 the attached patch fails to compile:
 pg_rewind.c:488:4: error: use of undeclared identifier 'targetEntry'
 targetEntry = targetHistory[i];
 ^
 Nice to see as well that this has been added to the CF of September.


Uh, sorry. Fixed version is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-2.patch
Description: Binary data



Re: [HACKERS] Selectivity estimation for intarray with @@

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 5:14 PM, Teodor Sigaev teo...@sigaev.ru wrote:

 Some notices about code:

 1 Near function transformOperator() there is wrong operator name @


Fixed.


 2 int_query (and intquery) should be replaced to query_int to be
 consistent with actual type name. At least where it's used as separate
 lexeme.


Fixed.


 3 For historical reasons @@ doesn't commutate with itself, it has a
 commutator ~~. Patch assumes that @@ is self-commutator, but ~~ will use
 'contsel' as a restrict estimation. Suppose, we need to declare ~~ as
 deprecated and introduce commutating @@ operator.


I think deprecating ~~ is a subject for a separate patch. I fixed the
patch's behavior in a different way: @@ and ~~ are now handled by the same
function. The function handles "var @@ const" and "const ~~ var", but
doesn't handle "const @@ var" and "var ~~ const". It determines the case by
the type of the variable: it should be int[].

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


intarray_sel-3.patch
Description: Binary data



Re: [HACKERS] Fillfactor for GIN indexes

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 7:20 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/21/2015 04:14 PM, Alexander Korotkov wrote:

 On Tue, Jul 21, 2015 at 3:52 PM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

  On 07/21/2015 02:56 PM, Alexander Korotkov wrote:

  Probably, but currently we are in quite an illogical situation. We have
 a default fillfactor = 90 for GiST, where it has no use cases at all and
 is effectively just a waste of space.


 Why is it useless for GiST?



 It's because all GiST pages are produced by page splits. So, just after
 CREATE INDEX, GiST pages aren't tightly packed on average. Actually, they
 could be tightly packed by incredible coincidence, but for large indexes
 it's quite a safe assumption that they are not. With GiST we don't get a
 storm of page splits after index creation with fillfactor = 100. So, why
 should we reserve additional space with fillfactor = 90?


 Aha, I see. Yeah, that's pretty useless. Ideally, we would make the GiST
 build algorithm smarter so that it would pack the pages more tightly. I
 have no idea how to do that, however.

 Anyway, the fact that fillfactor is useless for GiST is more of an
 argument for removing it from GiST, than for adding it to GIN.


Yes it is. I just think that fillfactor is more useful for GIN than for
GiST now. However, it probably isn't worth it for either of them.
One of our customers complains that freshly built GIN indexes get bloated
very fast. I've asked them for a test case. Using that test case I will try
to see if fillfactor for GIN could help in practice. For now we can mark it
as Returned with feedback, where the feedback is "real-life use cases
needed".
Removing fillfactor from GiST is not easy now: once it's exposed to users,
removing it completely would break compatibility. I would propose to change
the default fillfactor for GiST from 90 to 100.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-07-17 Thread Alexander Korotkov
On Fri, Jul 17, 2015 at 6:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Peter Geoghegan p...@heroku.com writes:
  I've heard that clock_gettime() with CLOCK_REALTIME_COARSE, or with
  CLOCK_MONOTONIC_COARSE can have significantly lower overhead than
  gettimeofday().

 It can, but it also has *much* lower precision, typically 1ms or so.


I've written a simple benchmark of QueryPerformanceCounter() for Windows.
The source code follows.

#include <stdio.h>
#include <windows.h>
#include <Winbase.h>

int main(int argc, char* argv[])
{
LARGE_INTEGER start, freq, current;
long count = 0;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);
current = start;
while (current.QuadPart < start.QuadPart + freq.QuadPart)
{
QueryPerformanceCounter(&current);
count++;
}
printf("QueryPerformanceCounter() per second: %ld\n", count);
return 0;
}

On my virtual machine it runs 1532746 QueryPerformanceCounter() calls per
second. In contrast, my MacBook can natively run 26260236 gettimeofday()
calls per second.
So, the performance of PostgreSQL's instr_time.h can vary by more than an
order of magnitude. It's possible that we could find systems where
measurements of time are even much slower.
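For reference, a gettimeofday() counterpart of the Windows loop above could
look like the following POSIX-only sketch (the duration is made a parameter
so short runs are possible; this is my reconstruction, not the exact code
used on the MacBook):

```c
#include <stdio.h>
#include <sys/time.h>

/* Count how many gettimeofday() calls fit into the given number of
 * microseconds -- the same busy-loop shape as the Windows benchmark. */
static long
count_gettimeofday_calls(long duration_usec)
{
    struct timeval  start, current;
    long            count = 0;

    gettimeofday(&start, NULL);
    current = start;
    while ((current.tv_sec - start.tv_sec) * 1000000L +
           (current.tv_usec - start.tv_usec) < duration_usec)
    {
        gettimeofday(&current, NULL);
        count++;
    }
    return count;
}
```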
In general, there could be some systems where accurate measurement of time
intervals is impossible or slow. That means we should provide them some
different solution, like sampling. But does that mean we should force the
majority of systems to use sampling, which is both slower and less accurate
for them? Could we end up with different options for the user?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Proposal for CSN based snapshots

2015-07-24 Thread Alexander Korotkov
On Wed, Aug 27, 2014 at 10:42 AM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 08/27/2014 09:40 AM, Heikki Linnakangas wrote:

 On 08/27/2014 08:23 AM, Jeff Davis wrote:

 On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:

 Yeah. This patch in the current state is likely much much slower than
 unpatched master, except in extreme cases where you have thousands of
 connections and short transactions so that without the patch, you spend
 most of the time acquiring snapshots.


 What else are you looking to accomplish with this patch during this
 'fest? Bug finding? Design review? Performance testing?


 Design review, mostly. I know the performance still sucks. Although if
 you can foresee some performance problems, aside from the extra CSNLOG
 lookups, it would be good to know.


 I think for this commitfest, I've gotten as much review of this patch that
 I can hope for. Marking as Returned with Feedback. But of course, feel
 free to continue reviewing and commenting ;-).


What is the current state of this patch? Does the community want CSN-based
snapshots?
The last email in the lists was in August 2014. Huawei talked about their
further research on this idea at PGCon and promised to publish their patch
as open source.
http://www.pgcon.org/2015/schedule/events/810.en.html
However, it isn't published yet, and we don't know how long we would have
to wait for it.
Now, our company has resources to work on CSN-based snapshots for 9.6. If
Huawei does not publish their patch, we can pick up this work from the last
community version of this patch.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] Use pg_rewind when target timeline was switched

2015-07-20 Thread Alexander Korotkov
Hackers,

the attached patch allows pg_rewind to work when the target timeline was
switched. Actually, this patch fixes a TODO from the pg_rewind comments.

  /*
   * Trace the history backwards, until we hit the target timeline.
   *
   * TODO: This assumes that there are no timeline switches on the target
   * cluster after the fork.
   */

This patch allows pg_rewind to handle data directory synchronization in a
much more general way. For instance, a user can return a promoted standby
to the old master.

In this patch the target timeline history is exposed as a global variable.
An index into the target timeline history is used in function interfaces
instead of specifying the TLI directly. Thus, SimpleXLogPageRead() can
easily start reading XLOGs from the next timeline when the current timeline
ends.
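The core lookup can be sketched as follows. The types, field names and
function name are illustrative assumptions; in the patch the history comes
from parsing the target's timeline history file.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* One entry of the exposed target timeline history: which WAL range
 * belongs to which timeline. */
typedef struct TimeLineHistoryEntry
{
    uint32_t    tli;
    XLogRecPtr  begin;      /* inclusive */
    XLogRecPtr  end;        /* exclusive; 0 means "open-ended" */
} TimeLineHistoryEntry;

/* Return the index in the history whose range covers the given point.
 * A reader like SimpleXLogPageRead() can then step to index + 1 when it
 * reads past the end of the current timeline. */
static int
historyIndexOfPoint(XLogRecPtr ptr,
                    const TimeLineHistoryEntry *history, int nentries)
{
    int         i;

    for (i = 0; i < nentries; i++)
    {
        if (ptr >= history[i].begin &&
            (history[i].end == 0 || ptr < history[i].end))
            return i;
    }
    return -1;              /* point predates the recorded history */
}
```

Passing the history index around instead of a bare TLI is what lets the
page-read callback advance timelines transparently.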

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg_rewind_target_switch.patch
Description: Binary data



Re: [HACKERS] pg_trgm version 1.2

2015-07-20 Thread Alexander Korotkov
On Wed, Jul 15, 2015 at 12:31 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Tue, Jul 7, 2015 at 6:33 AM, Alexander Korotkov 
 a.korot...@postgrespro.ru wrote:



  See Tom Lane's comment about downgrade scripts. I think just removing it
  is the right solution.


 The new patch removes the downgrade path and the ability to install the
 old version.

 (If anyone wants an easy downgrade path for testing, they can keep using
 the prior patch--no functional changes)

 It also added a comment before the trigramsMatchGraph call.

 I retained the palloc and the loop to promote the ternary array to a
 binary array.  While I also think it is tempting to get rid of that by
 abusing the type system and would do it that way in my own standalone code,
 it seems contrary to way the project usually does things.  And I couldn't
 measure a difference in performance.


Ok!


 



  Let's consider the '^(?!.*def).*abc' regular expression as an example. It
  matches strings which contain 'abc' and don't contain 'def'.
 # SELECT 'abc' ~ '^(?!.*def).*abc';
  ?column?
 --
  t
 (1 row)

 # SELECT 'def abc' ~ '^(?!.*def).*abc';
  ?column?
 --
  f
 (1 row)

 # SELECT 'abc def' ~ '^(?!.*def).*abc';
  ?column?
 --
  f
 (1 row)

 Theoretically, our trigram regex processing could support negative
 matching of 'def' trigram, i.e. trigramsMatchGraph(abc = true, def = false)
 = true but trigramsMatchGraph(abc = true, def = true) = false. Actually, it
 doesn't because trigramsMatchGraph() implements a monotonic function. I
 just think it should be stated explicitly.


 Do you think it is likely to change to stop being monotonic and so support
 the (def=GIN_TRUE) = false case?

 ^(?!.*def)   seems like a profoundly niche situation.  (Although one that
 I might actually start using myself now that I know it isn't just a
 Perl-ism).

 It doesn't make any difference to this patch, other than perhaps how to
 word the comments.


Yes, it was just about the comments.

I also ran a few tests on a real-life dataset: a set of DBLP paper titles.
As expected, there is a huge speedup when the pattern is long enough.

# SELECT count(*) FROM dblp_titles WHERE s ILIKE '%Multidimensional Data%';
 count
---
   318
(1 row)

Time: 29,114 ms

# SELECT count(*) FROM dblp_titles WHERE s ~* '.*Multidimensional Data.*';
 count
---
   318
(1 row)

Time: 26,273 ms

# SELECT count(*) FROM dblp_titles WHERE s ILIKE '%Multidimensional Data%';
 count
---
   318
(1 row)

Time: 249,725 ms

# SELECT count(*) FROM dblp_titles WHERE s ~* '.*Multidimensional Data.*';
 count
---
   318
(1 row)

Time: 218,627 ms

For me, patch is in the pretty good shape. I'm going to mark it Ready for
committer.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-21 Thread Alexander Korotkov
 not as useful as btree
fillfactor, but it still could be useful in some cases...
Probably, we could leave 100 as default fillfactor, but provide an option.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 2:49 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 Hmm. That's slightly different from the test case I used, in that the
 update is changing the indexed value, which means that the new value goes
 to different posting tree than the old one. I'm not sure if that makes any
 difference. You're also updating more rows, 1/50 vs. 1/100.

 This case is closer to my earlier one:

 postgres=# create table foo (id int4, i int4, t text) with (fillfactor=90);
 CREATE TABLE
 postgres=# insert into foo select g, 1 from  generate_series(1, 100) g;
 INSERT 0 100
 postgres=# create index btree_foo_id on foo (id); -- to force non-HOT
 updates
 CREATE INDEX
 postgres=# create index gin_i on foo using gin (i) with (fastupdate=off);
 CREATE INDEX
 postgres=# \di+
List of relations
  Schema | Name | Type  | Owner  | Table |  Size   | Description
 +--+---++---+-+-
  public | btree_foo_id | index | heikki | foo   | 21 MB   |
  public | gin_i| index | heikki | foo   | 1072 kB |
 (2 rows)

 postgres=# update foo set id=-id where id % 100 = 0;
 UPDATE 1
 postgres=# \di+
List of relations
  Schema | Name | Type  | Owner  | Table |  Size   | Description
 +--+---++---+-+-
  public | btree_foo_id | index | heikki | foo   | 22 MB   |
  public | gin_i| index | heikki | foo   | 1080 kB |
 (2 rows)

 I'm not sure what's making the difference to your test case. Could be
 simply that your case happens to leave less free space on each page,
 because of the different data.


Yeah, it's likely because of different data.

 Based on this, I think we should just drop this patch. It's not useful in
 practice.


 I wouldn't say it's not useful at all. It's for sure not as useful as
 btree
 fillfactor, but it still could be useful in some cases...
 Probably, we could leave 100 as default fillfactor, but provide an option.


 Doesn't seem worth the trouble to me...


Probably, but currently we are in quite an illogical situation. We have a
default fillfactor = 90 for GiST, where it has no use cases at all and is
effectively just a waste of space. On the other hand, we're rejecting
fillfactor for GIN, where it could have at least some use cases... Should
we have fillfactor for neither GiST nor GIN?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 3:52 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/21/2015 02:56 PM, Alexander Korotkov wrote:

  Probably, but currently we are in quite an illogical situation. We have
  a default fillfactor = 90 for GiST, where it has no use cases at all and
  is effectively just a waste of space.


 Why is it useless for GiST?


It's because all GiST pages are produced by page splits. So, just after
CREATE INDEX, GiST pages aren't tightly packed on average. Actually, they
could be tightly packed by incredible coincidence, but for large indexes
it's quite a safe assumption that they are not. With GiST we don't get a
storm of page splits after index creation with fillfactor = 100. So, why
should we reserve additional space with fillfactor = 90?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Selectivity estimation for intarray with @@

2015-07-21 Thread Alexander Korotkov
Hi!

While Uriy is on vacation, I've revised this patch a bit.

On Tue, Jul 14, 2015 at 8:55 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 Hi Uriy,

 This patch looks pretty good.

 The first line of intarray--1.1.sql mis-identifies itself as /*
 contrib/intarray/intarray--1.0.sql */


Fixed.


 The real file intarray--1.0.sql file probably should not be included in
 the final patch, but I like having it for testing.


I've removed intarray--1.0.sql since it shouldn't be in final commit.


 It applies and builds cleanly over the alter operator patch (and now the
 commit as well), passes make check of the contrib module both with and
 without cassert.

 I could succesfully upgrade from version 1.0 to 1.1 without having to drop
 the gin__int_ops indexes in the process.

 I could do pg_upgrade from 9.2 and 9.4 to 9.6devel with large indexes in
 place, and then upgrade the extension to 1.1, and it worked without having
 to rebuild the index.

 It does what it says, and I think we want this.


Good.


 There were some cases where the estimates were not very good, but that
 seems to be limitation of what pg_stats makes available, not of this
 patch.  Specifically if the elements listed in the query text are not part
 of most_common_elems (or worse yet, most_common_elems is null) then it is
 left to guess with no real data, and those guesses can be pretty bad.  It
 is not this patches job to fix that, however.



 It assumes all the stats are independent and so doesn't account for
 correlation between members.  This is also how the core selectivity
 estimates work between columns, so I can't really complain about that.  It
  is easy to trick it with things like @@ '(!300 & 300)'::query_int, but I
 don't think that is necessary to fix that.


Analyzing all the dependencies inside a query is an NP-complete task. We
could try to work out simple cases, but the postgres optimizer currently
doesn't care about it.

# explain select * from test where val = 'val' and val != 'val';
 QUERY PLAN
-
 Seq Scan on test  (cost=0.00..39831.00 rows=45 width=8)
   Filter: ((val <> 'val'::text) AND (val = 'val'::text))
(2 rows)

I think we could do the same inside intquery until we figure out some
better solution for postgres optimizer in general.
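For illustration, the standard independence-based combination over a query
tree — which is exactly what makes '!300 & 300' mis-estimated — can be
sketched as follows. The node layout and names are made up for the example;
the patch's actual clause handling differs.

```c
/* Recursive selectivity of a boolean query tree under the independence
 * assumption: AND multiplies, OR adds minus the overlap, NOT inverts.
 * '!300 & 300' therefore gets p * (1 - p) instead of the true 0. */
typedef enum { Q_VAL, Q_AND, Q_OR, Q_NOT } QKind;

typedef struct QNode
{
    QKind               kind;
    double              selec;  /* per-element estimate, Q_VAL leaves only */
    const struct QNode *left;
    const struct QNode *right;  /* unused for Q_NOT */
} QNode;

static double
query_selec(const QNode *q)
{
    switch (q->kind)
    {
        case Q_VAL:
            return q->selec;
        case Q_AND:
            return query_selec(q->left) * query_selec(q->right);
        case Q_OR:
        {
            double      l = query_selec(q->left);
            double      r = query_selec(q->right);

            return l + r - l * r;
        }
        case Q_NOT:
            return 1.0 - query_selec(q->left);
    }
    return 0.0;
}
```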


 I have not been creative enough to come up with queries for which this
 improvement in selectivity estimate causes the execution plan to change in
 important ways.  I'm sure the serious users of this module would have such
 cases, however.

 I haven't tested gist__int_ops as I can't get those indexes to build in a
 feasible amount of time.  But the selectivity estimates are independent of
 any actual index, so testing those doesn't seem to be critical.

 There is no documentation change, which makes sense as this internal stuff
 which isn't documented to start with.


Yes. For instance, tsquery makes a very similar selectivity estimation to
intquery, but it's assumed to be internal and isn't mentioned in the
documentation.

There are no regression test changes.  Not sure about that, it does seem
 like regression tests would be desirable.


It would be nice to cover selectivity estimation with regression tests, but
AFAICS we haven't done it for any selectivity estimation functions yet.
The problem is that selectivity estimation is quite a complex process and
its result depends on random sampling during analyze, floating-point
operations and so on. We could make a test like "with a very high level of
confidence, the estimated number of rows here should be in [10;100]". But
it's hard to fit such assertions into our current regression test
infrastructure.

I haven't gone through the C code.


I also did some coding style and comments modifications.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


intarray_sel-2.patch
Description: Binary data



Re: [HACKERS] clearing opfuncid vs. parallel query

2015-10-23 Thread Alexander Korotkov
On Fri, Oct 23, 2015 at 12:31 PM, YUriy Zhuravlev <
u.zhurav...@postgrespro.ru> wrote:

> On Thursday 22 October 2015 09:26:46 David Fetter wrote:
> > On Thu, Oct 22, 2015 at 07:15:35PM +0300, YUriy Zhuravlev wrote:
> > > Hello.
> > > Currently using nodeToString and stringToNode you can not pass a
> > > full plan. In this regard, what is the plan to fix it? Or in the
> > > under task parallel query does not have such a problem?
> > >
> > > > This turns out not to be straightforward to code, because we don't
> > > > have a generic plan tree walker,
> > >
> > > I have an inner development. I am using python analyzing header
> > > files and generates a universal walker (parser, paths ,executer and
> > > etc trees), as well as the serializer and deserializer to jsonb.
> > > Maybe I should publish this code?
> >
> > Please do.
> Tom Lane and Robert Haas are very unhappy with a python. Is there any
> reason?
>

Requiring python with pycparser as a build dependency is a serious
cataclysm. For instance, how many buildfarm members would survive it?
That's why Tom and Robert are looking for ways to avoid it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-10-29 Thread Alexander Korotkov
On Thu, Sep 24, 2015 at 6:36 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Sep 24, 2015 at 6:32 PM, Andres Freund <and...@anarazel.de> wrote:
>
>> On 2015-09-15 20:16:10 +0300, YUriy Zhuravlev wrote:
>> > We will be tested.
>>
>> Did you have a chance to run some benchmarks?
>>
>
> Yes, we now have 60 physical cores intel server and we're running
> benchmarks on it.
>

We reached a consensus with Andres that we should commit the CAS version
first and look at other optimizations afterwards.
A refactored version of the atomic state patch is attached. The changes are
the following:
1) Macros are used to access refcount and usagecount.
2) likely/unlikely were removed. I think introducing likely/unlikely should
be a separate patch since it touches portability. Also, I didn't see any
performance effect from it.
3) LockBufHdr returns the state after taking the lock. Even without using
atomic increments, this can still save some loops by skipping a separate
atomic read of the value.
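In outline, the CAS-based pin being discussed looks like this C11-atomics
sketch. The bit layout and macro names are toy assumptions for the example;
the actual patch packs refcount, usage count and flags into the buffer
header state differently.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy state layout: low 16 bits = refcount, next 16 bits = usage count. */
#define BUF_REFCOUNT_ONE            1u
#define BUF_REFCOUNT_MASK           0xFFFFu
#define BUF_STATE_GET_REFCOUNT(s)   ((s) & BUF_REFCOUNT_MASK)

/* CAS loop replacing the buffer-header spinlock for pinning: read the
 * state once, then retry the compare-and-swap until no one raced us. */
static uint32_t
PinBuffer_cas(_Atomic uint32_t *state)
{
    uint32_t    oldstate = atomic_load(state);
    uint32_t    newstate;

    do
    {
        newstate = oldstate + BUF_REFCOUNT_ONE;
    } while (!atomic_compare_exchange_weak(state, &oldstate, newstate));

    return newstate;    /* like LockBufHdr here: hand back the fresh state */
}
```

Returning the post-CAS state is what lets callers inspect refcount and
flags without issuing another atomic read.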

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pinunpin-cas.patch
Description: Binary data



Re: [HACKERS] Some questions about the array.

2015-11-11 Thread Alexander Korotkov
On Mon, Nov 9, 2015 at 8:23 PM, Pavel Stehule <pavel.steh...@gmail.com>
wrote:

> 2015-11-09 17:55 GMT+01:00 Alexander Korotkov <a.korot...@postgrespro.ru>:
>
>> On Mon, Nov 9, 2015 at 4:53 PM, Pavel Stehule <pavel.steh...@gmail.com>
>> wrote:
>>
>>> 2015-11-09 14:44 GMT+01:00 YUriy Zhuravlev <u.zhurav...@postgrespro.ru>:
>>>
>>>> On Monday 09 November 2015 13:50:20 Pavel Stehule wrote:
>>>> > New symbols increase a complexity of our code and our documentation.
>>>> >
>>>> > If some functionality can be implemented via functions without
>>>> performance
>>>> > impacts, we should not to create new operators or syntax - mainly for
>>>> > corner use cases.
>>>> >
>>>> > Regards
>>>> >
>>>> > Pavel
>>>>
>>>> Ok we can use {:} instead [:] for zero array access.
>>>> The function is the solution half.
>>>>
>>>
>>> It isn't solution. The any syntax/behave change have to have stronger
>>> motivation. We had so talk about it 20 years ago :(
>>>
>>
>> Assuming array[~n] already has a meaning, could we try some new syntax
>> that doesn't have a current meaning? I'm not yet sure what exactly it
>> could be...
>>
>
> Using this syntax can introduce compatibility issues -
> http://www.postgresql.org/docs/9.1/static/sql-createoperator.html
>

I actually meant some other syntax which doesn't introduce compatibility
issues. For instance, array{n} doesn't have a meaning in the current
syntax, AFAICS.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Some questions about the array.

2015-11-09 Thread Alexander Korotkov
On Mon, Nov 9, 2015 at 4:53 PM, Pavel Stehule <pavel.steh...@gmail.com>
wrote:

> 2015-11-09 14:44 GMT+01:00 YUriy Zhuravlev <u.zhurav...@postgrespro.ru>:
>
>> On Monday 09 November 2015 13:50:20 Pavel Stehule wrote:
>> > New symbols increase a complexity of our code and our documentation.
>> >
>> > If some functionality can be implemented via functions without
>> performance
>> > impacts, we should not to create new operators or syntax - mainly for
>> > corner use cases.
>> >
>> > Regards
>> >
>> > Pavel
>>
>> Ok we can use {:} instead [:] for zero array access.
>> The function is the solution half.
>>
>
> It isn't solution. The any syntax/behave change have to have stronger
> motivation. We had so talk about it 20 years ago :(
>

Assuming array[~n] already has a meaning, could we try some new syntax that
doesn't have a current meaning? I'm not yet sure what exactly it could be...

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-11-03 Thread Alexander Korotkov
On Tue, Nov 3, 2015 at 2:36 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> I'm kind of inclined to just let the verifiers read the catalogs for
> >> themselves.  AFAICS, a loop around the results of SearchSysCacheList
> >> is not going to be significantly more code than what this patch does,
> >> and it avoids presuming that we know what the verifiers will wish to
> >> look at.
>
> > Hmm, so this amounts to saying the verifier can only run after the
> > catalog rows are written.  Is that okay?
>
> Why not?  Surely we are not interested in performance-optimizing for
> the case of a detectably incorrect CREATE OP CLASS command.
>
> I don't actually care that much about running the verifiers during
> CREATE/etc OP CLASS at all.  It's at least as important to be able
> to run them across the built-in opclasses, considering all the chances
> for human error in constructing those things.  Even in the ALTER OP CLASS
> case, the verifier would likely need to look at existing catalog rows as
> well as the new ones.  So basically, I see zero value in exposing CREATE/
> ALTER OP CLASS's internal working representation to the verifiers.
>

I'm OK with validating the opclass directly against the system catalog,
i.e. looping over SearchSysCacheList results. Teodor told me something
similar in person.
I'll also try to rearrange the header according to the comments upthread.
I'll also try to rearrange header according to the comments upthread.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-10-30 Thread Alexander Korotkov
On Thu, Oct 29, 2015 at 8:18 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Sep 24, 2015 at 6:36 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> On Thu, Sep 24, 2015 at 6:32 PM, Andres Freund <and...@anarazel.de>
>> wrote:
>>
>>> On 2015-09-15 20:16:10 +0300, YUriy Zhuravlev wrote:
>>> > We will be tested.
>>>
>>> Did you have a chance to run some benchmarks?
>>>
>>
>> Yes, we now have a 60-physical-core Intel server, and we're running
>> benchmarks on it.
>>
>
> Andres and I reached a consensus that we should commit the CAS version
> first and look at other optimizations afterwards.
> A refactored version of the atomic-state patch is attached. The changes are
> as follows:
> 1) Macros are now used to access refcount and usagecount.
> 2) likely/unlikely annotations were removed. I think introducing
> likely/unlikely should be a separate patch, since it touches portability.
> Also, I didn't see any performance effect from them.
> 3) LockBufHdr now returns the state after taking the lock. Even without
> atomic increments, this can still save some loops by skipping a read of
> the atomic value.
>

pinunpin-cas-original-fix.patch is just the original patch by Andres Freund
with a fix for the bug that caused a hang.
A performance comparison on a 72-core Intel server is attached. On this
machine we see no regression with the version of the patch from the
previous letter.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
n_clients,master,pinunpin-cas-original-fix,pinunpin-cas
   1,   18860,   19421,   19972
   8,  162503,  166901,  164363
  16,  316429,  321103,  319214
  36,  627215,  641028,  642854
  56,  745108,  774195,  786642
  90,  847150,  995036, 1022657
 100,  802922, 1077732, 1083769
 110,  615070, 1014446, 1034496
 120,  611956, 1074345, 1094610
 130,  571697, 1082626, 1100578
 140,  579909, 1075855, 1092749
 150,  540442, 1070737, 1089283
 160,  545942, 1068991, 1096360
 170,  515444, 1073508, 1089891
 180,  520867, 1076281, 1093987
 190,  488836, 1071283, 1097871
 200,  493795, 1082849, 1103040
 210,  463765,  967185,  988692
 220,  467661,  972180,  993521
 230,  437297,  976865,  980113
 250,  420711,  956151,  980965


pinunpin-cas-original-fix.patch
Description: Binary data



Re: [HACKERS] Summary of Vienna sharding summit, new TODO item

2015-11-07 Thread Alexander Korotkov
On Sat, Nov 7, 2015 at 7:52 PM, Bruce Momjian <br...@momjian.us> wrote:

> On Wed, Nov  4, 2015 at 07:58:52AM -0500, Bruce Momjian wrote:
> > [BCC to pgsql-cluster-hack...@postgresql.org]
> >
> > I have summarized the Vienna cluster summit meeting from last week by
> > adding italicized answers to the agenda questions, even though we didn't
> > follow the agenda ;-) :
> >
> >   https://wiki.postgresql.org/wiki/PG-EU_2015_Cluster_Summit
> >
> > I would like to propose the following TODO item:
> >
> >   *  Enhance foreign data wrappers, parallelism, partitioning,
> >   and perhaps add a global snapshot/transaction manager to allow
> >   creation of a proof-of-concept built-in sharding solution
> >
> >   Ideally these enhancements and new facilities will be available to
> >   external sharding solutions as well.
> >
> > Is this acceptable?
>
> Added to TODO.


Great, thank you, Bruce!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Access method extendability

2015-11-02 Thread Alexander Korotkov
Hi!

Thank you for review!

On Mon, Sep 7, 2015 at 6:41 PM, Teodor Sigaev <teo...@sigaev.ru> wrote:

> Some notices:
>
> 1) create-am.3.patch.gz
>   As I understand, you didn't add schema name to access method. Why?
> Suppose, if we implement SQL-like interface for AM screation/dropping then
> we should provide a schema qualification for them
>

Per the subsequent discussion, it's unclear why we need to make access
methods schema-qualified.


> 2) create-am.3.patch.gz get_object_address_am()
> +   switch (list_length(objname))
> ...
> +   case 2:
> +   catalogname = strVal(linitial(objname));
> +   amname = strVal(lthird(objname));
> ^^ seems, it should be
> lsecond()
>

Fixed.


> 3) create-am.3.patch.gz
>  Suppose, RM_GENERIC_ID is part of generic-xlog.3.patch.gz
>

Fixed.

> 4) Makefile(s)
> As I can see, object files are lexicographically ordered
>

Fixed.

> 5) gindesc.c -> genericdesc.c in file header
>

Fixed.


> 6) generic-xlog.3.patch.gz
> I don't like an idea to split START_CRIT_SECTION and END_CRIT_SECTION to
> GenericXLogStart and GenericXLogFinish. User's code could throw a error or
> allocate memory between them and error will become a panic.


Fixed. Now, critical section is only inside GenericXLogFinish.
GenericXLogRegister returns pointer to the page copies which could be
modified. GenericXLogFinish swaps the data between page copy and page
itself. That allow us to avoid critical section in used code.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


generic-xlog.4.patch.gz
Description: GNU Zip compressed data


create-am.4.patch.gz
Description: GNU Zip compressed data


bloom-contrib.4.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-10-14 Thread Alexander Korotkov
On Wed, Sep 30, 2015 at 9:46 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Sat, Sep 19, 2015 at 8:27 AM, Michael Paquier wrote:
> > On Fri, Sep 18, 2015 at 6:25 PM, Michael Paquier wrote:
> >> The refactoring of getTimelineHistory as you propose looks like a good
> >> idea to me, I tried to remove by myself the difference between source
> >> and target in copy_fetch.c and friends but this gets uglier,
> >> particularly because of datadir_source in copy_file_range. Not worth
> >> it.
> >
> > Forgot that:
> > if (ControlFile_target.state != DB_SHUTDOWNED)
> > pg_fatal("target server must be shut down cleanly\n");
> > We may want to allow a target node shutdowned in recovery as well here.
>
> So, attached is a more polished version of this patch, cleaned up of
> its typos with as well other things. I have noticed for example that
> it would be more useful to add the debug information of a timeline
> file fetched from the source or a target server directly in
> getTimelineHistory. I have as well updated a couple of comments in the
> code regarding the fact that we do not necessarily use a master as a
> target node, and mentioned in findCommonAncestorTimeline that we check
> as well the start position of a timeline to cover the case where both
> target and source node forked at the same timeline number but with a
> different WAL fork position.
> I am marking this patch as ready for committer. It would be cool in
> the future to use the recovery test suite to have more advanced
> scenarios tested, but it seems a shame to block this patch because of
> that.
>

Thanks a lot. The patch looks much better now.
I found that it doesn't apply cleanly to current master. A rebased version
is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-6.patch
Description: Binary data



Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-10-14 Thread Alexander Korotkov
On Sat, Sep 19, 2015 at 2:25 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> OK, I see your point and you are right. This additional check allows
> pg_rewind to switch one timeline back and make the scan of blocks
> begin at the real origin of both timelines. I had in mind the case
> where you tried to actually rewind node 2 to node 3 actually which
> would not be possible with your patch, and by thinking more about that
> I think that it could be possible to remove the check I am listing
> below and rely on the checks in the history files, basically what is
> in findCommonAncestorTimeline:
> if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
> ControlFile_source.checkPointCopy.ThisTimeLineID)
> pg_fatal("source and target cluster are on the same
> timeline\n");
> Alexander, what do you think about that? I think that we should be
> able to rewind with for example node 2 as target and node 3 as source,
> and vice-versa as per the example you gave even if both nodes are on
> the same timeline, just that they forked at a different point. Could
> you test this particular case? Using my script, simply be sure to
> switch archive_mode to on/off depending on the node, aka only 3 and 4
> do archiving.
>

I think relying on a different fork point is not safe enough. Imagine a
more complex case.

  1
 / \
2   3
|   |
4   5

At first, nodes 2 and 3 are promoted at the same point, and both get
timeline 2.
Then nodes 4 and 5 are promoted at different points, and both get
timeline 3.
Then we can try to rewind node 4 with node 5 as the source, or vice versa.
In this case we can't detect the collision on timeline 2.

The same collision could happen even when source and target are on
different timeline numbers. However, being on the same timeline number is
suspicious enough to disallow the rewind until we have additional checks.

I would propose the following plan:

   1. Commit this patch without allowing a rewind when target and source
   are on the same timeline.
   2. Add checks that distinguish different timelines which happen to have
   the same number.
   3. Allow a rewind when target and source are on the same timeline.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] PoC: Partial sort

2015-10-16 Thread Alexander Korotkov
On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <p...@heroku.com> wrote:

> On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andr...@proxel.se>
> wrote:
> > Are you planning to work on this patch for 9.6?
>
> FWIW I hope so. It's a nice patch.
>

I'm trying to dust it off. A rebased version of the patch is attached. This
patch isn't passing the regression tests because of plan changes. I'm not
yet sure about those changes: why do they happen, and are they really
regressions? Since I'm not very familiar with the planning of INSERT ON
CONFLICT and RLS, any help is appreciated.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


partial-sort-basic-3.patch
Description: Binary data


regression.diffs
Description: Binary data



Re: [HACKERS] PoC: Partial sort

2015-10-20 Thread Alexander Korotkov
On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov <aekorot...@gmail.com>
wrote:

> On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <p...@heroku.com> wrote:
>
>> On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andr...@proxel.se>
>> wrote:
>> > Are you planning to work on this patch for 9.6?
>>
>> FWIW I hope so. It's a nice patch.
>>
>
> I'm trying to dust it off. A rebased version of the patch is attached.
> This patch isn't passing the regression tests because of plan changes.
> I'm not yet sure about those changes: why do they happen, and are they
> really regressions? Since I'm not very familiar with the planning of
> INSERT ON CONFLICT and RLS, any help is appreciated.
>

The planner regression is fixed in the attached version of the patch. It
appears that get_cheapest_fractional_path_for_pathkeys() behaved
incorrectly when no ordering is required.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


partial-sort-basic-4.patch
Description: Binary data



Re: [HACKERS] point_ops for GiST

2015-10-12 Thread Alexander Korotkov
Hi, Stefan!

On Sun, Oct 11, 2015 at 10:00 PM, Stefan Keller <sfkel...@gmail.com> wrote:

> Pls. don't misunderstand my questions: They are directed to get an
> even more useful spatial data handling of PostgreSQL. I'm working with
> PostGIS since years and are interested in any work regarding spatial
> types...
>
> Can anyone report use cases or applications of these built-in geometric
> types?
>
> Would'nt it be even more useful to concentrate to PostGIS
> geometry/geography types and extend BRIN to these types?
>

Note that PostGIS is a different project maintained by a separate team.
PostGIS has its own priorities, development plan, etc.
PostgreSQL has to be self-consistent. In particular, it should have
reference implementations of operator classes which extensions can use as
a pattern. This is why it's important to maintain the built-in geometric
types.

In short: once we implement it for the built-in geometric types, you can
ask the PostGIS team to do the same for their geometry/geography types.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] point_ops for GiST

2015-10-12 Thread Alexander Korotkov
On Mon, Oct 12, 2015 at 12:47 PM, Emre Hasegeli <e...@hasegeli.com> wrote:

> > This was already fixed for GiST.
> > See following discussion
> >
> http://www.postgresql.org/message-id/capphfdvgticgniaj88vchzhboxjobuhjlm6c09q_op_u9eo...@mail.gmail.com
> > and commit
> >
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3c29b196b0ce46662cb9bb7a1f91079fbacbcabb
> > "Consistent" method of GiST influences only search and can't lead to
> corrupt
> > indexes. However, "same" method can lead to corrupt indexes.
> > However, this is not the reason to not backpatch the changes and preserve
> > buggy behaviour; this is the reason to recommend reindexing to users.
> And it
> > was already backpatched.
>
> Fixing it on the opclass is not an option for BRIN.  We designed BRIN
> opclasses extensible using extra SQL level support functions and
> operators.  It is possible to support point datatype using box vs
> point operators.  Doing so would lead to wrong results, because point
> operators use FP macros, but box_contain_pt() doesn't.
>

You can still work around this problem in the opclass. For instance, you
can assign a different strategy number for this case and call another
support function instead of the overlap operator in
brin_inclusion_consistent. For sure, this would be a kludge.


> GiST opclass could be more clean and extensible, if we wouldn't have
> those macros.
>

In my opinion it would be good to remove the FP macros; I see absolutely no
sense in them. But since that would break compatibility, it would be quite
hard.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] point_ops for GiST

2015-10-12 Thread Alexander Korotkov
Hi, Alvaro!

On Tue, May 12, 2015 at 9:13 PM, Alvaro Herrera <alvhe...@2ndquadrant.com>
wrote:

> Robert Haas wrote:
> > 2009/12/30 Teodor Sigaev <teo...@sigaev.ru>:
> > > Sync with current CVS
> >
> > I have reviewed this patch and it looks good to me.  The only
> > substantive question I have is why gist_point_consistent() uses a
> > different coding pattern for the box case than it does for the polygon
> > and circle cases?  It's not obvious to me on the face of it why these
> > aren't consistent.
>
> Emre Hasegeli just pointed out to me that this patch introduced
> box_contain_pt() and in doing so used straight C comparison (<= etc)
> instead of FPlt() and friends.  I would think that that's a bug and
> needs to be changed -- but certainly not backpatched, because gist
> indexes would/might become corrupt.
>

This was already fixed for GiST.
See following discussion
http://www.postgresql.org/message-id/capphfdvgticgniaj88vchzhboxjobuhjlm6c09q_op_u9eo...@mail.gmail.com
and
commit
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3c29b196b0ce46662cb9bb7a1f91079fbacbcabb
GiST's "consistent" method influences only searches and can't lead to
corrupt indexes. The "same" method, however, can lead to corrupt indexes.
Still, that is not a reason to skip backpatching the changes and preserve
the buggy behaviour; it is a reason to recommend that users reindex. And
the fix was already backpatched.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Some questions about the array.

2015-10-09 Thread Alexander Korotkov
On Fri, Oct 9, 2015 at 6:27 PM, Andrew Dunstan <and...@dunslane.net> wrote:

> On 10/09/2015 08:02 AM, YUriy Zhuravlev wrote:
>
>> We were some of the issues associated with the behavior of arrays.
>> 1. We would like to implement arrays negative indices (from the end) like
>> in
>> Python or Ruby: arr[-2]  or arr[1: -1]
>> but as an array can be indexed in the negative area so it probably can
>> not be
>> done.
>> 2. We would like to add the ability be omitted boundaries in the slice.
>> Example: arr[2:] or arr[:2]. But there was a problem with the update of an
>> empty array:
>> arr[1:][1:] = {1,2,3,4,5,6} can be interpreted as
>> arr[1:3][1:2] or arr[1:2] [1:3] or [1:1], [1:6]
>>
>> What is the history of the emergence of such arrays? Maybe something can
>> be
>> improved?
>>
>> P.S. I would like List datatype as in Python.  Is there any fundamental
>> objections? Or we just did not have the time and enthusiasm before?
>> The current implementation I would call vectors or matrices but not
>> arrays.
>> IMHO
>>
>>
>>
> The name array is now far too baked in to change it.
>
> jsonb and json arrays have many of the characteristics you seem to want.
> They are always 0-based and negative indexes count from the end. They also
> don't have to be regular, unlike our native arrays.
>

jsonb and json arrays support a very limited set of types. Casting other
datatypes to/from text is an option, but it is both awkward and not
space-efficient.

Omitted boundaries in slices look nice to me. Regarding the problem with
empty arrays, the current behaviour of empty-array updates doesn't look
consistent to me.
When updating a non-empty array, its boundaries are not extended: if one
updates a non-empty array outside its boundaries, one gets "ERROR:  array
subscript out of range".
If we extrapolate this logic to empty arrays, then this error should be
thrown on any update of an empty array. Despite this, we allow any update
of an empty array.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-10 Thread Alexander Korotkov
On Fri, Jul 10, 2015 at 4:54 AM, Michael Paquier michael.paqu...@gmail.com
wrote:



 On Thu, Jul 9, 2015 at 10:33 PM, Alexander Korotkov wrote:
  [...]

 + /* Caclculate max data size on page according to fillfactor */
 s/Caclculate/Calculate

 When creating a simple gin index, I am seeing that GinGetMaxDataSize gets
 called with ginEntryInsert:
   * frame #0: 0x00010a49d72e
 postgres`GinGetMaxDataSize(index=0x000114168378, isBuild='\x01') + 14
 at gindatapage.c:436
 frame #1: 0x00010a49ecbe
 postgres`createPostingTree(index=0x000114168378,
 items=0x000114a9c038, nitems=35772, buildStats=0x7fff55787350) +
 302 at gindatapage.c:1754
 frame #2: 0x00010a493423
 postgres`buildFreshLeafTuple(ginstate=0x7fff55784d90, attnum=1,
 key=2105441, category='\0', items=0x000114a9c038, nitem=35772,
 buildStats=0x7fff55787350) + 339 at gininsert.c:158
 frame #3: 0x00010a492f84
 postgres`ginEntryInsert(ginstate=0x7fff55784d90, attnum=1, key=2105441,
 category='\0', items=0x000114a9c038, nitem=35772,
 buildStats=0x7fff55787350) + 916 at gininsert.c:228

 And as far as I can see GinGetMaxDataSize uses isBuild:
 + int
 + GinGetMaxDataSize(Relation index, bool isBuild)
 + {
 + int fillfactor, result;
 +
 + /* Grab option value which affects only index build */
 + if (!isBuild)
 However isBuild is not set in this code path, see
 http://www.postgresql.org/message-id/cab7npqsc4vq9mhkqm_yvafcteho-iuy8skbxydnmgnai1xn...@mail.gmail.com
 where I reported the problem. So this code is somewhat broken, no? I am
 attaching to this email the patch in question that should be applied on top
 of Alexander's latest version.


I didn't get what the problem is. Was this stacktrace taken during an index
build? If so, GinGetMaxDataSize really should get isBuild='\x01'. It is set
by the following line:

maxdatasize = GinGetMaxDataSize(index, buildStats ? true: false);


 + #define GIN_MIN_FILLFACTOR20
 + #define GIN_DEFAULT_FILLFACTOR90
 I am still worrying about using a default fillfactor at 90, as we did a
 lot of promotion about compression of GIN indexes in 9.4. Feel free to
 ignore this comment if you think that 90 is a good default. The difference
 is visibly in order of MBs even for large indexes still, this is changing
 the default of 9.4 and 9.5.


On the other hand, it's unclear why we should have a default fillfactor
distinct from btree's.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] pg_trgm version 1.2

2015-07-07 Thread Alexander Korotkov
On Tue, Jun 30, 2015 at 11:28 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov 
 a.korot...@postgrespro.ru wrote:

 On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 This patch implements version 1.2 of contrib module pg_trgm.

 This supports the triconsistent function, introduced in version 9.4 of
 the server, to make it faster to implement indexed queries where some keys
 are common and some are rare.


 Thank you for the patch! Lack of pg_trgm triconsistent support was
 significant miss after fast scan implementation.


 I've included the paths to both upgrade and downgrade between 1.1 and
 1.2, although after doing so you must close and restart the session before
 you can be sure the change has taken effect. There is no change to the
 on-disk index structure


 pg_trgm--1.1.sql andpg_trgm--1.1--1.2.sql are useful for debug, but do
 you expect them in final commit? As I can see in other contribs we have
 only last version and upgrade scripts.



 I had thought that pg_trgm--1.1.sql was needed for pg_upgrade to work, but
 I see that that is not the case.

 I did see another downgrade path for different module, but on closer
 inspection it was one that I wrote while trying to analyze that module, and
 then never removed.  I have no objection to removing pg_trgm--1.2--1.1.sql
 before the commit, but I don't see what we gain by doing so.  If a
 downgrade is feasible and has been tested, why not include it?


See Tom Lane's comment about downgrade scripts. I think just removing it is
the right solution.


 You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in
 core) from 4 to some higher value (which it probably should be anyway, but
 there will always be a case where it needs to be higher than you can afford
 it to be, so a real solution is needed).


 Actually, it depends on how long it takes to calculate consistent
 function. The cheaper consistent function is, the higher MAX_MAYBE_ENTRIES
 could be. Since all functions in PostgreSQL may define its cost, GIN could
 calculate MAX_MAYBE_ENTRIES from the cost of consistent function.


 The consistent function gets called on every candidate tuple, so if it is
 expensive then it is also all the more worthwhile to reduce the set of
 candidate tuples.  Perhaps MAX_MAYBE_ENTRIES could be calculated from the
 log of the maximum of the predictNumberResult entries? Anyway, a subject
 for a different day


Yes, that's also a point. However, we optimize not only the cost of
consistent calls but also the cost of fetching candidate item pointers,
which involves both CPU and I/O.

There may also be some gains in the similarity and regex cases, but I
 didn't really analyze those for performance.



 I've thought about how to document this change.  Looking to other
 example of other contrib modules with multiple versions, I decided that we
 don't document them, other than in the release notes.


 The same patch applies to 9.4 code with a minor conflict in the
 Makefile, and gives benefits there as well.


 Some other notes about the patch:
 * You allocate boolcheck array and don't use it.


 That was a bug (at least notionally).  trigramsMatchGraph was supposed to
 be getting boolcheck, not check, passed in to it.

 It may not have been a bug in practise, because GIN_MAYBE and GIN_TRUE
 both test as true when cast to booleans. But it seems wrong to rely on
 that.  Or was it intended to work this way?

 I'm surprised the compiler suite doesn't emit some kind of warning on that.


I'm not sure. It's attractive to get rid of the useless memory allocation
and copying.
You could also change trigramsMatchGraph() to take a GinTernaryValue *
argument. Casting bool * to GinTernaryValue * probably looks better than
the inverse cast.




 * Check coding style and formatting, in particular check[i]==GIN_TRUE
 should be check[i] == GIN_TRUE.


 Sorry about that, fixed.  I also changed it in other places to check[i] !=
 GIN_FALSE, rather than checking against both GIN_TRUE and GIN_MAYBE.  At
 first I was concerned we might decide to add a 4th option to the type which
 would render  != GIN_FALSE the wrong way to test for it.  But since it is
 called GinTernaryValue, I think we wouldn't add a fourth thing to it.  But
 perhaps the more verbose form is clearer to some people.


Thanks!


 * I think some comments needed in gin_trgm_triconsistent() about
 trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph()
 that way because trigramsMatchGraph() implements monotonous boolean
 function.


 I have a function-level comment that in all cases, GIN_TRUE is at least as
 favorable to inclusion of a tuple as GIN_MAYBE.  Should I reiterate that
 comment before trigramsMatchGraph() as well?  Calling it a monotonic
 boolean function is precise and concise, but probably less understandable
 to people reading the code.  At least, if I saw that, I would need to go do
 some reading before I knew what it meant.


Let's

Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-07-07 Thread Alexander Korotkov
 (offset 71 lines).
patching file src/include/nodes/plannodes.h
Hunk #1 succeeded at 586 (offset 43 lines).
patching file src/include/nodes/relation.h
Hunk #1 succeeded at 1045 (offset 14 lines).
Hunk #2 succeeded at 1422 (offset 14 lines).
patching file src/include/optimizer/cost.h
Hunk #1 succeeded at 114 (offset 1 line).
patching file src/include/optimizer/pathnode.h
Hunk #1 succeeded at 91 (offset 2 lines).
Hunk #2 succeeded at 104 (offset 2 lines).
Hunk #3 succeeded at 119 (offset 2 lines).
patching file src/include/optimizer/planmain.h
Hunk #1 succeeded at 124 (offset 2 lines).
patching file src/test/regress/expected/equivclass.out
patching file src/test/regress/expected/join.out
Hunk #1 succeeded at 2656 (offset 42 lines).
Hunk #2 succeeded at 3428 (offset 90 lines).
Hunk #3 succeeded at 4506 (offset 90 lines).
patching file src/test/regress/expected/rowsecurity.out
Hunk #1 succeeded at 274 (offset 26 lines).
patching file src/test/regress/expected/select_views.out
Hunk #1 succeeded at 1411 (offset 46 lines).
Hunk #2 succeeded at 1432 (offset 46 lines).
Hunk #3 succeeded at 1466 (offset 46 lines).
Hunk #4 succeeded at 1497 (offset 46 lines).
patching file src/test/regress/expected/select_views_1.out
Hunk #1 succeeded at 1411 (offset 46 lines).
Hunk #2 succeeded at 1432 (offset 46 lines).
Hunk #3 succeeded at 1466 (offset 46 lines).
Hunk #4 succeeded at 1497 (offset 46 lines).
patching file src/test/regress/sql/join.sql
Hunk #1 succeeded at 1344 (offset 37 lines).

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-09 Thread Alexander Korotkov
Hi!

On Wed, Jul 8, 2015 at 10:27 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 In dataPlaceToPageLeaf-function:

  if (append)
 {
 /*
  * Even when appending, trying to append more items than
 will fit is
  * not completely free, because we will merge the new
 items and old
  * items into an array below. In the best case, every new
 item fits in
  * a single byte, and we can use all the free space on
 the old page as
  * well as the new page. For simplicity, ignore segment
 overhead etc.
  */
 maxitems = Min(maxitems, freespace +
 GinDataPageMaxDataSize);
 }


 Hmm. So after splitting the page, there is freespace +
 GinDataPageMaxDataSize bytes available on both pages together. freespace
 has been adjusted with the fillfactor, but GinDataPageMaxDataSize is not.
 So this overshoots, because when leafRepackItems actually distributes the
 segments on the pages, it will fill both pages only up to the fillfactor.
 This is an upper bound, so that's harmless, it only leads to some
 unnecessary work in dealing with the item lists. But I think that should be:

 maxitems = Min(maxitems, freespace + leaf-maxdatasize);


Good catch! Fixed.



  else
 {
 /*
  * Calculate a conservative estimate of how many new
 items we can fit
  * on the two pages after splitting.
  *
  * We can use any remaining free space on the old page to
 store full
  * segments, as well as the new page. Each full-sized
 segment can hold
  * at least MinTuplesPerSegment items
  */
 int nnewsegments;

 nnewsegments = freespace / GinPostingListSegmentMaxSize;
 nnewsegments += GinDataPageMaxDataSize /
 GinPostingListSegmentMaxSize;
 maxitems = Min(maxitems, nnewsegments *
 MinTuplesPerSegment);
 }


 This branch makes the same mistake, but this is calculating a lower bound.
 It's important that maxitems is not set to higher value than what actually
 fits on the page, otherwise you can get an ERROR later in the function,
 when it turns out that not all the items actually fit on the page. The
 saving grace here is that this branch is never taken when building a new
 index, because index build inserts all the TIDs in order, but that seems
 pretty fragile. Should use leaf-maxdatasize instead of
 GinDataPageMaxDataSize here too.


Fixed.


 But that can lead to funny things, if fillfactor is small, and BLCKSZ is
 small too. With the minimums, BLCKSZ=1024 and fillfactor=0.2, the above
 formula will set nnewsegments to zero. That's not going to end up well. The
 problem is that maxdatasize becomes smaller than
 GinPostingListSegmentMaxSize, which is a problem. I think
 GinGetMaxDataSize() needs to make sure that the returned value is always >=
 GinPostingListSegmentMaxSize.


Fixed.


 Now that we have a fillfactor, shouldn't we replace the 75% heuristic
 later in that function, when inserting to the rightmost page rather than
 building it from scratch? In B-tree, the fillfactor is applied to that case
 too.


Sounds reasonable. Now it works like btree.
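For reference, the corrected estimates discussed above can be modeled roughly as follows (a hypothetical Python sketch with made-up constants, not the actual C code; `maxdatasize` stands for the fillfactor-adjusted leaf->maxdatasize):

```python
def max_data_size(blcksz, fillfactor, segment_max_size):
    """Fillfactor-adjusted usable space on a leaf data page, clamped so it
    never drops below one full posting-list segment (hypothetical model)."""
    raw = int(blcksz * fillfactor)
    return max(raw, segment_max_size)

def maxitems_append(maxitems, freespace, maxdatasize):
    # Upper bound for the append case: remaining free space on the old
    # page plus one new page, both already fillfactor-adjusted.
    return min(maxitems, freespace + maxdatasize)

def maxitems_split(maxitems, freespace, maxdatasize,
                   segment_max_size, min_tuples_per_segment):
    # Conservative lower bound for the non-append split case: count how
    # many full segments fit, each holding at least min_tuples_per_segment.
    nnewsegments = freespace // segment_max_size
    nnewsegments += maxdatasize // segment_max_size
    return min(maxitems, nnewsegments * min_tuples_per_segment)

# With BLCKSZ=1024 and fillfactor=0.2 the clamp keeps at least one segment,
# so nnewsegments can never end up zero:
assert max_data_size(1024, 0.2, segment_max_size=256) == 256
```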

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gin_fillfactor_6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension kNN support

2015-07-09 Thread Alexander Korotkov
Hi!

On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich stas.kelv...@gmail.com wrote:

 Patch is pretty ready, last issue was about changed extension interface,
 so there should be migration script and version bump.
 Attaching a version with all migration stuff.


I can't see cube--1.0--1.1.sql in the patch. Did you forget to include it?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-07-10 Thread Alexander Korotkov
On Fri, Jun 26, 2015 at 6:39 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jun 25, 2015 at 9:23 AM, Peter Eisentraut pete...@gmx.net wrote:
  On 6/22/15 1:37 PM, Robert Haas wrote:
  Currently, the only time we report a process as waiting is when it is
  waiting for a heavyweight lock.  I'd like to make that somewhat more
  fine-grained, by reporting the type of heavyweight lock it's awaiting
  (relation, relation extension, transaction, etc.).  Also, I'd like to
  report when we're waiting for a lwlock, and report either the specific
  fixed lwlock for which we are waiting, or else the type of lock (lock
  manager lock, buffer content lock, etc.) for locks of which there is
  more than one.  I'm less sure about this next part, but I think we
  might also want to report ourselves as waiting when we are doing an OS
  read or an OS write, because it's pretty common for people to think
  that a PostgreSQL bug is to blame when in fact it's the operating
  system that isn't servicing our I/O requests very quickly.
 
  Could that also cover waiting on network?

 Possibly.  My approach requires that the number of wait states be kept
 relatively small, ideally fitting in a single byte.  And it also
 requires that we insert pgstat_report_waiting() calls around the thing
 that is notionally blocking.  So, if there are a small number of
 places in the code where we do network I/O, we could stick those calls
 around those places, and this would work just fine.  But if a foreign
 data wrapper, or any other piece of code, does network I/O - or any
 other blocking operation - without calling pgstat_report_waiting(), we
 just won't know about it.


The idea of fitting wait information into a single byte, avoiding both
locking and atomic operations, is attractive.
But how far can we go with it?
Could a DBA draw any conclusion from querying pg_stat_activity once or twice?
In order to draw a conclusion about system load, one has to run a daemon or
background worker which continuously samples current wait events.
Sampling the current wait event at a high rate also imposes overhead on the
system, just as locking or atomic operations do.
Checking whether a backend is stuck isn't easy either. If you don't expose
how long the current wait event has been continuing, it's hard to distinguish
being stuck on a particular lock from high concurrency on that lock type.

I can propose the following:

1) Expose more information about the current wait to the user. For instance,
having the duration of the current wait event, the user can determine whether
a backend is stuck on a particular event without sampling.
2) Accumulate per-backend statistics about each wait event type: the number
of occurrences and the total duration. With these statistics the user can
identify system bottlenecks, again without sampling.

Number #2 will be provided as a separate patch.
Number #1 requires a different concurrency model. Ildus will extract it from
the waits monitoring patch shortly.
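A toy model of proposal #2 (hypothetical Python, just to show the bookkeeping, not an actual patch): each backend keeps, per wait event type, an occurrence counter and a total duration, and proposal #1's "how long has the current wait lasted" falls out of the same state:

```python
import time
from collections import defaultdict

class WaitStats:
    """Per-backend accumulator: occurrences and total duration per event type."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.totals = defaultdict(float)   # seconds
        self._current = None               # (event_type, start_time)

    def begin_wait(self, event_type):
        self._current = (event_type, time.monotonic())

    def end_wait(self):
        event_type, start = self._current
        self.counts[event_type] += 1
        self.totals[event_type] += time.monotonic() - start
        self._current = None

    def current_wait_duration(self):
        # Proposal #1: expose how long the current wait has lasted, so a
        # stuck backend is distinguishable from contention without sampling.
        if self._current is None:
            return None
        return time.monotonic() - self._current[1]

stats = WaitStats()
stats.begin_wait("LWLock:buffer_content")
stats.end_wait()
assert stats.counts["LWLock:buffer_content"] == 1
```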

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-27 Thread Alexander Korotkov
On Wed, Aug 26, 2015 at 7:20 PM, Alexander Korotkov 
a.korot...@postgrespro.ru wrote:

 On Wed, Aug 26, 2015 at 6:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
   OK. So, as we mentioned before, if we need to expose some of the AM
   parameters at the SQL level, then we need to write special functions
  which would
   call amhandler and expose them.
   Did we come to an agreement on this solution?

 I think we were agreed that we should write functions to expose whatever
 needs to be visible at SQL level.  I'm not sure that we had a consensus
 on exactly which things need to be visible.

 One thought here is that we might not want to just blindly duplicate
 the existing pg_am behavior anyway.  For example, the main use of the
 amstrategies column was to allow validation of pg_amop.amopstrategy
 entries --- but in 4 out of the 6 existing AMs, knowledge of the AM alone
 isn't sufficient information to determine the valid set of strategy
 numbers anyway.  So inventing a pg_amstrategies(am oid) function seems
 like it would be repeating a previous failure.  Not quite sure what to
 do instead, though.  We could imagine something like pg_amstrategies(am
 oid, opclass oid), but I don't see how to implement it without asking
 opclasses to provide a validation function, which maybe is a change we
 don't want to take on here.


 Could we add another function to the access method interface which would
 validate an opclass?
 The AM could validate not only strategies this way, but also support
 functions. For instance, in GIN we now require an opclass to specify at least
 one of consistent and triconsistent. ISTM it would be nice to let the access
 method check such conditions. Also, we would be able to check opclass
 correctness on creation. Currently one can create an opclass with missing
 support functions, which doesn't work.
 At the SQL level we can create a function which validates an opclass using
 this new method. This function can be used in regression tests.


Should I try to implement such new access method function, say 'amvalidate'?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-26 Thread Alexander Korotkov
On Wed, Aug 26, 2015 at 6:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
  OK. So, as we mentioned before, if we need to expose some of the AM
  parameters at the SQL level, then we need to write special functions
 which would
  call amhandler and expose them.
  Did we come to an agreement on this solution?

 I think we were agreed that we should write functions to expose whatever
 needs to be visible at SQL level.  I'm not sure that we had a consensus
 on exactly which things need to be visible.

 One thought here is that we might not want to just blindly duplicate
 the existing pg_am behavior anyway.  For example, the main use of the
 amstrategies column was to allow validation of pg_amop.amopstrategy
 entries --- but in 4 out of the 6 existing AMs, knowledge of the AM alone
 isn't sufficient information to determine the valid set of strategy
 numbers anyway.  So inventing a pg_amstrategies(am oid) function seems
 like it would be repeating a previous failure.  Not quite sure what to
 do instead, though.  We could imagine something like pg_amstrategies(am
 oid, opclass oid), but I don't see how to implement it without asking
 opclasses to provide a validation function, which maybe is a change we
 don't want to take on here.


Could we add another function to the access method interface which would
validate an opclass?
The AM could validate not only strategies this way, but also support
functions. For instance, in GIN we now require an opclass to specify at least
one of consistent and triconsistent. ISTM it would be nice to let the access
method check such conditions. Also, we would be able to check opclass
correctness on creation. Currently one can create an opclass with missing
support functions, which doesn't work.
At the SQL level we can create a function which validates an opclass using
this new method. This function can be used in regression tests.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-26 Thread Alexander Korotkov
On Tue, Aug 25, 2015 at 7:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Jim Nasby jim.na...@bluetreble.com writes:
  On 8/25/15 10:56 AM, Tom Lane wrote:
  I'm good with this as long as all the things that get stored in pg_am
  are things that pg_class.relam can legitimately reference.  If somebody
  proposed adding an access method kind that was not a relation access
  method, I'd probably push back on whether that should be in pg_am or
  someplace else.

  Would fields in pg_am be overloaded then?

 No, because the proposal was to reduce pg_am to just amname, amkind
 (which would be something like 'i' or 's'), and amhandler.  Everything
 specific to a particular type of access method would be shoved down to
 the level of the C APIs.


OK. So, as we mentioned before, if we need to expose some of the AM
parameters at the SQL level, then we need to write special functions which
would call amhandler and expose them.
Did we come to an agreement on this solution?

 From a SQL standpoint it'd be
  much nicer to have child tables, though that could potentially be faked
  with views.

 I've looked into having actual child tables in the system catalogs, and
 I'm afraid that the pain-to-reward ratio doesn't look very good.


Agreed. Teaching syscache about inheritance would be overengineering for this
problem.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Alexander Korotkov
On Mon, Aug 31, 2015 at 9:48 PM, Mason S <masonli...@gmail.com> wrote:

>
>>  We are also a bit disappointed by Huawei's position on the CSN patch; we
>> hoped to use it for our XTM.
>>
>
> Disappointed in what way? Moving to some sort of CSN approach seems to
> open things up for different future ideas. In the short term, it would mean
> replacing potentially large snapshots and longer visibility checks. In the
> long term, perhaps CSN could help simplify the design of multi-master
> replication schemes.
>

We are disappointed because at their PGCon talk Huawei announced that they
would publish their CSN patch and work further in this direction together
with the community. However, it still hasn't been published, despite all the
promises. Nobody from Huawei answers the CSN thread on -hackers.
So, I think we got nothing from Huawei except teasing and should rely only
on ourselves. That is disappointing.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Access method extendability

2015-08-31 Thread Alexander Korotkov
Hackers,

here is the next revision of the patches providing access method
extendability. It is now based on another patch, which reworks the access
method interface.
http://www.postgresql.org/message-id/capphfdsxwzmojm6dx+tjnpyk27kt4o7ri6x_4oswcbyu1rm...@mail.gmail.com

Besides the access method interface, the major change is the generic xlog
interface. The generic xlog interface is now more user-friendly: generic xlog
compares the initial and changed versions of a page by itself. The only thing
it can't do is find data moves inside a page, because that would have too
much overhead. So, in order to get compact WAL records, one should use
GenericXLogMemmove(dst, src, size) to move data inside a page. If this
function isn't used, the WAL records will just be less compact.

The general pattern of generic WAL usage is as follows.

1) Start using generic WAL: specify relation

GenericXLogStart(index);

2) Register buffers

GenericXLogRegister(0, buffer1, false);
GenericXLogRegister(1, buffer2, true);

The first argument is a slot number, the second is the buffer, and the third
is a flag indicating a new buffer.

3) Do changes in the pages. Use GenericXLogMemmove() if needed.

4) Finish using GenericXLogFinish(), or abort using GenericXLogAbort(). In
the case of abort, the initial state of the pages will be reverted.

Generic xlog takes care of the critical section, unlogged relations, setting
the LSN, and marking buffers dirty. User code stays simple and clear.
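The "compares the initial and changed versions of a page by itself" part can be modeled in a few lines (a hypothetical Python sketch of the idea, not the patch's C code): record only the byte runs that differ, and replay by patching the original image.

```python
def page_delta(old: bytes, new: bytes):
    """Compute a compact delta as (offset, replacement bytes) runs."""
    assert len(old) == len(new)
    delta, i = [], 0
    while i < len(old):
        if old[i] != new[i]:
            j = i
            while j < len(old) and old[j] != new[j]:
                j += 1                     # extend the differing run
            delta.append((i, new[i:j]))
            i = j
        else:
            i += 1
    return delta

def apply_delta(old: bytes, delta):
    """Replay: patch the original page image with the recorded runs."""
    page = bytearray(old)
    for off, data in delta:
        page[off:off + len(data)] = data
    return bytes(page)

old = b"AAAABBBBCCCC"
new = b"AAAAXXBBCCCC"
assert page_delta(old, new) == [(4, b"XX")]
assert apply_delta(old, page_delta(old, new)) == new
```

This also shows why a data move inside the page defeats a byte-wise diff: shifting data by even one byte makes almost every following byte differ, which is why the patch offers GenericXLogMemmove() for that case.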

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


create-am.2.patch.gz
Description: GNU Zip compressed data


generic-xlog.2.patch.gz
Description: GNU Zip compressed data


bloom-contrib.2.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Commitfest remaining "Needs Review" items

2015-08-31 Thread Alexander Korotkov
On Mon, Aug 31, 2015 at 4:31 PM, Andres Freund <and...@anarazel.de> wrote:

> On 2015-08-31 16:22:54 +0300, Alexander Korotkov wrote:
> > Is it correct to switch 2015-09 commitfest to inprogress now?
>
> Yea, isn't it only starting the 15th?


AFAICS, at the last developer meeting it was decided to start commitfests on
the 1st.
https://wiki.postgresql.org/wiki/PgCon_2015_Developer_Meeting#9.6_Schedule


> Can we add an option to display
> days in the CF app?
>

+1 for making these dates more explicit

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Commitfest remaining "Needs Review" items

2015-08-31 Thread Alexander Korotkov
On Mon, Aug 31, 2015 at 3:16 PM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Mon, Aug 31, 2015 at 9:13 PM, Magnus Hagander wrote:
> > Anyway - that CF is closed, but we seem to not have *any* CF open at this
> > point. Should we not make 2015-11 open?
>
> Yeah, switched.


Is it correct to switch the 2015-09 commitfest to "in progress" now?
I thought we should still accept patches into 2015-09, since it's 2015-08-31
now.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PROPOSAL] Effective storage of duplicates in B-tree index.

2015-09-01 Thread Alexander Korotkov
Hi, Tomas!

On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra <tomas.von...@2ndquadrant.com>
wrote:

> On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
>
>> I'm going to begin work on effective storage of duplicate keys in B-tree
>> index.
>> The main idea is to implement posting lists and posting trees for B-tree
>> index pages as it's already done for GIN.
>>
>> In a nutshell, effective storing of duplicates in GIN is organised as
>> follows.
>> Index stores single index tuple for each unique key. That index tuple
>> points to a posting list which contains pointers to heap tuples (TIDs). If
>> too many rows have the same key, multiple pages are allocated for the
>> TIDs, and these constitute a so-called posting tree.
>> You can find wonderful detailed descriptions in gin readme
>> <
>> https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README
>> >
>> and articles <http://www.cybertec.at/gin-just-an-index-type/>.
>> It also makes possible to apply compression algorithm to posting
>> list/tree and significantly decrease index size. Read more in
>> presentation (part 1)
>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
>>
>> Now new B-tree index tuple must be inserted for each table row that we
>> index.
>> It can possibly cause page split. Because of MVCC even unique index
>> could contain duplicates.
>> Storing duplicates in posting list/tree helps to avoid superfluous splits.
>>
>> So it seems to be very useful improvement. Of course it requires a lot
>> of changes in B-tree implementation, so I need approval from community.
>>
>
> In general, index size is often a serious issue - cases where indexes need
> more space than tables are not quite uncommon in my experience. So I think
> the efforts to lower space requirements for indexes are good.
>
> But if we introduce posting lists into btree indexes, how different are
> they from GIN? It seems to me that if I create a GIN index (using
> btree_gin), I do get mostly the same thing you propose, no?
>

Yes, in general GIN is a btree with effective duplicate handling plus support
for splitting single datums into multiple keys.
This proposal is mostly about porting duplicate handling from GIN to btree.
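The space saving can be illustrated with a toy model (hypothetical Python, just the idea of one index tuple per distinct key plus a TID list, not the real page layout):

```python
from collections import defaultdict

def naive_leaf(entries):
    """One (key, tid) index tuple per heap row, as in today's btree."""
    return [(key, tid) for key, tid in entries]

def posting_list_leaf(entries):
    """One index tuple per distinct key, pointing to a posting list of TIDs,
    as in GIN."""
    lists = defaultdict(list)
    for key, tid in entries:
        lists[key].append(tid)
    return {key: sorted(tids) for key, tids in lists.items()}

# 100 duplicates of "red" plus one "blue":
rows = [("red", (1, i)) for i in range(100)] + [("blue", (2, 1))]
assert len(naive_leaf(rows)) == 101          # 101 index tuples
assert len(posting_list_leaf(rows)) == 2     # 2 keys; duplicates share a list
```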

Sure, there are differences - GIN indexes don't handle UNIQUE indexes,


The difference between btree_gin and btree is not only the UNIQUE feature.
1) There is no gingettuple in GIN: GIN supports only bitmap scans. And it's
not feasible to add gingettuple to GIN, at least not with the same semantics
as in btree.
2) GIN doesn't support multicolumn indexes the way btree does. A multicolumn
GIN is more like a set of separate single-column GINs: it doesn't have
composite keys.
3) btree_gin can't handle range searches effectively. "a < x < b" would be
handled as "a < x" intersected with "x < b", which is extremely inefficient.
It is possible to fix, but there is no clear proposal yet for how to fit this
case into the GIN interface.


> but the compression can only be effective when there are duplicate rows.
> So either the index is not UNIQUE (so the b-tree feature is not needed), or
> there are many updates.
>

From my observations, users can use btree_gin only in some cases. They like
the compression, but mostly can't use btree_gin because of #1.

Which brings me to the other benefit of btree indexes - they are designed
> for high concurrency. How much is this going to be affected by introducing
> the posting lists?
>

I'd note that the current handling of duplicates in PostgreSQL is a hack over
the original btree. It is designed that way in PostgreSQL's btree access
method, not in btree in general.
Posting lists shouldn't change concurrency much. Currently, in btree you
have to lock one page exclusively when you're inserting a new value.
When a posting list is small and fits on one page, you have to do a similar
thing: exclusively lock one page to insert the new value.
When you have a posting tree, you have to take an exclusive lock on one page
of the posting tree.

One could say that concurrency would become worse because the index would
become smaller and the number of pages would become smaller too. Since the
number of pages is smaller, backends are more likely to contend for the same
page. But this argument can be used against any compression and for any
bloat.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-07 Thread Alexander Korotkov
On Thu, Aug 20, 2015 at 9:57 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Wed, Jul 22, 2015 at 4:28 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> On Wed, Jul 22, 2015 at 8:48 AM, Michael Paquier <
>> michael.paqu...@gmail.com> wrote
>>
>>> On Mon, Jul 20, 2015 at 9:18 PM, Alexander Korotkov
>>> <a.korot...@postgrespro.ru> wrote:
>>> > attached patch allows pg_rewind to work when target timeline was
>>> switched.
>>> > Actually, this patch fixes TODO from pg_rewind comments.
>>> >
>>> >   /*
>>> >* Trace the history backwards, until we hit the target timeline.
>>> >*
>>> >* TODO: This assumes that there are no timeline switches on the
>>> target
>>> >* cluster after the fork.
>>> >*/
>>> >
>>> > This patch allows pg_rewind to handle data directory synchronization
>>> > in a much more general way.
>>> > For instance, the user can return a promoted standby to the old master.
>>
>>
> +   /*
> +* Since incomplete segments are copied into next
> timelines, find the
> +* latest timeline holding the required segment.
> +*/
> +   while (private->tliIndex < targetNentries - 1 &&
> +  targetHistory[private->tliIndex].end <
> targetSegEnd)
> +   {
> +   private->tliIndex++;
> +   tli_index++;
> +   }
> It seems to me that the patch is able to handle timeline switches onwards,
> but not backwards and this is what would be required to return a promoted
> standby, that got switched to let's say timeline 2 when promoted, to an old
> master, that is still on timeline 1. This code actually fails when scanning
> for the last checkpoint before WAL forked as it will be on the timeline 1
> of the old master. Imagine for example that the WAL has forked at
> 0/30X which is saved in segment 00020003 (say 2/0/3) on
> the promoted standby, and that the last checkpoint record is on 0/20X,
> which is part of 00010002 (1/0/2). I think that we should
> scan 2/0/3 (not the partial segment 1/0/3), and then 1/0/2 when looking for
> the last checkpoint record. Hence the startup index TLI should be set to
> the highest timeline and should be decremented depending on what is in the
> history file.
> The code above looks correct to me when scanning the WAL history onwards
> though, which is what is done when extracting the page map, but not
> backwards when we try to find the last common checkpoint record. This code
> actually fails trying to open 2/0/2 that does not exist in the promoted
> standby's pg_xlog in my test case.
>
> Attached is a small script I have used to reproduce the failure.
>

Right, thanks! It should be fixed in the attached version of the patch.
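The onwards scan quoted above (the while loop over targetHistory) can be modeled like this (a hypothetical Python sketch of the segment-to-timeline selection, not the patch code):

```python
def tli_for_segment(history, tli_index, target_seg_end):
    """Walk the timeline history onwards to find the latest timeline that
    holds the required segment (incomplete segments are copied into the
    next timeline, so a later timeline wins)."""
    while (tli_index < len(history) - 1 and
           history[tli_index]["end"] < target_seg_end):
        tli_index += 1
    return tli_index

# Hypothetical history: timeline 1 ends at 0x300; timeline 2 is current
# (no end yet, modeled as infinity).
history = [{"tli": 1, "end": 0x300},
           {"tli": 2, "end": float("inf")}]

# A segment extending to 0x400 must be read from timeline 2 (segment
# 2/0/3 rather than the partial 1/0/3); one ending by 0x200 is on tli 1.
assert history[tli_for_segment(history, 0, 0x400)]["tli"] == 2
assert history[tli_for_segment(history, 0, 0x200)]["tli"] == 1
```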

I think that the documentation needs a brush up as well to outline the fact
> that pg_rewind would be able to put back as well a standby in a cluster,
> after for example an operator mistake when promoting a node that should not
> be.
>

Yes. But from the current pg_rewind documentation I can't figure out that it
can't do so. However, we can add an explicit statement claiming that it can.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-3.patch
Description: Binary data



Re: [HACKERS] WIP: Access method extendability

2015-09-07 Thread Alexander Korotkov
On Tue, Sep 1, 2015 at 6:50 PM, Teodor Sigaev <teo...@sigaev.ru> wrote:

> The general pattern of generic WAL usage is as follows.
>>
>> 1) Start using generic WAL: specify relation
>>
>
> M-m, what about extensions which want to use WAL but whose WAL records
> aren't connected to any relation? For example, a transaction manager or a
> kind of FDW.
>
>
>> GenericXLogStart(index);
>>
>> 2) Register buffers
>>
>> GenericXLogRegister(0, buffer1, false);
>> GenericXLogRegister(1, buffer2, true);
>>
>> first argument is a slot number, second is the buffer, third is flag
>> indicating
>> new buffer
>>
>
> Why do we need a slot number? to replace already registered buffer?


For further simplification, the slot number could be omitted. In the attached
patches, generic xlog replay applies changes in the same order in which
GenericXLogRegister() was called.
The patches were rebased against the latest version of the AM interface
rework patch.
http://www.postgresql.org/message-id/CAPpHfduGY=KZSBPZN5+USTXev-9M2PAUp3Yi=syfdo2n244...@mail.gmail.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


create-am.3.patch.gz
Description: GNU Zip compressed data


generic-xlog.3.patch.gz
Description: GNU Zip compressed data


bloom-contrib.3.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] WIP: Rework access method interface

2015-09-07 Thread Alexander Korotkov
On Mon, Sep 7, 2015 at 10:02 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:

> On 2015-09-07 20:56, Alexander Korotkov wrote:
>
>> On Mon, Sep 7, 2015 at 9:17 PM, Petr Jelinek <p...@2ndquadrant.com
>>
>> However I don't like the naming differences between validate_opclass
>> and amvalidate. If you expect that the current amvalidate will only
>> be used for opclass validation then it should be renamed accordingly.
>>
>>
>> I'm not yet sure if we need separate validation of opfamilies.
>>
>>
> Well either the amvalidate or the validate_opclass should be renamed IMHO,
> depending on which way the checking goes (one interface for everything with
> generic name or multiple interfaces for multiple validations).


Yes, I agree with you about the naming.
I'm not sure about separate validation of opfamilies, independent of its
naming. I'd like to get any arguments/advice about it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Waits monitoring

2015-09-07 Thread Alexander Korotkov
On Mon, Sep 7, 2015 at 6:28 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> On Sun, Sep 6, 2015 at 5:58 PM, Andres Freund <and...@anarazel.de> wrote:
> >
> > On 2015-09-04 23:44:21 +0100, Simon Riggs wrote:
> > > I see the need for both current wait information and for cumulative
> > > historical detail.
> > >
> > > I'm willing to wait before reviewing this, but not for more than 1
> more CF.
> > >
> > > Andres, please decide whether we should punt to next CF now, based upon
> > > other developments. Thanks
> >
> > I think we can do some of the work concurrently - the whole lwlock
> > infrastructure piece is rather independent and what currently most of
> > the arguments are about. I agree that the actual interface will need to
> > be coordinated.
> >
> > Ildus, could you please review Amit & Robert's patch?
> >
>
> Are you talking about the patch where I have fixed a few issues in Robert's
> patch [1], or the original patch [3] written by me?
>

AFAICS, Andres meant [3].

[1] -
> http://www.postgresql.org/message-id/caa4ek1kdec1tm5ya9gkv85vtn4qqsrxzkjru-tu70g_tl1x...@mail.gmail.com
> [2] -
> http://www.postgresql.org/message-id/3f71da37-a17b-4961-9908-016e6323e...@postgrespro.ru
> [3] -
> http://www.postgresql.org/message-id/caa4ek1kt2e6xhvigisr5o1ac9nfo0j2wtb8n0ggd1_jklde...@mail.gmail.com
>

For me, the major issues with [3] are:
1) We fit all information about the current wait event into a single byte.
2) We put this byte into PgBackendStatus.
Both of these issues are essential to the design of [3].

Issue #1 means that we can actually expose to the user only the event type
(as long as there are fewer than 256 types), with no parameters [4].
Issue #2 means that we can monitor only backends, not auxiliary processes
[5].

Both of these issues seem to be serious limitations to me. It makes me
think we either need a different design or need to additionally expose the
current wait event in some other way. Anyway, I think attracting more
attention to this problem is good. It would be very nice if Simon or Andres
could review this.

[4] -
http://www.postgresql.org/message-id/CAPpHfdtdF8LyR0zBA_tzAwYq00GFZyVbh_XfFAABRQQ=mbn...@mail.gmail.com
[5] - http://www.postgresql.org/message-id/55c3306b.5010...@postgrespro.ru

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-09-07 Thread Alexander Korotkov
On Mon, Sep 7, 2015 at 9:17 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:

> On 2015-09-04 16:26, Alexander Korotkov wrote:
>
>>
>> The attached patch implements this. It doesn't pretend to be a fully
>> correct implementation, but it should be enough as a proof of concept.
>> In this patch the access method exposes another function: amvalidate. It
>> takes a data structure representing an opclass and throws an error if it
>> finds it invalid.
>> This method is used on new opclass definition (alter operator family
>> etc. are not yet implemented, but planned). Also, there is an SQL function
>> validate_opclass(oid) which is used in regression tests.
>> Any thoughts?
>>
>>
> This is starting to look good
>

Thanks!


> However I don't like the naming differences between validate_opclass and
> amvalidate. If you expect that the current amvalidate will only be used for
> opclass validation then it should be renamed accordingly.
>

I'm not yet sure if we need separate validation of opfamilies.

Also GetIndexAmRoutine should check the return type of the amhandler.


Will be fixed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-08 Thread Alexander Korotkov
On Tue, Sep 8, 2015 at 10:28 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Tue, Sep 8, 2015 at 1:14 AM, Alexander Korotkov wrote:
> > On Thu, Aug 20, 2015 at 9:57 AM, Michael Paquier wrote:
> >> The code above looks correct to me when scanning the WAL history onwards
> >> though, which is what is done when extracting the page map, but not
> >> backwards when we try to find the last common checkpoint record. This
> code
> >> actually fails trying to open 2/0/2 that does not exist in the promoted
> >> standby's pg_xlog in my test case.
> >>
> >> Attached is a small script I have used to reproduce the failure.
> >
> >
> > Right, thanks! It should be fixed in the attached version of patch.
>
> So, instead of a code review, I have been torturing your patch and did
> advanced tests on it with the attached script, that creates a cluster
> as follows:
>  master (5432)
>   /\
>  1 (5433)   2 (5434)
>  |
>  3 (5435)
> Once cluster is created, nodes are promoted in a certain order giving
> them different timeline jump properties:
> - master, stays on tli 1
> - standby 1, tli 1->2
> - standby 2, tli 1->3
> - standby 3, tli 1->2->4
> And data is inserted on each of them to make WAL fork properly.
> Finally the script tries to rewind one node using another node as
> source, and then tries to link this target node back to the source
> node via streaming replication.
>
> I have tested all the one-one permutations possible in the structure
> above (see commented portions at the bottom of my script), and all of
> them worked. I have to say that from the testing prospective this
> patch looks in great shape, and will greatly improve the use cases of
> pg_rewind!
>

Great! Many thanks for your testing!

> I am planning to do as well a detailed code review rather soon.

Good, thanks.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Waits monitoring

2015-09-09 Thread Alexander Korotkov
On Tue, Sep 8, 2015 at 8:01 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Mon, Sep 7, 2015 at 7:33 AM, Ildus Kurbangaliev
> <i.kurbangal...@postgrespro.ru> wrote:
> >> > Ildus, could you please review Amit & Robert's patch?
> >> [1] -
> >>
> http://www.postgresql.org/message-id/caa4ek1kdec1tm5ya9gkv85vtn4qqsrxzkjru-tu70g_tl1x...@mail.gmail.com
> > About [1] and [2]. They are slightly conflicting, but if [1] is
> > going to be committed I can easily use it in [2]. And it will simplify
> > my patch, so I have no objections here. And the same time [3] can be
> > significantly simplified and improved on top of [1] and [2].
>
> Great - let's try to deal with [1] first, then.
>
> Does anyone wish to object to me committing that part?
>

No objections from me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-09 Thread Alexander Korotkov
On Wed, Sep 9, 2015 at 9:01 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

>
>
> On Wed, Sep 9, 2015 at 3:27 AM, Alexander Korotkov
> <a.korot...@postgrespro.ru> wrote:
>
>> On Tue, Sep 8, 2015 at 10:28 AM, Michael Paquier
>> <michael.paqu...@gmail.com>wrote:
>>
>>> I am planning to do as well a detailed code review rather soon.
>>>
>>
>> Good, thanks.
>>
>
> When testing a bit more complex structures, it occurred to me that we may
> want as well to use as a source node a standby. For example based on the
> case of my cluster above:
>  master (5432)
>   /  \
>  1 (5433)   2 (5434)
>  |
>  3 (5435)
> If I try for example to rebuild the cluster as follows there will be
> failures:
> 1) Rewind with source = 3, target = 2
> 2) Start 3 and 2
> 3) Shutdown 2
> 3) Rewind source = 2, target = 1, failure with:
> source data directory must be shut down cleanly
>
> It seems to me that we should allow a node that has been shutdowned in
> recovery to be used as a source for rewind as well, as follows:
> -   if (datadir_source && ControlFile_source.state != DB_SHUTDOWNED)
> +   if (datadir_source &&
> +   ControlFile_source.state != DB_SHUTDOWNED &&
> +   ControlFile_source.state != DB_SHUTDOWNED_IN_RECOVERY)
> pg_fatal("source data directory must be shut down
> cleanly\n");
> At least your patch justifies in my eyes such a change.
>

The source is not required to be out of recovery when we specify it by
connection string. This is why I think DB_SHUTDOWNED_IN_RECOVERY is OK when
we specify the source by data directory. Included in the attached patch.

> /*
> + * Find minimum from two xlog pointers assuming invalid pointer is
> greatest
> + * possible pointer.
> + */
> +static XLogRecPtr
> +xlPtrMin(XLogRecPtr a, XLogRecPtr b)
> +{
> +   if (XLogRecPtrIsInvalid(a))
> +   return b;
> +   else if (XLogRecPtrIsInvalid(b))
> +   return a;
> +   else
> +   return Min(a, b);
> +}
> This is true as timeline.h tells so, still I think that it would be better
> to mention that this is this assumption is held in this header file, or
> simply that timeline history entries at the top have their end position set
> as InvalidXLogRecPtr which is a synonym of infinity.
>

Agree. Comment is adjusted.


> The code building the target history file is a duplicate of what is done
> with the source. Perhaps we could refactor that as a single routine in
> pg_rewind.c?
>

Do you mean the duplication between rewind_parseTimeLineHistory()
and readTimeLineHistory()? We could move readTimeLineHistory() into common
code with some refactoring. That is a subject for a separate patch, though.

> Except that, the patch looks pretty neat to me. I was wondering as well:
> what tests did you run up to now with this patch? I am attaching an updated
> version of my test script I used for some more complex scenarios. Feel free
> to play with it.
>

I was running manual tests like a noob :) When you attached your bash
script, I switched to it.

I also added an additional check in findCommonAncestorTimeline(). Two standbys
could be independently promoted and get the same new timeline ID. Now it's
checked that timelines we assume to be the same have equal begin positions.
The begin positions could still match by coincidence, but the same risk
exists in the original pg_rewind.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Mon, Sep 14, 2015 at 2:12 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Mon, Sep 14, 2015 at 2:25 PM, Alexander Korotkov <aekorot...@gmail.com>
> wrote:
>
>> On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>
>>> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <
>>> i.kurbangal...@postgrespro.ru> wrote:
>>> >
>>> > On 08/05/2015 09:33 PM, Robert Haas wrote:
>>> >>
>>> >>
>>> >> You're missing the point.  Those multi-byte fields have additional
>>> >> synchronization requirements, as I explained in some detail in my
>>> >> previous email. You can't just wave that away.
>>> >
>>> > I see that now. Thank you for the point.
>>> >
>>> > I've looked deeper and I found PgBackendStatus to be not a suitable
>>> > place for keeping information about low level waits. Really,
>>> PgBackendStatus
>>> > is used to track high level information about backend. This is why
>>> auxiliary
>>> > processes don't have PgBackendStatus, because they don't have such
>>> information
>>> > to expose. But when we come to the low level wait events then auxiliary
>>> > processes are as useful for monitoring as backends are. WAL writer,
>>> > checkpointer, bgwriter etc are using LWLocks as well. This is
>>> certainly unclear
>>> > why they can't be monitored.
>>> >
>>>
>>> I think the chances of background processes stuck in LWLock is quite less
>>> as compare to backends as they do the activities periodically.  As an
>>> example
>>> WALWriter will take WALWriteLock to write the WAL, but actually there
>>> will never
>>> be any much contention for WALWriter. In synchronous_commit = on, the
>>> backends themselves write the WAL so WALWriter won't do much in that
>>> case and for synchronous_commit = off, backends won't write the WAL so
>>> WALWriter won't face any contention unless some buffers have to be
>>> written
>>> by bgwriter or checkpoint for which WAL is not flushed which I don't
>>> think
>>> would lead to any contention.
>>>
>>
>> Hmm, synchronous_commit is per session variable: some transactions could
>> run with synchronous_commit = on, but some with synchronous_commit = off.
>> This is very popular feature of PostgreSQL: achieve better performance by
>> making non-critical transaction asynchronous while leaving critical
>> transactions synchronous. Thus, contention for WALWriteLock between
>> backends and WALWriter could be real.
>>
>>
> I think it is difficult to say that can lead to contention due to periodic
> nature of WALWriter, but I don't deny that there is chance for
> background processes to have contention.
>

We don't know in advance whether there could be contention. This is why we
need monitoring.


> I am not denying from the fact that there could be some contention in rare
>>> scenarios for background processes, but I think tracking them is not as
>>> important as tracking the LWLocks for backends.
>>>
>>
>> I would be more careful in calling some of scenarios rare. As DBMS
>> developers we should do our best to evade contention for LWLocks: any
>> contention, not only between backends and background processes.  One may
>> assume that high LWLock contention is rare scenario in general. Once we're
>> here we doesn't think so, though.
>> You claims that there couldn't be contention for WALWriteLock between
>> backends and WALWriter. This is unclear for me: I think it could be.
>>
>
> I think there would be more things where background processes could wait
> than LWLocks and I think they are important to track, but could be done
> separately
> from tracking them for pg_stat_activity.  Example, we have a
> pg_stat_bgwriter
> view, can't we think of tracking bgwriter/checkpointer wait information in
> that
> view and similarly for other background processes we can track in other
> views
> if any related view exists or create a new one to track for all background
> processes.
>
>
>> Nobody opposes tracking wait events for backends and tracking them for
>> background processes. I think we need to track both in order to provide
>> full picture to DBA.
>>
>>
> Sure, that is good to do, but can't we do it separately in another patch.
> I think in this patch lets just work for wait_events for backends.
>

Yes, but I think we should have a design of tracking wait event for

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Mon, Sep 14, 2015 at 2:25 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Mon, Sep 14, 2015 at 3:02 PM, Alexander Korotkov <aekorot...@gmail.com>
> wrote:
>
>> On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>
>>> On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov <
>>> aekorot...@gmail.com> wrote:
>>> >
>>> > On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <
>>> i.kurbangal...@postgrespro.ru> wrote:
>>> >>
>>> >>
>>> >> I've looked deeper and I found PgBackendStatus to be not a suitable
>>> >> place for keeping information about low level waits. Really,
>>> PgBackendStatus
>>> >> is used to track high level information about backend. This is why
>>> auxiliary
>>> >> processes don't have PgBackendStatus, because they don't have such
>>> information
>>> >> to expose. But when we come to the low level wait events then
>>> auxiliary
>>> >> processes are as useful for monitoring as backends are. WAL writer,
>>> >> checkpointer, bgwriter etc are using LWLocks as well. This is
>>> certainly unclear
>>> >> why they can't be monitored.
>>> >>
>>> >> This is why I think we shoudn't place wait event into
>>> PgBackendStatus. It
>>> >> could be placed into PGPROC or even separate data structure with
>>> different
>>> >> concurrency model which would be most suitable for monitoring.
>>> >
>>> >
>>> > +1 for tracking wait events not only for backends
>>> >
>>> > Ildus, could you do following?
>>> > 1) Extract LWLocks refactoring into separate patch.
>>> > 2) Make a patch with storing current wait event information in PGPROC.
>>> >
>>>
>>> Now as Robert has committed standardization of lwlock names in
>>> commit - aa65de04, let us try to summarize and work on remaining parts
>>> of the patch. So I think now the next set of work is as follows:
>>>
>>> 1. Modify the tranche mechanism so that information about LWLocks
>>> can be tracked easily.  For this already there is some discussion, ideas
>>> and initial patch is floated in this thread and there doesn't seem to be
>>> much
>>> conflicts, so we can write the patch for it.  I am planning to write or
>>> modify
>>> the existing patch unless you, IIdus or anyone has objections or want to
>>> write it, please let me know to avoid duplication of work.
>>>
>>> 2. Track wait_event in pg_stat_activity.  Now here the main point where
>>> we doesn't seem to be in complete agreement is that shall we keep it
>>> as one byte in PgBackendStatus or use two or more bytes to store
>>> wait_event information and still there is another point made by you to
>>> store it in PGProc rather than in PgBackendStatus so that we can display
>>> information of background processes as well.
>>>
>>> Now as a matter of consensus, I think Tom has raised a fair point [1]
>>> against
>>> storing this information in PGProc and I feel that it is relatively less
>>> important to have such information about background processes and the
>>> reason for same is explained upthread [2].  About having more than
>>> one-byte
>>> to store information about various wait_events, I think now we will not
>>> have
>>> more than 100 or so events to track, do you really think that anytime in
>>> forseeable
>>> future we will have more than 256 important events which we would like
>>> to track?
>>>
>>> So I think about this lets first try to build the consensus and then
>>> attempt to
>>> write or modify the patch.
>>>
>>>
>>> [1] - http://www.postgresql.org/message-id/4067.1439561...@sss.pgh.pa.us
>>> [2] -
>>> http://www.postgresql.org/message-id/CAA4eK1JRQGTgRMRao6H65rA=fhifucnqhx7ongy58um6rn1...@mail.gmail.com
>>>
>>
>> In order to build the consensus we need the roadmap for waits monitoring.
>> Would single byte in PgBackendStatus be the only way for tracking wait
>> events? Could we have pluggable infrastructure in waits monitoring: for
>> instance, hooks for wait event begin and end?
>> Limit of 256 wait events is probably OK. But we have no room for any
>> additional parameters of wait event. For instance, if you notice high
>> contention for buffer content lock th

Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-16 Thread Alexander Korotkov
On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
> > On Wed, Sep 9, 2015 at 9:01 AM, Michael Paquier wrote:
> >> The code building the target history file is a duplicate of what is done
> >> with the source. Perhaps we could refactor that as a single routine in
> >> pg_rewind.c?
> >
> >
> > Do you mean duplication between rewind_parseTimeLineHistory() and
> > readTimeLineHistory(). We could put readTimeLineHistory() into common
> with
> > some refactoring. This is the subject of separate patch, though.
>
> Well, there is that :) But I was referring to this part (beware of
> useless newlines in your code btw):
> +   if (targettli == 1)
> {
> -   TimeLineHistoryEntry *entry = [i];
> +   targetHistory = (TimeLineHistoryEntry *)
> pg_malloc(sizeof(TimeLineHistoryEntry));
> +   targetHistory->tli = targettli;
> +   targetHistory->begin = targetHistory->end =
> InvalidXLogRecPtr;
> +   targetNentries = 1;
> +
> +   }
> Your patch generates a timeline history array for the target node, and
> HEAD does exactly the same for the source node, so IMO your patch is
> duplicating code, the only difference being that fetchFile is used for
> the source history file and slurpFile is used for the target history
> file. Up to now the code duplication caused by the difference that the
> target is always a local instance, and the source may be either local
> or remote has created the interface that we have now, but I think that
> we should refactor fetch.c such as the routines it contains do not
> rely anymore on datadir_source, and use instead a datadir argument. If
> datadir is NULL, the remote code path is used. If it is not NULL,
> local code path is used. This way we can make the target node use
> fetchFile only, and remove the duplication your patch is adding, as
> well as future ones like that. Thoughts?
>

OK, I get it. I tried to get rid of the code duplication in the attached patch.
There is a getTimelineHistory() function which takes the control file as an
argument and fetches the history from the appropriate path.


> >> Except that, the patch looks pretty neat to me. I was wondering as well:
> >> what tests did you run up to now with this patch? I am attaching an
> updated
> >> version of my test script I used for some more complex scenarios. Feel
> free
> >> to play with it.
> >
> > I was running manual tests like a noob :) When you attached you bash
> script,
> > I've switched to it.
>
> [product_placement]The patch to improve the recovery test suite
> submitted in this CF would have eased the need of bash-ing test cases
> here, and we could have test cases directly included in your
> patch.[/product_placement]
>
> > A also added additional check in findCommonAncestorTimeline(). Two
> standbys
> > could be independently promoted and get the same new timeline ID. Now,
> it's
> > checked that timelines, that we assume to be the same, have equal begins.
> > Begins could still match by coincidence. But the same risk exists in
> > original pg_rewind, though.
>
> Really? pg_rewind blocks attempts to rewind two nodes that are already
> on the same timeline before checking if their timeline history map at
> some point or not:
> /*
>  * If both clusters are already on the same timeline, there's
> nothing to
>  * do.
>  */
> if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
> ControlFile_source.checkPointCopy.ThisTimeLineID)
> pg_fatal("source and target cluster are on the same
> timeline\n");
> And this seems really justified to me. Imagine that you have one
> master, with two standbys linked to it. If both standbys are promoted
> to the same timeline, you could easily replug them to the master, but
> I fail to see the point to be able to replug one promoted standby with
> the other in this case: those nodes have segment and history files
> that map with each other, an operator could easily mess up things in
> such a configuration.
>

Imagine the following server configuration.
  1
 / \
2   3
|
4

Initially, node 1 is the master, and nodes 2 and 3 are standbys of node 1.
Node 4 is a cascaded standby of node 3.
Then nodes 2 and 3 are promoted; they both get timeline number 2. Then
node 4 is promoted and gets timeline number 3.
Then we could try to rewind node 4 using node 2 as the source. How could
pg_rewind know that timeline number 2 is not the same timeline on those nodes?
We can only check that those timelines forked from timeline 1 at the
same place, but a coincidence is still possible.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] statistics for array types

2015-09-16 Thread Alexander Korotkov
On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:

> On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
>
>> Hi,
>>
>> On 08/11/2015 04:38 PM, Jeff Janes wrote:
>>
>>> When reviewing some recent patches, I decided the statistics gathered
>>>  for arrays had some pre-existing shortcomings.
>>>
>>> The main one is that when the arrays contain rare elements there is
>>> no histogram to fall back upon when the MCE array is empty, the way
>>> there is for scalar stats.  So it has to punt completely and resort
>>> to saying that it is 0.5% selectivity without recourse to any data at
>>> all.
>>>
>>> The rationale for applying the threshold before things are eligible
>>> for inclusion in the MCE array seems to be that this puts some
>>> theoretical bound on the amount of error we are likely to have in
>>> that element.  But I think it is better to exceed that theoretical
>>> bound than it is to have no data at all.
>>>
>>> The attached patch forces there to be at least one element in MCE,
>>> keeping the one element with the highest predicted frequency if the
>>> MCE would otherwise be empty.  Then any other element queried for is
>>> assumed to be no more common than this most common element.
>>>
>>
>> We only really need the frequency, right? So do we really need to keep
>> the actual MCV element? I.e. most_common_elem_freqs does not have the
>> same number of values as most_common_elems anyway:
>>
>>   A list of the frequencies of the most common element values, i.e., the
>>   fraction of rows containing at least one instance of the given value.
>>   Two or three additional values follow the per-element frequencies;
>>   these are the minimum and maximum of the preceding per-element
>>   frequencies, and optionally the frequency of null elements.
>>   (Null when most_common_elems is.)
>>
>> So we might modify it so that it's always defined - either it tracks the
>> same values as today (when most_common_elems is defined), or the
>> frequency of the most common element (when most_common_elems is NULL).
>>
>
> I had also considered that.  It requires more changes to make it happen,
> and it seems to create a more complex contract on what those columns mean,
> but without giving a corresponding benefit.
>
>
>>
>> This way we can keep the current theoretical error-bound on the MCE
>> frequencies, and if that's not possible we can have at least the new
>> value without confusing existing code.
>
>
> But if the frequency of the most common element was grossly wrongly, then
> whatever value we stick in there is still going to be grossly wrong.
> Removing the value associated with it isn't going to stop it from being
> wrong.  When we do query with the (incorrectly thought) first most common
> element, either it will find and use the wrong value from slot 1, or it
> will find nothing and fall back on the same wrong value from slot 3.
>

Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we
collect no MCEs. Moreover, I think we should store it even when num_mcelem
>= track_len and we haven't cut any of the MCEs we found. In this case we can
get a more precise estimate for a rare element using the knowledge that all
MCEs which exceed the threshold are present (assuming their frequencies could
be much higher than the threshold).

When there are no MCEs, we should use the assumption that there are no
elements more frequent than cutoff_freq / nonnull_cnt. Using lower values
wouldn't be statistically correct.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-16 Thread Alexander Korotkov
On Mon, Sep 14, 2015 at 3:03 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Mon, Sep 14, 2015 at 5:32 AM, Alexander Korotkov
> <aekorot...@gmail.com> wrote:
> > In order to build the consensus we need the roadmap for waits monitoring.
> > Would single byte in PgBackendStatus be the only way for tracking wait
> > events? Could we have pluggable infrastructure in waits monitoring: for
> > instance, hooks for wait event begin and end?
>
> No, it's not the only way of doing it.  I proposed doing that way
> because it's simple and cheap, but I'm not hell-bent on it.  My basic
> concern here is about the cost of this.  I think that the most data we
> can report without some kind of synchronization protocol is one 4-byte
> integer.  If we want to report anything more than that, we're going to
> need something like the st_changecount protocol, or a lock, and that's
> going to add very significantly - and in my view unacceptably - to the
> cost.  I care very much about having this facility be something that
> we can use in lots of places, even extremely frequent operations like
> buffer reads and contended lwlock acquisition.
>

Yes, the major question is cost. But I think we should validate our
thoughts by experiments, given that there is more than one possible
synchronization protocol. Ildus posted an implementation of the
double-buffering approach that showed quite low cost.

> I think that there may be some *kinds of waits* for which it's
> practical to report additional detail.  For example, suppose that when
> a heavyweight lock wait first happens, we just report the lock type
> (relation, tuple, etc.) but then when the deadlock detector expires,
> if we're still waiting, we report the entire lock tag.  Well, that's
> going to happen infrequently enough, and is expensive enough anyway,
> that the cost doesn't matter.  But if, every time we read a disk
> block, we take a lock (or bump a changecount and do a write barrier),
> dump the whole block tag in there, release the lock (or do another
> write barrier and bump the changecount again) that sounds kind of
> expensive to me.  Maybe we can prove that it doesn't matter on any
> workload, but I doubt it.  We're fighting for every cycle in some of
> these code paths, and there's good evidence that we're burning too
> many of them compared to competing products already.
>

Yes, but some competing products also provide comprehensive waits
monitoring. That makes me think it should be possible for us as well.

I am not a big fan of hooks as a way of resolving disagreements about
> the design.  We may find that there are places where it's useful to
> have hooks so that different extensions can do different things, and
> that is fine.  But we shouldn't use that as a way of punting the
> difficult questions.  There isn't enough common understanding here of
> what we're all trying to get done and why we're trying to do it in
> particular ways rather than in other ways to jump to the conclusion
> that a hook is the right answer.  I'd prefer to have a nice, built-in
> system that everyone agrees represents a good set of trade-offs than
> an extensible system.
>

I think the reason for hooks could be not only disagreements about design,
but platform-dependent issues too.
The next step after we have a view with current wait events will be gathering
statistics on them. We can contrast at least two approaches here:
1) Periodic sampling of current wait events.
2) Measuring each wait event's duration. We could collect statistics for a
short period locally and update the shared memory structure periodically
(using some synchronization protocol).

In the previous attempt to gather lwlock statistics, you predicted that
sampling could have a significant overhead [1]. In contrast, on many
systems time measurements are cheap. We have implemented both approaches,
and it shows that sampling every 1 millisecond produces higher overhead
than individual duration measurements for each wait event. We can share
another version of waits monitoring based on sampling to make these results
reproducible for everybody. However, cheap time measurements are not available
on every platform. For instance, ISTM that on Windows time measurements
are too expensive [2].

That makes me think that we need a pluggable solution, at least for
statistics: direct measurement of event durations for the majority of systems,
and sampling for the others as the least harm.

> I think it's reasonable to consider reporting this data in the PGPROC
> using a 4-byte integer rather than reporting it through a singe byte
> in the backend status structure.  I believe that addresses the
> concerns about reporting from auxiliary processes, and it also allows
> a little more data to be reported.  For anything in excess of that, I
> think we should think rather harder.  Most likely, such addition

[HACKERS] LW_SHARED_MASK macro

2015-09-17 Thread Alexander Korotkov
Hackers,

While exploring lwlock.c, I found the following macro to be strange.

#define LW_SHARED_MASK ((uint32)(1 << 23))

This macro is used to extract the number of shared locks from the state.

ereport(LOG,
(errhidestmt(true),
errhidecontext(true),
errmsg("%d: %s(%s): excl %u shared %u haswaiters %u waiters %u rOK %d",
MyProcPid,
where, MainLWLockNames[id],
!!(state & LW_VAL_EXCLUSIVE),
state & LW_SHARED_MASK,
!!(state & LW_FLAG_HAS_WAITERS),
pg_atomic_read_u32(&lock->nwaiters),
!!(state & LW_FLAG_RELEASE_OK))));


Should it be ((uint32) ((1 << 24)-1)) instead?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


lw_shared_mask.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <
> i.kurbangal...@postgrespro.ru> wrote:
> >
> > On 08/05/2015 09:33 PM, Robert Haas wrote:
> >>
> >>
> >> You're missing the point.  Those multi-byte fields have additional
> >> synchronization requirements, as I explained in some detail in my
> >> previous email. You can't just wave that away.
> >
> > I see that now. Thank you for the point.
> >
> > I've looked deeper and I found PgBackendStatus to be not a suitable
> > place for keeping information about low level waits. Really,
> PgBackendStatus
> > is used to track high level information about backend. This is why
> auxiliary
> > processes don't have PgBackendStatus, because they don't have such
> information
> > to expose. But when we come to the low level wait events then auxiliary
> > processes are as useful for monitoring as backends are. WAL writer,
> > checkpointer, bgwriter etc are using LWLocks as well. This is certainly
> unclear
> > why they can't be monitored.
> >
>
> I think the chances of background processes stuck in LWLock is quite less
> as compare to backends as they do the activities periodically.  As an
> example
> WALWriter will take WALWriteLock to write the WAL, but actually there will
> never
> be any much contention for WALWriter. In synchronous_commit = on, the
> backends themselves write the WAL so WALWriter won't do much in that
> case and for synchronous_commit = off, backends won't write the WAL so
> WALWriter won't face any contention unless some buffers have to be written
> by bgwriter or checkpoint for which WAL is not flushed which I don't think
> would lead to any contention.
>

Hmm, synchronous_commit is a per-session variable: some transactions could
run with synchronous_commit = on, but some with synchronous_commit = off.
This is a very popular feature of PostgreSQL: achieve better performance by
making non-critical transactions asynchronous while leaving critical
transactions synchronous. Thus, contention for WALWriteLock between
backends and WALWriter could be real.

I am not denying from the fact that there could be some contention in rare
> scenarios for background processes, but I think tracking them is not as
> important as tracking the LWLocks for backends.
>

I would be more careful about calling some of these scenarios rare. As DBMS
developers we should do our best to avoid contention for LWLocks: any
contention, not only between backends and background processes. One may
assume that high LWLock contention is a rare scenario in general; now that
we're here, we don't think so, though.
You claim that there couldn't be contention for WALWriteLock between
backends and WALWriter. This is unclear to me: I think there could be. Nobody
opposes tracking wait events for backends and tracking them for background
processes. I think we need to track both in order to provide the full picture
to the DBA.

Also as we are planning to track the wait_event information in
> pg_stat_activity
> along with other backends information, it will not make sense to include
> information about backend processes in this variable as pg_stat_activity
> just displays information of backend processes.
>

I'm not objecting to tracking only backend information in
pg_stat_activity. But I think we should also have some other way of tracking
wait events for background processes. We should think this through before
extending pg_stat_activity, to avoid design issues later.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov <aekorot...@gmail.com>
> wrote:
> >
> > On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <
> i.kurbangal...@postgrespro.ru> wrote:
> >>
> >>
> >> I've looked deeper and I found PgBackendStatus to be not a suitable
> >> place for keeping information about low level waits. Really,
> PgBackendStatus
> >> is used to track high level information about backend. This is why
> auxiliary
> >> processes don't have PgBackendStatus, because they don't have such
> information
> >> to expose. But when we come to the low level wait events then auxiliary
> >> processes are as useful for monitoring as backends are. WAL writer,
> >> checkpointer, bgwriter etc are using LWLocks as well. This is certainly
> unclear
> >> why they can't be monitored.
> >>
> >> This is why I think we shoudn't place wait event into PgBackendStatus.
> It
> >> could be placed into PGPROC or even separate data structure with
> different
> >> concurrency model which would be most suitable for monitoring.
> >
> >
> > +1 for tracking wait events not only for backends
> >
> > Ildus, could you do following?
> > 1) Extract LWLocks refactoring into separate patch.
> > 2) Make a patch with storing current wait event information in PGPROC.
> >
>
> Now as Robert has committed standardization of lwlock names in
> commit - aa65de04, let us try to summarize and work on remaining parts
> of the patch. So I think now the next set of work is as follows:
>
> 1. Modify the tranche mechanism so that information about LWLocks
> can be tracked easily.  For this already there is some discussion, ideas
> and initial patch is floated in this thread and there doesn't seem to be
> much
> conflicts, so we can write the patch for it.  I am planning to write or
> modify
> the existing patch unless you, IIdus or anyone has objections or want to
> write it, please let me know to avoid duplication of work.
>
> 2. Track wait_event in pg_stat_activity.  Now here the main point where
> we doesn't seem to be in complete agreement is that shall we keep it
> as one byte in PgBackendStatus or use two or more bytes to store
> wait_event information and still there is another point made by you to
> store it in PGProc rather than in PgBackendStatus so that we can display
> information of background processes as well.
>
> Now as a matter of consensus, I think Tom has raised a fair point [1]
> against
> storing this information in PGProc and I feel that it is relatively less
> important to have such information about background processes and the
> reason for same is explained upthread [2].  About having more than one-byte
> to store information about various wait_events, I think now we will not
> have
> more than 100 or so events to track, do you really think that anytime in
> forseeable
> future we will have more than 256 important events which we would like to
> track?
>
> So I think about this lets first try to build the consensus and then
> attempt to
> write or modify the patch.
>
>
> [1] - http://www.postgresql.org/message-id/4067.1439561...@sss.pgh.pa.us
> [2] -
> http://www.postgresql.org/message-id/CAA4eK1JRQGTgRMRao6H65rA=fhifucnqhx7ongy58um6rn1...@mail.gmail.com
>

In order to build consensus we need a roadmap for waits monitoring.
Would a single byte in PgBackendStatus be the only way of tracking wait
events? Could we have a pluggable infrastructure for waits monitoring: for
instance, hooks for wait event begin and end?
A limit of 256 wait events is probably OK. But we have no room for any
additional parameters of a wait event. For instance, if you notice high
contention on the buffer content lock, then it would be nice to figure out:
how many blocks are involved, and which ones? We need to track additional
parameters in order to answer this question.
Comments about tracking only backends are in [1] and [2].
PGPROC is not the only way to track events with parameters for every
process. We could try to extend PgBackendStatus or build a separate
data structure.

However, this depends on our general roadmap. If pg_stat_activity is
just the default, while full-featured waits monitoring could be a plugin,
then it's probably OK.
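To make the pluggable idea concrete, a hook-based design could look roughly like the sketch below. All names here (wait_event_begin_hook and so on) are assumptions invented for illustration, not an existing PostgreSQL API; the extra parameter stands for event-specific detail such as a block number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical hook points for pluggable waits monitoring.  None of these
 * names exist in PostgreSQL; they only illustrate the proposed shape.
 */
typedef void (*wait_event_begin_hook_type) (uint32_t class_id,
                                            uint32_t event_id,
                                            uint32_t extra);
typedef void (*wait_event_end_hook_type) (uint32_t class_id,
                                          uint32_t event_id);

static wait_event_begin_hook_type wait_event_begin_hook = NULL;
static wait_event_end_hook_type wait_event_end_hook = NULL;

/* Core code would call these around every potentially blocking operation. */
static void
report_wait_begin(uint32_t class_id, uint32_t event_id, uint32_t extra)
{
    if (wait_event_begin_hook)
        wait_event_begin_hook(class_id, event_id, extra);
}

static void
report_wait_end(uint32_t class_id, uint32_t event_id)
{
    if (wait_event_end_hook)
        wait_event_end_hook(class_id, event_id);
}

/* A toy monitoring "plugin" that just counts the events it sees. */
static int observed_events = 0;

static void
count_begin(uint32_t class_id, uint32_t event_id, uint32_t extra)
{
    (void) class_id; (void) event_id; (void) extra;
    observed_events++;
}

static void
count_end(uint32_t class_id, uint32_t event_id)
{
    (void) class_id; (void) event_id;
    observed_events++;
}

/* Simulate one monitored wait; returns how many hook calls were observed. */
int
simulate_monitored_wait(void)
{
    wait_event_begin_hook = count_begin;
    wait_event_end_hook = count_end;

    report_wait_begin(1 /* e.g. LWLock class */, 42 /* event id */, 7);
    /* ...the actual blocking wait would happen here... */
    report_wait_end(1, 42);

    return observed_events;
}
```

With hooks like these, the one-byte pg_stat_activity column could remain the in-core default, while an extension installs richer tracking (with parameters, for all process types) without core changes.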

[1] -
http://www.postgresql.org/message-id/capphfdsnju40qud2sqjs_iu+z-jm-b_f39foc7eydjvza5e...@mail.gmail.com
[2] -
http://www.postgresql.org/message-id/9a99c2a7-1760-419f-bdc9-a2cf99ecd...@simply.name

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-09-20 Thread Alexander Korotkov
On Sun, Sep 20, 2015 at 5:02 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:

> On 2015-09-18 14:58, Alexander Korotkov wrote:
>
>> On Wed, Sep 16, 2015 at 8:44 PM, Teodor Sigaev <teo...@sigaev.ru
>> <mailto:teo...@sigaev.ru>> wrote:
>>
>> validate_opclass was renamed to amvalidate.
>>
>>
>> It seems to me, that amvalidate method of AM should get as argument
>> only Oid of operator family. Layout and meaning of amproc/amop
>> fields are differ for different AM and there isn't an AM which
>> implements all possible features.
>>
>> After, further personal discussion with Teodor, we decided that
>> amvalidate is out of scope for this patch.
>> It's not evident what should we validate in amvalidate and which way. I
>> think if we need amvalidate it should be subject of separate patch.
>> The attached patch exposes index access method parameters to SQL using
>> following fucntions:
>>   * get_am_param_oid(oid, text)
>>   * get_am_param_int(oid, text)
>>   * get_am_param_bool(oid, text)
>>
>>
> Hmm, we might want these functons in any case (although I think just one
> function which would return all am params would be better).
>
> But why is it not evident? We do the validations in regression tests, even
> if we just copy those then it's enough for a start
>

The reason is that, so far, those validations have been used only in
regression tests. They weren't applied to user-defined operator classes. A
user might define an invalid opclass and then alter it into a valid one, or
upgrade an opclass in two steps where the intermediate step is invalid. This
is why I think validating opclasses in the CREATE/ALTER commands is not an
obvious choice: it changes user-visible behavior and compatibility.
At the same time, implementing a new API function just for regression tests
doesn't seem worth it to me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-09-20 Thread Alexander Korotkov
On Sun, Sep 20, 2015 at 5:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Petr Jelinek <p...@2ndquadrant.com> writes:
> > On 2015-09-18 14:58, Alexander Korotkov wrote:
> >> After, further personal discussion with Teodor, we decided that
> >> amvalidate is out of scope for this patch.
> >> It's not evident what should we validate in amvalidate and which way. I
> >> think if we need amvalidate it should be subject of separate patch.
>
> > But why is it not evident? We do the validations in regression tests,
> > even if we just copy those then it's enough for a start.
>
> I think the main reason this question is in-scope for this patch is
> precisely the problem of what do we do about the regression tests.
>
> I'm not in favor of exposing some SQL-level functions whose sole purpose
> is to support those regression test queries, because while those queries
> are very useful for detecting errors in handmade opclasses, they're hacks,
> and always have been.  They don't work well (or at all, really) for
> anything more than btree/hash cases.  It'd be better to expose amvalidate
> functions, even if we don't yet have full infrastructure for them.
>

I'm OK with continuing work on amvalidate if we can build consensus on
its design.
Could you please give some feedback on the amvalidate version of the patch?
http://www.postgresql.org/message-id/capphfds8zywenz9vw6te5rzxbol1vu_wsw181veq+mu+v1d...@mail.gmail.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-24 Thread Alexander Korotkov
On Thu, Sep 24, 2015 at 6:32 PM, Andres Freund <and...@anarazel.de> wrote:

> On 2015-09-15 20:16:10 +0300, YUriy Zhuravlev wrote:
> > We will be tested.
>
> Did you have a chance to run some benchmarks?
>

Yes, we now have a 60-physical-core Intel server, and we're running
benchmarks on it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-09-25 Thread Alexander Korotkov
On Fri, Sep 25, 2015 at 6:25 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Petr Jelinek <p...@2ndquadrant.com> writes:
> > On 2015-09-25 16:11, Teodor Sigaev wrote:
> >> In attach a bit modified patch based on 4-th version, and if there are
> >> no strong objections, I will commit it. Waiting this patch stops
> >> Alexander's development on CREATE ACCESS METHOD and he needs to move
> >> forward.
>
> > The code itself looks ok to me, but before we can think about committing
> > this the documentation has to be updated.
>
> Yes.  Also, for a major change like this, it would be a good thing if
> the documentation got a review from a native English speaker.  I will
> volunteer to handle it if someone else does the first draft.
>

Great! I'll write this documentation.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-18 Thread Alexander Korotkov
On Wed, Sep 16, 2015 at 7:47 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier <
> michael.paqu...@gmail.com> wrote:
>
>> On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
>
> > A also added additional check in findCommonAncestorTimeline(). Two
>> standbys
>> > could be independently promoted and get the same new timeline ID. Now,
>> it's
>> > checked that timelines, that we assume to be the same, have equal
>> begins.
>> > Begins could still match by coincidence. But the same risk exists in
>> > original pg_rewind, though.
>>
>> Really? pg_rewind blocks attempts to rewind two nodes that are already
>> on the same timeline before checking if their timeline history map at
>> some point or not:
>> /*
>>  * If both clusters are already on the same timeline, there's
>> nothing to
>>  * do.
>>  */
>> if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
>> ControlFile_source.checkPointCopy.ThisTimeLineID)
>> pg_fatal("source and target cluster are on the same
>> timeline\n");
>> And this seems really justified to me. Imagine that you have one
>> master, with two standbys linked to it. If both standbys are promoted
>> to the same timeline, you could easily replug them to the master, but
>> I fail to see the point to be able to replug one promoted standby with
>> the other in this case: those nodes have segment and history files
>> that map with each other, an operator could easily mess up things in
>> such a configuration.
>>
>
> Imagine following configuration of server.
>   1
>  / \
> 2   3
> |
> 4
>
> Initially, node 1 is master, nodes 2 and 3 are standbys for node 1. node 4
> is cascaded standby for node 3.
> Then node 2 and node 3 are promoted. They both got timeline number 2. Then
> node 3 is promoted and gets timeline number 3.
> Then we could try to rewind node 4 with node 2 as source. How pg_rewind
> could know that timeline number 2 for those nodes is not the same?
> We can only check that those timelines are forked from timeline 1 at the
> same place. But coincidence is still possible.
>

BTW, an option would be to generate a system_identifier for each new
timeline, by analogy with what initdb does for the whole WAL.
Having such system identifiers, we could distinguish different timelines
that have been assigned the same ID.
Any thoughts?
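For illustration, a per-timeline identifier could be derived the way initdb derives the cluster-wide system identifier, from the current time and process id. The exact bit layout below is an assumption for the sketch, not a proposal detail.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Sketch: derive a 64-bit identifier for a newly created timeline, in the
 * spirit of how initdb computes the cluster-wide system identifier.
 * The exact bit layout here is illustrative only.
 */
uint64_t
generate_timeline_identifier(void)
{
    struct timeval tv;
    uint64_t id;

    gettimeofday(&tv, NULL);
    id = (uint64_t) tv.tv_sec << 32;    /* seconds in the high half */
    id |= (uint64_t) tv.tv_usec << 12;  /* microseconds in the middle */
    id |= (uint64_t) getpid() & 0xFFF;  /* low bits from the pid */
    return id;
}
```

Two standbys promoted independently would then get the same timeline number but (almost certainly) different identifiers, so pg_rewind could detect the mismatch instead of comparing timeline begin points.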

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-12-08 Thread Alexander Korotkov
On Tue, Dec 8, 2015 at 1:04 PM, Andres Freund <and...@anarazel.de> wrote:

> On 2015-12-08 12:53:49 +0300, Alexander Korotkov wrote:
> > ​This is why atomic increment *could be* cheaper than loop over CAS and,
> it
> > worth having experiments. ​Another idea is that we can put arbitrary
> logic
> > between lwarx and stwcx. Thus, we can implement PinBuffer using single
> loop
> > of lwarx and stwcx which could be better than loop of CAS.
>
> You can't really put that much between an ll/sc - the hardware is only
> able to track a very limited number of cacheline references.
>

I have some doubts about this, but I didn't find a place where it's
explicitly documented. In the case of LWLockAttemptLock there isn't very
much between the lwarx/stwcx: 4 instructions, while CAS has 2 instructions.
Could you please share a link to the docs, if any?


> > 3) lwlock-increment.patch – LWLockAttemptLock change state using atomic
> > increment operation instead of loop of CAS. This patch does it for
> > LWLockAttemptLock like pinunpin-increment.patch does for PinBuffer.
> > Actually, this patch is not directly related to buffer manager. However,
> > it's nice to test loop of CAS vs atomic increment in different places.
>
> Yea, that's a worthwhile improvement. Actually it's how the first
> versions of the lwlock patches worked - unfortunately I couldn't see big
> differences on hardware I had available at the time.
>
> There's some more trickyness required than what you have in your patch
> (afaics at least). The problem is that when you 'optimistically'
> increment by LW_VAL_SHARED and notice that there actually was another
> locker, you possibly, until you've 'fixed' the state, are blocking new
> exclusive lockers from acquiring the locks.  So you additionally need to
> do special handling in these cases, and check the queue more.
>

Agreed. This patch needs to be carefully verified. Current experiments just
show that this is a promising direction for improvement. I'll come back with
a better version of this patch.

Also, after testing on large machines I have another observation to share.
Currently, LWLock doesn't guarantee that an exclusive lock will ever be
acquired (assuming each shared lock duration is finite). That's because when
there is no exclusive lock, new shared locks aren't queued and the LWLock
state is changed directly. Thus, a process which tries to acquire an
exclusive lock has to wait for a gap in the shared locks. But with high
concurrency on the shared lock, such a gap could occur very rarely, or never.

We did see this on a big Intel machine in practice. pgbench -S takes the
shared ProcArrayLock very frequently. Once some number of connections is
reached, new connections hang on acquiring the exclusive ProcArrayLock. I
think we could do some workaround for this problem. For instance, when an
exclusive lock waiter has waited past some timeout, it could set a special
bit which prevents others from taking new shared locks.
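For reference, the single-atomic-increment shared-lock acquisition discussed in this thread can be sketched as below. This is a simplified standalone model, not lwlock.c: the constants, the state layout, and the omission of all wait-queue handling are assumptions, and the undo window in the conflict path is exactly where the extra care Andres mentioned is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Standalone model of the "single atomic increment" shared-lock path.
 * The constants and state layout are simplified assumptions.
 */
#define LW_VAL_SHARED     1u
#define LW_VAL_EXCLUSIVE  (1u << 24)   /* marks an exclusive holder */

typedef struct LWLockSketch
{
    _Atomic uint32_t state;
} LWLockSketch;

/* Try to take the lock in shared mode with one fetch-add; true on success. */
bool
try_shared_lock(LWLockSketch *lock)
{
    uint32_t oldstate = atomic_fetch_add(&lock->state, LW_VAL_SHARED);

    if (oldstate & LW_VAL_EXCLUSIVE)
    {
        /*
         * An exclusive holder is present: undo our optimistic increment.
         * Until the subtraction lands, this transient increment can
         * confuse new exclusive lockers, so a real implementation needs
         * extra handling and queue rechecks here.
         */
        atomic_fetch_sub(&lock->state, LW_VAL_SHARED);
        return false;
    }
    return true;
}
```

The fast path is one fetch-add instead of a CAS loop, which is the hoped-for win under heavy shared-lock traffic; the slow path is where the correctness work lives.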

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-12-09 Thread Alexander Korotkov
On Tue, Dec 8, 2015 at 6:00 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> On Tue, Dec 8, 2015 at 3:56 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>>
>> ​Agree. This patch need to be carefully verified. Current experiments
>> just show that it is promising direction for improvement. I'll come with
>> better version of this patch.
>>
>> Also, after testing on large machines I have another observation to
>> share. For now, LWLock doesn't guarantee that exclusive lock would be ever
>> acquired (assuming each shared lock duration is finite). It because when
>> there is no exclusive lock, new shared locks aren't queued and LWLock state
>> is changed directly. Thus, process which tries to acquire exclusive lock
>> have to wait for gap in shared locks.
>>
>
> I think this has the potential to starve exclusive lockers in worst case.
>
>
>> But with high concurrency for shared lock that could happen very rare,
>> say never.
>>
>> We did see this on big Intel machine in practice. pgbench -S gets shared
>> ProcArrayLock very frequently. Since some number of connections is
>> achieved, new connections hangs on getting exclusive ProcArrayLock. I think
>> we could do some workaround for this problem. For instance, when exclusive
>> lock waiter have some timeout it could set some special bit which prevents
>> others to get new shared locks.
>>
>>
> I think timeout based solution would lead to giving priority to
> exclusive lock waiters (assume a case where each of exclusive
> lock waiter timesout one after another) and make shared lockers
> wait and a timer based solution might turn out to be costly for
> general cases where wait is not so long.
>

Since all LWLock waiters are ordered in the queue, we can let only the first
waiter set this bit.
Anyway, once the bit is set, shared lockers would be added to the queue.
They would get the lock in queue order.


> Another way could be to
> check if the Exclusive locker needs to go for repeated wait for a
> couple of times, then we can set such a bit.
>

I'm not sure what you mean by repeated wait. Do you mean the exclusive
locker was woken up twice by timeout? Because right now, without a timeout,
the exclusive locker wouldn't be woken up until all shared locks are released.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] Proposal: custom compression methods

2015-12-13 Thread Alexander Korotkov
Hackers,

I'd like to propose a new feature: "Custom compression methods".

*Motivation*

Currently, when a datum doesn't fit the page, PostgreSQL tries to compress
it using the PGLZ algorithm. Compression of particular attributes can be
turned on/off by tuning the storage parameter of the column. Also, there is
a heuristic that a datum is not compressible when its first KB is not
compressible. I can see the following reasons for improving this situation.

 * The heuristic used for detecting compressible data may not be optimal.
We have already hit this situation with jsonb.
 * For some data distributions there could be more effective compression
methods than PGLZ. For example:
   * For natural languages we could use predefined dictionaries, which
would allow us to compress even relatively short strings (which are not
long enough for PGLZ to train its dictionary).
   * For jsonb/hstore we could implement compression methods which have a
dictionary of keys. This could be either a static predefined dictionary or
a dynamically extended dictionary with some storage.
   * For jsonb and other container types we could implement compression
methods which would allow extraction of particular fields without
decompressing the full value.

Therefore, it would be nice to make compression methods pluggable.

*Design*

Compression methods would be stored in a new pg_compress system catalog
table with the following structure:

compname        name
comptype        oid
compcompfunc    regproc
compdecompfunc  regproc

Compression methods could be created by a "CREATE COMPRESSION METHOD"
command and deleted by a "DROP COMPRESSION METHOD" command.

CREATE COMPRESSION METHOD compname [FOR TYPE comptype_name]
WITH COMPRESS FUNCTION compcompfunc_name
 DECOMPRESS FUNCTION compdecompfunc_name;
DROP COMPRESSION METHOD compname;

The signatures of compcompfunc and compdecompfunc would be similar to
pglz_compress and pglz_decompress, except for the compression strategy.
There is only one compression strategy in use for PGLZ
(PGLZ_strategy_default). Thus, I'm not sure it would be useful to provide
multiple strategies for compression methods.

extern int32 compcompfunc(const char *source, int32 slen, char *dest);
extern int32 compdecompfunc(const char *source, int32 slen, char *dest,
int32 rawsize);
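As an illustration of how a custom compression method would plug into these signatures, here is a toy run-length codec. Only the function signatures mirror the proposal; the codec itself, and int32_t standing in for the PostgreSQL int32 typedef, are assumptions for a standalone sketch. Following the pglz convention, the compressor returns -1 when the output would not be smaller than the input.

```c
#include <stdint.h>
#include <string.h>

/*
 * Toy compression method following the proposed function signatures.
 * The run-length scheme is invented purely for illustration.
 */
int32_t
rle_compress(const char *source, int32_t slen, char *dest)
{
    int32_t in = 0, out = 0;

    while (in < slen)
    {
        char    c = source[in];
        int32_t run = 1;

        /* Count a run of identical bytes, capped so it fits in one byte. */
        while (in + run < slen && source[in + run] == c && run < 255)
            run++;

        dest[out++] = (char) run;
        dest[out++] = c;
        in += run;

        if (out >= slen)
            return -1;          /* not compressible: caller stores it raw */
    }
    return out;
}

int32_t
rle_decompress(const char *source, int32_t slen, char *dest, int32_t rawsize)
{
    int32_t in = 0, out = 0;

    while (in + 1 < slen && out < rawsize)
    {
        int32_t run = (unsigned char) source[in++];
        char    c = source[in++];

        memset(dest + out, c, (size_t) run);
        out += run;
    }
    return out;                 /* equals rawsize on success */
}
```

A real method registered via CREATE COMPRESSION METHOD would wrap functions shaped like these, with toast_compress_datum dispatching on the column's configured method.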

A compression method could be type-agnostic (comptype = 0) or type-specific
(comptype != 0). The default compression method is PGLZ.

The compression method of a column would be stored in the pg_attribute
table. Dependencies between columns and compression methods would be tracked
in pg_depend, preventing a compression method which is currently in use from
being dropped. The compression method of an attribute could be altered with
an ALTER TABLE command.

ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD
compname;

Since mixing different compression methods in the same attribute would be
hard to manage (especially dependency tracking), altering an attribute's
compression method would require a table rewrite.

*Implementation details*

Catalog changes, new commands, dependency tracking etc. are mostly
mechanical work with no fundamental problems. The hardest part seems to be
providing seamless integration of custom compression methods into the
existing code.

It doesn't seem hard to add an extra compression-method parameter to
toast_compress_datum. However, PG_DETOAST_DATUM has to call the custom
decompress function with knowledge of the datum only. That means we should
somehow embed knowledge of the compression method into the datum itself.
One solution could be putting the compression method OID right after the
varlena header. Storing this on disk would cause storage overhead and break
backward compatibility; instead, we could add this OID right after reading
the datum from the page. This could be the weakest point in the whole
proposal, and I'd be very glad to hear better ideas.

P.S. I'd like to thank Petr Korobeinikov <pkorobeini...@gmail.com>, who
started work on this patch and sent me a draft of the proposal in Russian.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-12-11 Thread Alexander Korotkov
On Thu, Dec 10, 2015 at 9:26 AM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Wed, Dec 9, 2015 at 2:17 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> On Tue, Dec 8, 2015 at 6:00 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>
>>> On Tue, Dec 8, 2015 at 3:56 PM, Alexander Korotkov <
>>> a.korot...@postgrespro.ru> wrote:
>>>>
>>>> ​Agree. This patch need to be carefully verified. Current experiments
>>>> just show that it is promising direction for improvement. I'll come with
>>>> better version of this patch.
>>>>
>>>> Also, after testing on large machines I have another observation to
>>>> share. For now, LWLock doesn't guarantee that exclusive lock would be ever
>>>> acquired (assuming each shared lock duration is finite). It because when
>>>> there is no exclusive lock, new shared locks aren't queued and LWLock state
>>>> is changed directly. Thus, process which tries to acquire exclusive lock
>>>> have to wait for gap in shared locks.
>>>>
>>>
>>> I think this has the potential to starve exclusive lockers in worst case.
>>>
>>>
>>>> But with high concurrency for shared lock that could happen very rare,
>>>> say never.
>>>>
>>>> We did see this on big Intel machine in practice. pgbench -S gets
>>>> shared ProcArrayLock very frequently. Since some number of connections is
>>>> achieved, new connections hangs on getting exclusive ProcArrayLock. I think
>>>> we could do some workaround for this problem. For instance, when exclusive
>>>> lock waiter have some timeout it could set some special bit which prevents
>>>> others to get new shared locks.
>>>>
>>>>
>>> I think timeout based solution would lead to giving priority to
>>> exclusive lock waiters (assume a case where each of exclusive
>>> lock waiter timesout one after another) and make shared lockers
>>> wait and a timer based solution might turn out to be costly for
>>> general cases where wait is not so long.
>>>
>>
>> ​Since all lwlock waiters are ordered in the queue, we can let only first
>> waiter to set this bit.​
>>
>
> Thats okay, but still every time an Exclusive locker woke up, the
> threshold time for its wait might be already over and it will set the
> bit.  In theory, that looks okay, but as compare to current algorithm
> it will make more shared lockers to be added into wait queue.
>
>
>> Anyway, once bit is set, shared lockers would be added to the queue. They
>> would get the lock in queue order.
>>
>>
>
> Ye thats right, but I think in general the solution to this problem
> should be don't let any Exclusive locker to starve and still allow
> as many shared lockers as possible.  I think here it is important
> how we define starving, should it be based on time or something
> else?  I find timer based solution somewhat less suitable, but may
> be it is okay, if there is no other better way.
>

Yes, we probably should find something better.

Another way could be to
>>> check if the Exclusive locker needs to go for repeated wait for a
>>> couple of times, then we can set such a bit.
>>>
>>
>> ​I'm not sure what do you mean by repeated wait. Do you mean exclusive
>> locker was waked twice up by timeout?
>>
>
> I mean to say once the Exclusive locker is woken up, it again
> re-tries to acquire the lock as it does today, but if it finds that the
> number of retries is greater than certain threshold (let us say 10),
> then we sit the bit.
>

Yes, there is a retry loop in the LWLockAcquire function. A retry happens
when a waiter is woken up but someone else steals the lock before it. A lock
waiter is only woken up by the lock releaser when the lock becomes free.
But with high concurrency on the shared lock, it almost never becomes free,
so the exclusive locker would never be woken up. I'm pretty sure this is
what happens on the big Intel machine while we run the benchmark. So relying
on the number of retries wouldn't work in this case.
I'll run tests to verify whether retries happen in our case.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] Isolation of table creation

2015-12-11 Thread Alexander Korotkov
Hackers,

I discovered an interesting issue with PostgreSQL transaction isolation.
When a transaction is at the repeatable read isolation level, I can't see a
table which was created after the transaction obtained its snapshot, but I
can run DML statements against this table. See the example below.

Session 1
# begin transaction isolation level repeatable read;
BEGIN
# \dt
No relations found.
Session 2
# create table tmp (i int not null);
CREATE TABLE
# insert into tmp values (1);
INSERT 0 1
Session 1
# \dt
No relations found.
# select * from tmp;
 i
---
(0 rows)

# insert into tmp values (2);
INSERT 0 1
# select * from tmp;
 i
---
 2
(1 row)
# commit;
COMMIT

Is it a bug?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Optimizer questions

2016-01-05 Thread Alexander Korotkov
On Wed, Jan 6, 2016 at 12:08 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> konstantin knizhnik <k.knizh...@postgrespro.ru> writes:
> > 1. The cost compared in grouping_planner doesn't take in account price
> of get_authorized_users - it is not changed when I am altering function
> cost. Is it correct behavior?
>
> The general problem of accounting for tlist eval cost is not handled very
> well now, but especially not with respect to the idea that different paths
> might have different tlist costs.  I'm working on an upper-planner rewrite
> which should make this better, or at least make it practical to make it
> better.
>

Hmm... Besides costing, it would be nice to postpone the calculation of
expensive tlist functions until after LIMIT is applied.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] statistics for array types

2015-12-18 Thread Alexander Korotkov
On Wed, Sep 16, 2015 at 8:01 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra <
>> tomas.von...@2ndquadrant.com> wrote:
>>
>>> Hi,
>>>
>>> On 08/11/2015 04:38 PM, Jeff Janes wrote:
>>>
>>>> When reviewing some recent patches, I decided the statistics gathered
>>>>  for arrays had some pre-existing shortcomings.
>>>>
>>>> The main one is that when the arrays contain rare elements there is
>>>> no histogram to fall back upon when the MCE array is empty, the way
>>>> there is for scalar stats.  So it has to punt completely and resort
>>>> to saying that it is 0.5% selectivity without recourse to any data at
>>>> all.
>>>>
>>>> The rationale for applying the threshold before things are eligible
>>>> for inclusion in the MCE array seems to be that this puts some
>>>> theoretical bound on the amount of error we are likely to have in
>>>> that element.  But I think it is better to exceed that theoretical
>>>> bound than it is to have no data at all.
>>>>
>>>> The attached patch forces there to be at least one element in MCE,
>>>> keeping the one element with the highest predicted frequency if the
>>>> MCE would otherwise be empty.  Then any other element queried for is
>>>> assumed to be no more common than this most common element.
>>>>
>>>
>>> We only really need the frequency, right? So do we really need to keep
>>> the actual MCV element? I.e. most_common_elem_freqs does not have the
>>> same number of values as most_common_elems anyway:
>>>
>>>   A list of the frequencies of the most common element values, i.e., the
>>>   fraction of rows containing at least one instance of the given value.
>>>   Two or three additional values follow the per-element frequencies;
>>>   these are the minimum and maximum of the preceding per-element
>>>   frequencies, and optionally the frequency of null elements.
>>>   (Null when most_common_elems is.)
>>>
>>> So we might modify it so that it's always defined - either it tracks the
>>> same values as today (when most_common_elems is defined), or the
>>> frequency of the most common element (when most_common_elems is NULL).
>>>
>>
>> I had also considered that.  It requires more changes to make it happen,
>> and it seems to create a more complex contract on what those columns mean,
>> but without giving a corresponding benefit.
>>
>>
>>>
>>> This way we can keep the current theoretical error-bound on the MCE
>>> frequencies, and if that's not possible we can have at least the new
>>> value without confusing existing code.
>>
>>
>> But if the frequency of the most common element was grossly wrongly, then
>> whatever value we stick in there is still going to be grossly wrong.
>> Removing the value associated with it isn't going to stop it from being
>> wrong.  When we do query with the (incorrectly thought) first most common
>> element, either it will find and use the wrong value from slot 1, or it
>> will find nothing and fall back on the same wrong value from slot 3.
>>
>
> Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we
> collect no MCEs. Moreover, I think we should store it even when num_mcelem
> >= track_len and we haven't cut any MCEs we found. In this case we can get a
> more precise estimate for a rare element using the knowledge that all MCEs
> which exceed the threshold are present (assuming their frequencies could be
> much higher than the threshold).
>
> When there are no MCEs, we should use the assumption that there are no
> elements more frequent than cutoff_freq / nonnull_cnt. Using lower values
> wouldn't be statistically correct.
>

The patch implementing my idea above is attached. In your example it
gives the following result.

# explain (analyze) select * from foobar where foo @> '{567}';
  QUERY PLAN
---
 Seq Scan on foobar  (cost=0.00..2387.00 rows=30 width=61) (actual
time=28.691..28.691 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 10
 Planning time: 0.044 ms
 Execution time: 28.707 ms
(5 rows)

In this particular example it gives a less accurate estimate. However, I
believe it would be a safer estimate in general.
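The proposed rule can be sketched as follows (a Python illustration with hypothetical names; the actual logic lives in the array selectivity estimator in C):

```python
def element_selectivity(elem, mce, mce_freqs, minfreq):
    """Estimate the fraction of rows whose array contains elem.

    mce / mce_freqs: the most common elements and their frequencies.
    minfreq: with the proposed change, when no MCEs were collected this
    holds cutoff_freq / nonnull_cnt -- the threshold an element had to
    exceed to be collected -- so any element absent from the MCE list
    is assumed to be no more frequent than that.
    """
    if elem in mce:
        return mce_freqs[mce.index(elem)]
    # elem wasn't collected as an MCE, so its frequency can't have
    # exceeded the cutoff threshold
    return minfreq

# An element absent from an empty MCE list falls back to the cutoff
print(element_selectivity(567, [], [], 0.0003))  # 0.0003
```

The key point is that the fallback is a statistically justified upper bound, not an arbitrary default.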

I've faced difficulties with storing an empty mcelements array:
update_attstats turns an empty array into NULL, and get_attstatsslot
throws an error when trying to read a NULL array. I've changed
get_attstatsslot so that it returns an empty array when it encounters
NULL. I'm not sure about this solution.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


array_typanalyze_0_mce.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commit fest status for 2015-11

2015-12-24 Thread Alexander Korotkov
On Thu, Dec 24, 2015 at 6:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Michael Paquier <michael.paqu...@gmail.com> writes:
> > And the CF is now closed. Here is the final score:
> > Committed: 30.
> > Moved to next CF: 42.
> > Rejected: 9.
> > Returned with Feedback: 22.
> > Total: 103.
>
> Many thanks to Michael for doing the CF management work this time!
>

+1
Michael, thank you for the great job.


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Patch: pg_trgm: gin index scan performance for similarity search

2015-12-24 Thread Alexander Korotkov
Hi, Christophe!

On Thu, Dec 24, 2015 at 6:28 PM, Fornaroli Christophe <cforn...@gmail.com>
wrote:

> This code uses this upper bound for the similarity: ntrue / (nkeys -
> ntrue). But if there are ntrue trigrams in common, we know that the indexed
> string is at least ntrue trigrams long. We can then use a more aggressive
> upper bound: ntrue / (ntrue + nkeys - ntrue) or ntrue / nkeys. Attached is
> a patch that changes this.
>

Good catch, thank you! The estimate in pg_trgm was not optimal.
I think it would be good to add a comment that explicitly states why we
use this upper bound.
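To make the reasoning such a comment could capture explicit: trigram similarity is shared / (len1 + len2 - shared), and since the indexed string must contain every trigram it shares with the query, len2 >= ntrue, which yields the tighter bound. A small Python sketch with hypothetical numbers:

```python
def similarity(len1, len2, shared):
    # trigram similarity: shared trigrams over the size of the union
    return shared / (len1 + len2 - shared)

def old_upper_bound(ntrue, nkeys):
    # previous bound: assumes nothing about the indexed string's length
    return ntrue / (nkeys - ntrue)

def new_upper_bound(ntrue, nkeys):
    # the indexed string has len2 >= ntrue, so
    # similarity <= ntrue / (nkeys + ntrue - ntrue) = ntrue / nkeys
    return ntrue / nkeys

# e.g. a 9-trigram query sharing 3 trigrams with a 5-trigram string:
# actual similarity is 3/11, below the new bound 3/9, which in turn
# is tighter than the old bound 3/6
```

A tighter upper bound lets the index scan skip more non-matching items without ever rejecting a true match.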

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2015-12-19 Thread Alexander Korotkov
On Fri, Dec 18, 2015 at 10:53 PM, Artur Zakirov <a.zaki...@postgrespro.ru>
wrote:

> On 18.12.2015 22:43, Artur Zakirov wrote:
>
>> Hello.
>>
>> PostgreSQL has a contrib module named pg_trgm. It is used for fuzzy
>> text search. It provides some functions and operators for determining the
>> similarity of the given texts using trigram matching.
>>
>> Sorry, I have forgotten to mark previous message with [PROPOSAL].
>

I think you shouldn't mark the thread as [PROPOSAL] since you're providing
a full patch.


> I registered the patch in commitfest:
> https://commitfest.postgresql.org/8/448/


Great.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-12-01 Thread Alexander Korotkov
On Sat, Nov 28, 2015 at 2:27 PM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Fri, Nov 27, 2015 at 11:42 PM, Teodor Sigaev <teo...@sigaev.ru> wrote:
> > Seems the patch is ready to commit, but it needs some documentation.
>
> Of what kind? The documentation of pg_rewind is rather explicit on the
> subject and looks fine as-is, and that's what Alexander and I agreed
> on upthread. If there is something forgotten though, this may be the
> fact that we do not mention in 9.5's documentation that pg_rewind can
> *not* handle timeline switches.
>

However, we can add some explicit statements about the new pg_rewind
capabilities. Please check the attached patch for those statements.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-7.patch
Description: Binary data



Re: [HACKERS] Why we don't have checksums on clog files

2016-06-07 Thread Alexander Korotkov
On Tue, Jun 7, 2016 at 5:41 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> On Mon, Jun 6, 2016 at 8:43 PM, Alex Ignatov <a.igna...@postgrespro.ru>
> wrote:
>
>> Why don't we have checksums on clog files?
>>
>> We have a checksum on pg_control, optional checksumming on data files,
>> and some form of checksumming on WAL. But why don't we have any
>> checksumming on clogs? Corruption of clogs leads to transaction
>> visibility problems and database consistency violations.
>>
>> Can anybody explain this situation with clogs?
>>
>>
> I think it would be better if users had an option to checksum clog
> pages.  However, I think that will need a change in the page (CLOG-page)
> format, which might not be trivial to accomplish.
>

I think we should think not only about CLOG, but about all SLRU pages.
For data pages we have to reinitialize the database cluster to enable
checksums. I think we can also require reinitialization to enable checksums
of SLRU pages. Thus, the major problem would be that the number of items
fitting on a page would depend on whether checksums are enabled.  However,
it shouldn't be too hard.
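To illustrate the dependency, here is a back-of-the-envelope Python sketch (the 8-byte checksum field is purely hypothetical, not a layout proposal). CLOG stores two status bits per transaction, so any bytes reserved for a checksum reduce the number of transactions a page can hold:

```python
BLCKSZ = 8192          # default PostgreSQL block size
XACTS_PER_BYTE = 4     # CLOG uses 2 status bits per transaction

def clog_xacts_per_page(checksum_bytes=0):
    # bytes left for transaction status after reserving a
    # (hypothetical) per-page checksum field
    return (BLCKSZ - checksum_bytes) * XACTS_PER_BYTE

print(clog_xacts_per_page(0))  # 32768 transactions per page today
print(clog_xacts_per_page(8))  # 32736 with an 8-byte checksum reserved
```

Since a transaction's location in CLOG is computed from its XID and the per-page capacity, changing that capacity is exactly why the cluster would need reinitialization.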

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-01-29 Thread Alexander Korotkov
On Tue, Dec 29, 2015 at 11:56 AM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Wed, Dec 16, 2015 at 12:26 AM, Robert Haas <robertmh...@gmail.com>
> wrote:
> >
> >
> > In terms of this project overall, NumLWLocks() now knows about only
> > four categories of stuff: fixed lwlocks, backend locks (proc.c),
> > replication slot locks, and locks needed by extensions.  I think it'd
> > probably be fine to move the backend locks into PGPROC directly, and
> > the replication slot locks into ReplicationSlot.
> >
>
> Ildus has written a patch to move backend locks into PGPROC, which
> I am reviewing and will do performance tests on once he responds to
> my review comments. I have written a patch to move replication
> slot locks into ReplicationSlot, which is attached to this mail.
>

This patch looks good to me.
It compiles without warnings and passes regression tests.
I also did some light testing of replication slots to check that it
works correctly.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-01-29 Thread Alexander Korotkov
On Mon, Jan 4, 2016 at 5:58 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Mon, Jan 4, 2016 at 1:20 AM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
> > If we do that way, then user of API needs to maintain the interlock
> > guarantee that the requested number of locks is same as allocations
> > done from that tranche and also if it is not allocating all the
> > LWLocks in one function, it needs to save the base address of the
> > array and then allocate from it by maintaining some counter.
> > I agree that looking up tranche names multiple times is not good
> > if there are a large number of lwlocks, but that is done at startup
> > time and not at operation-level.  Also, if we look currently at
> > the extensions in contrib, then just one of them allocates one
> > LWLock in this fashion; there could be extensions apart from
> > the ones in contrib, but I'm not sure they require a large number of
> > LWLocks, so I feel it is better to give an API which is more
> > convenient for the user to use.
>
> Well, I agree with you that we should provide the most convenient API
> possible, but I don't agree that yours is more convenient than the one
> I proposed.  I think it is less convenient.  In most cases, if the
> user asks for a large number of LWLocks, they aren't going to be each
> for a different purpose - they will all be for the same purpose, like
> one per buffer or one per hash partition.  The case where the user
> wants to allocate 8 lwlocks from an extension, each for a different
> purpose, and spread those allocations across a bunch of different
> functions probably does not exist.  *Maybe* there is somebody
> allocating lwlocks from an extension for unrelated purposes, but I'd
> be willing to bet that, if so, all of those allocations happen one
> right after the other in a single function, because anything else
> would be completely nuts.
>

I'm not sure if we should return LWLockPadded*.

+LWLockPadded *
> +GetLWLockAddinTranche(const char *tranche_name)


It would be nice to isolate extensions from knowledge about padding. Could
we one day change it from LWLockPadded * to LWLockMinimallyPadded * or
anything else?

+LWLock **
> +GetLWLockAddinTranche(const char *tranche_name)


Could we use this signature?


> >> > Also I have retained NUM_USER_DEFINED_LWLOCKS in
> >> > MainLWLockArray so that if any extensions want to assign a LWLock
> >> > after startup, it can be used from this pool.  The tranche for such
> >> > locks
> >> > will be reported as main.
> >>
> >> I don't like this.  I think we should get rid of
> >> NUM_USER_DEFINED_LWLOCKS altogether.  It's just an accident waiting to
> >> happen, and I don't think there are enough people using LWLocks from
> >> extensions that we'll annoy very many people by breaking backward
> >> compatibility here.  If we are going to care about backward
> >> compatibility, then I think the old-style lwlocks should go in their
> >> own tranche, not main.  But personally, I think it'd be better to just
> >> force a hard break and make people switch to using the new APIs.
> >
> > One point to think about before removing it altogether is that the new
> > APIs provide a way to allocate LWLocks at startup time, and if any user
> > wants to allocate a new LWLock after start-up, it will not be possible
> > with the proposed APIs.  Do you think for such cases we should ask
> > the user to use it the way we have exposed, or do you have something
> > else in mind?
>
> Allocating a new lock after startup mostly doesn't work, because there
> will be at most 3 available and sometimes less.  And even if it does
> work, it often isn't very useful because you probably need some other
> shared memory space as well - otherwise, what is the lock protecting?
> And that space might not be available either.
>
> I'd be interested in knowing whether there are cases where useful
> extensions can be loaded apart from shared_preload_libraries because
> of NUM_USER_DEFINED_LWLOCKS and our tendency to allocate a little bit
> of extra shared memory, but my suspicion is that it rarely works out
> and is too flaky to be useful to anybody.


+1 for dropping the old API.
I don't think it's useful to have LWLocks without having shared memory.

There is another thing I'd like extensions to be able to do. It would be
nice if extensions could use dynamic shared memory instead of static. Then
extensions could use shared memory without being in
shared_preload_libraries. But if an extension registers a DSM segment,
there is no way to tell other backends to use it for that extension. Also,
DSM would be deallocated when all backends detach from it. This is out of
scope for this patch, though.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-01-29 Thread Alexander Korotkov
On Thu, Jan 21, 2016 at 12:37 AM, Alvaro Herrera <alvhe...@2ndquadrant.com>
wrote:

> So far as I can tell, there are three patches in flight here:
>
> * replication slot IO lwlocks
> * ability of extensions to request tranches dynamically
> * PGPROC
>
> The first one hasn't been reviewed at all, but the other two have seen a
> bit of discussion and evolution.  Is anyone doing any more reviewing?


I'd like to add another one: a fixed tranche id for each SLRU.
And I'm going to review the others.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


slru_tranche_ids_v1.patch
Description: Binary data


