Re: [HACKERS] KNN-GiST with recheck

2015-05-15 Thread Alexander Korotkov
On Fri, May 15, 2015 at 2:49 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 On Fri, May 15, 2015 at 2:48 PM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

 On 05/15/2015 11:31 AM, Alexander Korotkov wrote:

 On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

  On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:

  I think this is now ready for committing, but I'm pretty tired now so
 I'll read through this one more time in the morning, so that I won't
 wake up to a red buildfarm.


 Forgot to attach the latest patch, here you go.



 Looks good to me.


 Ok, pushed after some further minor cleanup.


 Great! Thank you!


BTW, I found that the IndexScan node now lacks copy and output support for
indexorderbyops.
The attached patch fixes that. The copy and output functions assume that
indexorderbyops has the same length as indexorderby. To make this more
evident, I moved the check for best_path->path.pathkeys in create_plan from
an if-statement into an assertion. AFAICS, pathkeys should always be present
when there is indexorderby.

--
With best regards,
Alexander Korotkov.


fix-indexscan-node.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Enhanced ALTER OPERATOR

2015-05-18 Thread Alexander Korotkov
On Mon, May 18, 2015 at 5:44 PM, Uriy Zhuravlev u.zhurav...@postgrespro.ru
wrote:


 Then need to prohibit the use of shells operators (stub operators) to
 create
 self-linkage. Implicitly changing commutators and negators working for a
 long
 time through the shell operators.


I can give another motivation. AFAICS, ALTER OPERATOR should typically
introduce enhancements. For instance, some version of an extension didn't
have a negator for an operator, and the next version introduces such a
negator. Or the same situation with selectivity estimation. If ALTER
OPERATOR introduces only enhancements, then old plans could be suboptimal,
but they won't lead to invalid query answers. From this point of view, cache
invalidation after ALTER OPERATOR is a matter of optimization. We could
include in the patch documentation a statement about possible side effects
with cached query plans.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] KNN-GiST with recheck

2015-05-14 Thread Alexander Korotkov
On Wed, May 13, 2015 at 10:17 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 One quick comment:

 It would be good to avoid the extra comparisons of the distances, when
 the index doesn't return any lossy items. As the patch stands, it adds one
 extra copyDistances() call and a cmp_distances() call for each tuple (in a
 knn-search), even if there are no lossy tuples.


 I will fix it by Friday.


The attached patch is rebased against current master. The extra
copyDistances() call and cmp_distances() call for each tuple are avoided
in the case of no lossy tuples.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-9.patch
Description: Binary data



Re: [HACKERS] KNN-GiST with recheck

2015-04-17 Thread Alexander Korotkov
On Wed, Feb 25, 2015 at 12:15 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 Hi!

 On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra 
 tomas.von...@2ndquadrant.com wrote:

 On 17.2.2015 14:21, Alexander Korotkov wrote:
  On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
  aekorot...@gmail.com mailto:aekorot...@gmail.com wrote:
 
  Revised patch with reordering in GiST is attached
  (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).

 I meant to do a bit of testing on this (assuming it's still needed), but
 the patches need rebasing - Heikki fixed a few issues, so they don't
 apply cleanly.


 Both patches are revised.


Both patches are rebased against current master.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-8.patch
Description: Binary data


knn-gist-recheck-in-gist-3.patch
Description: Binary data



Re: [HACKERS] Why no jsonb_exists_path()?

2015-06-10 Thread Alexander Korotkov
Josh,

On Tue, Jun 9, 2015 at 9:16 PM, Josh Berkus j...@agliodbs.com wrote:

 Dmitry, Alexander:

 I'm noticing a feature gap for JSONB operators; we have no way to do this:

 jsonb_col ? ARRAY['key1','key2','key3']


What documents do you expect this operator to match?
Such syntax can be interpreted with very different semantics: checking keys
only at the top level, at any level, as a sequence starting from the top
level, etc.

'{"key1": "value1", "key2": "value2", "key3": "value3"}'
'{"key1": {"key2": {"key3": "value"}}}'
'{"key1": "value1"}'
'[{"key1": "value1"}, {"key2": "value2"}, {"key3": "value3"}]'

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Enhanced ALTER OPERATOR

2015-05-29 Thread Alexander Korotkov
On Thu, May 28, 2015 at 6:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
  Could we address both this problems by denying changing existing
  commutators and negator? ISTM that setting absent commutator and negator
 is
  quite enough for ALTER OPERATOR. User extensions could need setting of
  commutator and negator because they could add new operators which don't
  exist before. But it's rather uncommon to unset or change commutator or
  negator.

 Note that this functionality is already covered, in that you can specify
 the commutator/negator linkage when you create the second operator.
 I'm not particularly convinced that we need to have it in ALTER OPERATOR.


Yeah, I didn't notice that CREATE OPERATOR sets the commutator and negator on
the opposite side as well. Then we can probably leave ALTER OPERATOR without
support for altering the commutator and negator.

BTW, does DROP OPERATOR work correctly?

# create operator == (procedure = int8eq, leftarg = bigint, rightarg =
bigint);
CREATE OPERATOR
# create operator !== (procedure = int8ne, leftarg = bigint, rightarg =
bigint, negator = ==);
CREATE OPERATOR
# select oid, * from pg_operator where oprnamespace = (select oid from
pg_namespace where nspname = 'public');
  oid  | oprname | oprnamespace | oprowner | oprkind | oprcanmerge | oprcanhash | oprleft | oprright | oprresult | oprcom | oprnegate | oprcode | oprrest | oprjoin
-------+---------+--------------+----------+---------+-------------+------------+---------+----------+-----------+--------+-----------+---------+---------+---------
 46355 | !==     |         2200 |       10 | b       | f           | f          |      20 |       20 |        16 |      0 |     46354 | int8ne  | -       | -
 46354 | ==      |         2200 |       10 | b       | f           | f          |      20 |       20 |        16 |      0 |     46355 | int8eq  | -       | -
(2 rows)
# create table test (id bigint);
CREATE TABLE
# explain verbose select * from test where not id == 10::bigint;
  QUERY PLAN
---
 Seq Scan on public.test  (cost=0.00..38.25 rows=1130 width=8)
   Output: id
   Filter: (test.id !== '10'::bigint)
(3 rows)
# drop operator !== (int8, int8);
DROP OPERATOR
# select oid, * from pg_operator where oprnamespace = (select oid from
pg_namespace where nspname = 'public');
  oid  | oprname | oprnamespace | oprowner | oprkind | oprcanmerge | oprcanhash | oprleft | oprright | oprresult | oprcom | oprnegate | oprcode | oprrest | oprjoin
-------+---------+--------------+----------+---------+-------------+------------+---------+----------+-----------+--------+-----------+---------+---------+---------
 46354 | ==      |         2200 |       10 | b       | f           | f          |      20 |       20 |        16 |      0 |     46355 | int8eq  | -       | -
(1 row)
# explain verbose select * from test where not id == 10::bigint;
ERROR:  cache lookup failed for function 0

DROP OPERATOR leaves a broken reference in oprnegate. Should we fix it?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] PGCon hacker lounge

2015-05-27 Thread Alexander Korotkov
On Wed, May 27, 2015 at 7:00 PM, Dan Langille d...@langille.org wrote:

 Have you been to PGCon before?  Do you remember the hacker lounge?  Do you
 remember going there to work on stuff?  Do you recall anything about it?


I remember trying to visit it in 2012 or 2013. At that time I found an empty
room with nobody there. I didn't try to visit it again after that.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Enhanced ALTER OPERATOR

2015-05-28 Thread Alexander Korotkov
On Wed, May 27, 2015 at 9:28 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, May 18, 2015 at 10:21 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Uriy Zhuravlev u.zhurav...@postgrespro.ru writes:
  I have attached a patch that extends ALTER OPERATOR to support
 COMMUTATOR,
  NEGATOR, RESTRICT and JOIN.
 
  There are fairly significant reasons why we have not done this, based
  on the difficulty of updating existing cached plans that might have
  incidentally depended on those operator properties during creation.
  Perhaps it's all right to simply ignore such concerns, but I would like
  to see a defense of why.

 I don't think there's a direct problem with cached plans, because it
 looks like plancache.c blows away the entire plan cache for any change
 to pg_operator.  OperatorUpd() can already update oprcom and
 oprnegate, which seems to stand for the proposition that it's OK to
 set those from InvalidOid to something else.  But that doesn't prove
 that other kinds of changes are safe.

 A search of other places where oprcom is used in the code led me to
 ComputeIndexAttrs().  If an operator whose commutator is itself were
 changed so that the commutator was anything else, then we'd end up
 with a broken exclusion constraint.  That's a problem.
 targetIsInSortList is run during parse analysis, and that too tests
 whether sortop == get_commutator(scl->sortop).  Those decisions
 wouldn't get reevaluated if the truth of that expression changed after
 the fact, which I suspect is also a problem.


Could we address both of these problems by denying changes to existing
commutators and negators? ISTM that setting an absent commutator or negator is
quite enough for ALTER OPERATOR. User extensions could need to set the
commutator and negator because they can add new operators which didn't
exist before. But it's rather uncommon to unset or change a commutator or
negator.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-24 Thread Alexander Korotkov
On Mon, Aug 24, 2015 at 5:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
  On Mon, Aug 10, 2015 at 7:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Hm.  So one way or the other we're going to end up violating relational
  theory somewhere.  OK, I yield: let's say that pg_am has amname, amkind,
  amhandler, and nothing else.  Then we will need SQL functions to expose
  whatever information we think needs to be available to SQL code.

  There is a second revision of this patch. The changes are as follows:

   * AmRoutine was renamed to IndexAmRoutine, assuming there could be other
  access methods in the future.
   * amhandler now returns the index_am_handler pseudotype.
   * The CHECK_PROCEDUREs are now in the place of the original GET_REL_PROCEDUREs.
   * amstrategies, amsupport, amcanorderbyop, amstorage, and amkeytype are in
  both pg_am and IndexAmRoutine. Consistency between the amhandler answer and
  the pg_am tuple is checked.

 [ scratches head... ]  I thought we'd just agreed we weren't going to keep
 any of those pg_am columns?  If we keep them, we'll have to define what
 they mean for sequence AMs etc.  (Let them be NULL would likely break
 enough stuff that we might as well not have them.)


From the previous discussion I see the following options:
1) Non-index access methods reuse neither pg_class.relam nor pg_am. Fully
relationally compliant, but a complex catalog structure.
2) Non-index access methods reuse pg_class.relam but don't reuse pg_am.
This violates relational theory because a single column references multiple
tables.
3) Non-index access methods reuse both pg_class.relam and pg_am. This
violates relational theory because we store different kinds of objects in the
same table.

I'd say we already have a precedent for #2: pg_depend, which references
objects of arbitrary types.
With #3 we really shouldn't keep anything specific to index AMs in pg_am.

But what do we assume an access method to be in general? For instance, we
have foreign data wrappers, which aren't access methods (but look quite
similar to some degree).

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Microvacuum for gist.

2015-07-30 Thread Alexander Korotkov
Hi!

On Thu, Jul 30, 2015 at 2:51 PM, Anastasia Lubennikova 
lubennikov...@gmail.com wrote:

 I have written microvacuum support for the gist access method.
 Briefly, microvacuum includes two steps:
 1. When a search tells us that a tuple is invisible to all transactions, it
 is marked LP_DEAD and the page is marked as having dead tuples.
 2. Then, when an insert touches a full page which has dead tuples, it calls
 microvacuum instead of splitting the page.
 You can find a kind of review here [1].

 [1]
 http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120

 Patch is in attachements. Please review it.


Nice!

Some notes about this patch.

1) Could you give some test case demonstrating that microvacuum really works
with the patch? Finally, we should see the index grow less with microvacuum.

2) Generating notices for every dead tuple would be too noisy. I suggest
replacing the notice with one of the debug levels.

+  elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d",
offnum, BufferGetBlockNumber(buffer));


3) Please recheck the coding style. For instance, this line needs more
spaces and the open brace should be on the next line.

+ if ((scan->kill_prior_tuple)&&(so->curPageData > 0)&&(so->curPageData ==
so->nPageData)) {

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] 64-bit XIDs again

2015-07-30 Thread Alexander Korotkov
Hackers,

I know there have already been a couple of threads about 64-bit XIDs.
http://www.postgresql.org/message-id/42cccfe9.9040...@intellilink.co.jp
http://www.postgresql.org/message-id/4f6c0e13.3080...@wiesinger.com
I read them carefully, but I didn't find all the arguments for 64-bit XIDs
mentioned. That's why I'd like to raise this subject again.

Hardware capabilities are now much higher than when Postgres was designed.
In modern PostgreSQL scalability tests it's typical to achieve 400,000 -
500,000 tps with pgbench. At such rates it takes only a few minutes to
reach the default autovacuum_freeze_max_age = 200 million.

The notion of wraparound has evolved over time. Initially it was
something that almost never happened. Then it became something that could
happen rarely, and which we should be ready for (freeze tuples in advance).
Now it has become quite a frequent periodic event for a high-load database,
and DB admins have to take its performance impact into account.

A typical scenario that I've faced in real life goes like this. A database is
divided into operative and archive parts. The operative part is small (dozens
of gigabytes) and serves most of the transactions. The archive part is
relatively large (some terabytes) and serves rare selects and bulk inserts.
Autovacuum works very actively on the operative part and very lazily on the
archive part (as expected). The system works well until one day the age of
the archive tables exceeds autovacuum_freeze_max_age. Then all autovacuum
workers start vacuuming to prevent wraparound on the archive tables. Even if
the system I/O survives this, the operative tables get bloated because all
autovacuum workers are busy with the archive tables. In such situations I
typically advise increasing autovacuum_freeze_max_age and running vacuum
freeze manually when the system has enough free resources.

As I mentioned in the CSN thread, it would be nice to replace the XID with
the CSN when setting hint bits for a tuple. In that case, once hint bits are
set we don't need any additional lookups to check visibility.
http://www.postgresql.org/message-id/CAPpHfdv7BMwGv=ofug3s-jgvfkqhi79pr_zk1wsk-13oz+c...@mail.gmail.com
Introducing a 32-bit CSN doesn't seem reasonable to me, because it would
double our troubles with wraparound.

Also, I think it's possible to migrate to 64-bit XIDs without breaking
pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples would be
created with 64-bit XIDs. We can use the free bits in t_infomask2 to
distinguish the old and new formats.

Any thoughts? Do you think 64-bit XIDs worth it?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Alexander Korotkov
On Thu, Jul 30, 2015 at 5:24 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/30/2015 04:26 PM, Alexander Korotkov wrote:

 Also, I think it's possible to migrate to 64-bit XIDs without breaking
 pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples
 would be created with 64-bit XIDs. We can use the free bits in t_infomask2
 to distinguish the old and new formats.


 I think we should move to 64-bit XIDs in in-memory structs snapshots, proc
 array etc. And expand clog to handle 64-bit XIDs. But keep the xmin/xmax
 fields on heap pages at 32-bits, and add an epoch-like field to the page
 header so that logically the xmin/xmax fields on the page are 64 bits wide,
 but physically stored in 32 bits. That's possible as long as no two XIDs on
 the same page are more than 2^31 XIDs apart. So you still need to freeze
 old tuples on the page when that's about to happen, but it would make it
 possible to have more than 2^32 XID transactions in the clog. You'd never
 be forced to do anti-wraparound vacuums, you could just let the clog grow
 arbitrarily large.


Nice idea. Storing the extra epoch would cost an extra 4 bytes per heap tuple
instead of the extra 8 bytes per tuple of storing 64-bit xmin/xmax.
But if the first column is aligned to 8 bytes (e.g. bigserial), would we lose
this 4-byte win to alignment?


There is a big downside to expanding xmin/xmax to 64 bits: it takes space.
 More space means more memory needed for caching, more memory bandwidth,
 more I/O, etc.



--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] 64-bit XIDs again

2015-07-31 Thread Alexander Korotkov
On Fri, Jul 31, 2015 at 12:23 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 But the elephant in the room is on-disk compatibility.  There is
 absolutely no way that we can just change xmin/xmax to 64 bits without a
 disk format break.


That seems problematic. But I'm not yet convinced that there is absolutely
no way to do this.


 However, if we do something like what Heikki is
 suggesting, it's at least conceivable that we could convert incrementally
 (ie, if you find a page with the old header format, assume all tuples in
 it are part of epoch 0; and do not insert new tuples into it unless there
 is room to convert the header to new format ... but I'm not sure what we
 do about tuple deletion if the old page is totally full and we need to
 write an xmax that's past 4G).


If a user upgrades a database cluster with pg_upgrade, he would stop the old
postmaster, run pg_upgrade, and start the new postmaster. That means we start
from a point where there are no running transactions. Thus, among tuples of
the old format there are two kinds: visible to everybody and invisible to
everybody. When we update or delete an old tuple of the first kind, we don't
actually need to store its xmin anymore. We can store a 64-bit xmax in the
place of xmin/xmax.

So, in order to switch to 64-bit xmin/xmax, we have to take both free bits
from t_infomask2 in order to implement it. They should indicate one of 3
possible tuple formats:
1) Old format: both xmin/xmax are 32-bit.
2) Intermediate format: xmax is 64-bit, xmin is frozen.
3) New format: both xmin/xmax are 64-bit.

But we can use the same idea to implement an epoch in the heap page header as
well. If the new page header doesn't fit on the page, then we don't have to
insert anything into that page; we just need to set xmax and flags on the
existing tuples. Then we can use two of the formats listed above, #1 and #2,
and take just one free bit from t_infomask2 for format indication.

Probably I'm missing something, but I think keeping on-disk compatibility
should somehow be possible.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Patch for ginCombineData

2015-08-05 Thread Alexander Korotkov
Hi!

On Wed, Aug 5, 2015 at 1:17 PM, Robert Abraham 
robert.abraha...@googlemail.com wrote:

 We are using GIN indexes on big tables. These tables happen to have
 several billion rows.
 The index creation fails in ginCombineData in
 src/backend/access/gin/ginbulk.c because repalloc is limited to 1 GB.
 This limitation makes no sense in this context (compare the comments in
 src/include/utils/memutils.h).
 To overcome this limitation on tables with lots of rows, repalloc_huge is
 used.
 The test suite still succeeds on my machine.
 Find the patch attached,


Thank you for the notice and for the patch!
You would need maintenance_work_mem > 1GB and some very frequent entry so
that its posting list exceeds 1 GB by itself.
These circumstances shouldn't be very rare on modern systems. I think it
could be backpatched.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Microvacuum for gist.

2015-08-03 Thread Alexander Korotkov
On Mon, Aug 3, 2015 at 12:27 PM, Anastasia Lubennikova 
a.lubennik...@postgrespro.ru wrote:

 1) Test and results are in attachments. Everything seems to work as
 expected.
 2) I dropped these notices. It was done only for debug purposes. Updated
 patch is attached.
 3) fixed


Good! Another couple of notes from me:
1) I think gistvacuumpage() and gistkillitems() need function-level
comments.
2) ItemIdIsDead() can be used in an index scan like it's done
in _bt_checkkeys().

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] 64-bit XIDs again

2015-07-31 Thread Alexander Korotkov
On Fri, Jul 31, 2015 at 1:27 AM, Petr Jelinek p...@2ndquadrant.com wrote:

 On 2015-07-30 23:23, Tom Lane wrote:

 But the elephant in the room is on-disk compatibility.  There is
 absolutely no way that we can just change xmin/xmax to 64 bits without a
 disk format break.  However, if we do something like what Heikki is
 suggesting, it's at least conceivable that we could convert incrementally
 (ie, if you find a page with the old header format, assume all tuples in
 it are part of epoch 0; and do not insert new tuples into it unless there
 is room to convert the header to new format ...


 We could theoretically do similar thing with 64bit xmin/xmax though -
 detect page is in old format and convert all tuples there to 64bit
 xmin/xmax.

 But I agree that we don't want to increase bloat per tuple as it's already
 too big.

 but I'm not sure what we
 do about tuple deletion if the old page is totally full and we need to
 write an xmax that's past 4G).


 If the page is too full we could move some data to different (or new) page.

 For me bigger issue is that we'll still have to refreeze pages because
 if tuples are updated or deleted in different epoch than the one they were
 inserted in, the new version of tuple has to go to different page and the
 old page will have free space that can't be used by new tuples since the
 system is now in different epoch.


It is not so easy to move a heap tuple to a different page. When a table has
indexes, each tuple is referenced by index tuples as (blockNumber; offset),
and we can't remove those references without a vacuum. Thus, we would have to
invent something like multi-page HOT in order to move tuples between pages.
And that seems to be a complicated kludge.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-14 Thread Alexander Korotkov
On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev 
i.kurbangal...@postgrespro.ru wrote:

 On 08/05/2015 09:33 PM, Robert Haas wrote:

 On Wed, Aug 5, 2015 at 1:10 PM, Ildus Kurbangaliev
 i.kurbangal...@postgrespro.ru wrote:

 About `memcpy`, PgBackendStatus struct already have a bunch of multi-byte
 variables,  so it will be
 not consistent anyway if somebody will want to copy it in that way. On
 the
 other hand two bytes in this case
 give less overhead because we can avoid the offset calculations. And as
 I've
 mentioned before the class
 of wait will be useful when monitoring of waits will be extended.

 You're missing the point.  Those multi-byte fields have additional
 synchronization requirements, as I explained in some detail in my
 previous email. You can't just wave that away.

 I see that now. Thank you for the point.

 I've looked deeper and I found PgBackendStatus to be not a suitable
 place for keeping information about low level waits. Really,
 PgBackendStatus
 is used to track high level information about backend. This is why
 auxiliary
 processes don't have PgBackendStatus, because they don't have such
 information
 to expose. But when we come to low-level wait events, auxiliary
 processes are as useful for monitoring as backends are. The WAL writer,
 checkpointer, bgwriter etc. use LWLocks as well. It is certainly unclear
 why they can't be monitored.

 This is why I think we shouldn't place the wait event into PgBackendStatus.
 It could be placed into PGPROC or even a separate data structure with a
 different concurrency model which would be most suitable for monitoring.


+1 for tracking wait events not only for backends.

Ildus, could you do the following?
1) Extract the LWLocks refactoring into a separate patch.
2) Make a patch that stores the current wait event information in PGPROC.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-10 Thread Alexander Korotkov
On Mon, Aug 10, 2015 at 6:36 PM, Petr Jelinek p...@2ndquadrant.com wrote:

 On 2015-08-10 16:58, Alexander Korotkov wrote:

 That should work, thanks! Also we can have SQL-visible functions to get
 amsupport and amstrategies and use them in the regression tests.


 SQL-visible functions would be preferable to storing it in pg_am as
 keeping the params in pg_am would limit the extensibility of pg_am itself.


I actually meant just two particular functions (not per-AM), both taking an
AM oid as an argument. They would call amhandler and return amsupport and
amstrategies respectively.
These functions can be used in regression tests to check opclass and
opfamily correctness.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-10 Thread Alexander Korotkov
On Mon, Aug 10, 2015 at 5:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
  On Mon, Aug 10, 2015 at 1:12 PM, Petr Jelinek p...@2ndquadrant.com
 wrote:
  I don't understand this, there is already AmRoutine in RelationData, why
  the need for additional field for just amsupport?

  We need amsupport in load_relcache_init_file(), which reads
  pg_internal.init. I'm not sure this is the correct place to call
  am_handler.
  It should work in the case of a built-in AM. But if the AM is defined in an
  extension, then we wouldn't be able to do a catalog lookup for am_handler
  at this stage of initialization.

 This is an issue we'll have to face before there's much hope of having
 index AMs as extensions: how would you locate any extension function
 without catalog access?  Storing raw function pointers in pg_internal.init
 is not an answer in an ASLR world.

 I think we can dodge the issue so far as pg_internal.init is concerned by
 decreeing that system catalogs can only have indexes with built-in AMs.
 Calling a built-in function doesn't require catalog access, so there
 should be no problem with re-calling the handler function by OID during
 load_relcache_init_file().


That should work, thanks! Also we can have SQL-visible functions to get
amsupport and amstrategies and use them in the regression tests.


 We could also have problems with WAL replay, though I think the consensus
 there is that extension AMs have to use generic WAL records that don't
 require any index-specific replay code.


Yes, I've already shown one version of generic WAL records. The main
objection to them was that it's hard to ensure that a composed WAL record is
correct.
Now I'm working on a new version which will be much easier and safer to use.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-10 Thread Alexander Korotkov
On Mon, Aug 10, 2015 at 6:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Petr Jelinek p...@2ndquadrant.com writes:
  On 2015-08-10 16:58, Alexander Korotkov wrote:
  That should work, thanks! Also we can have SQL-visible functions to get
  amsupport and amstrategies and use them in the regression tests.

  SQL-visible functions would be preferable to storing it in pg_am as
  keeping the params in pg_am would limit the extensibility of pg_am
 itself.

 I don't see any particularly good reason to remove amsupport and
 amstrategies from pg_am.  Those are closely tied to the other catalog
 infrastructure for indexes (pg_amproc, pg_amop) which I don't think are
 candidates for getting changed by this patch.

 There are a couple of other pg_am columns, such as amstorage and
 amcanorderbyop, which similarly bear on what's legal to appear in
 related catalogs such as pg_opclass.  I'd be sort of inclined to
 leave those in the catalog as well.  I do not see that exposing
 a SQL function is better than exposing a catalog column; either
 way, that property is SQL-visible.


That answers my question, thanks!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-10 Thread Alexander Korotkov
On Mon, Aug 10, 2015 at 1:12 PM, Petr Jelinek p...@2ndquadrant.com wrote:

 On 2015-08-09 23:56, Alexander Korotkov wrote:

  Hackers,

  Some time ago I proposed patches implementing CREATE ACCESS METHOD.

 http://www.postgresql.org/message-id/capphfdsxwzmojm6dx+tjnpyk27kt4o7ri6x_4oswcbyu1rm...@mail.gmail.com
  As I gather from the comments on my patches and also from Tom's comment
  about the AM interface in the tablesampling thread, the AM interface needs
  reworking. And AFAICS the AM interface rework is essential to have a
  CREATE ACCESS METHOD command.
 http://www.postgresql.org/message-id/5438.1436740...@sss.pgh.pa.us


 Cool, I was planning to take a stab at this myself when I have time, so I
 am glad you posted this.

  This is why I'd like to show a WIP patch implementing the AM interface
  rework. The patch is quite dirty yet, but I think it illustrates the idea
  quite clearly. An AM now has a single parameter – a handler function. This
  handler returns a pointer to AmRoutine, which has the same data as pg_am
  had before. Thus, the API is organized very much like the FDW API.


 I wonder if it would make sense to call it IndexAmRoutine instead in case
 we add other AMs (like the sequence am, or maybe column store if that's
 done as AM) in the future.


Good point!


 The patch should probably add the am_handler type which should be return
 type of the am handler function (see fdw_handler and tsm_handler types).


Sounds reasonable!


 I also think that the CHECK_PROCEDUREs should be in the place of the
 original GET_REL_PROCEDUREs  (just after relation check). If the interface
 must exist we may as well check for it at the beginning and not after we
 did some other work which is useless without the interface.


Ok, good point too.

However, this patch appears to take more work than I expected. It has to
 make many changes spread across many files.


 Yeah you need to change the definition and I/O handling of every AM
 function, but that's to be expected.


Yes, this is why I decided to get some feedback on this stage of work.


 Also, it turns out not to be so easy to hide amsupport inside AmRoutine,
 because it's needed by the relcache. As a temporary solution it's
 duplicated in RelationData.


 I don't understand this, there is already AmRoutine in RelationData, why
 the need for additional field for just amsupport?


We need amsupport in load_relcache_init_file(), which reads
pg_internal.init. I'm not sure that is the correct place to call am_handler.
It should work in the case of a built-in AM, but if the AM is defined in an
extension then we wouldn't be able to do a catalog lookup for am_handler at
that stage of initialization.

Another point is that amsupport and amstrategies are used for regression
tests of opclasses and opfamilies. Thus, we can probably keep them in pg_am.
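For illustration, the FDW-like shape discussed upthread could look roughly
like this. This is only a sketch: the struct name, fields and handler name
here are made up for the example and are not the actual patch.

```c
#include <stdint.h>

/* Sketch of the proposed IndexAmRoutine: one struct carrying what the
 * pg_am columns used to carry, plus the AM entry points.  All names and
 * fields here are illustrative, not the actual patch. */
typedef struct IndexAmRoutine
{
    uint16_t    amstrategies;       /* formerly pg_am.amstrategies */
    uint16_t    amsupport;          /* formerly pg_am.amsupport */
    int         (*ambuild) (void);  /* stand-in for the real entry points */
} IndexAmRoutine;

static int
btbuild_stub(void)
{
    return 0;
}

/* The single SQL-visible function an AM would register: the handler.
 * The core would call it once and cache the returned routine table. */
static const IndexAmRoutine *
bthandler_stub(void)
{
    static const IndexAmRoutine routine = {5, 3, btbuild_stub};

    return &routine;
}
```

The point of the pattern is that only the handler needs a catalog entry;
everything else travels in the returned struct, exactly as for FDWs.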

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Proposal for CSN based snapshots

2015-07-27 Thread Alexander Korotkov
On Sat, Jul 25, 2015 at 11:39 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 24 July 2015 at 19:21, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Jul 24, 2015 at 1:00 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
  It depends on the exact design we use to get that. Certainly we do not
 want
  them if they cause a significant performance regression.

 Yeah.  I think the performance worries expressed so far are:

 - Currently, if you see an XID that is between the XMIN and XMAX of
 the snapshot, you hit CLOG only on first access.  After that, the
 tuple is hinted.  With this approach, the hint bit doesn't avoid
 needing to hit CLOG anymore, because it's not enough to know whether
 or not the tuple committed; you have to know the CSN at which it
 committed, which means you have to look that up in CLOG (or whatever
 SLRU stores this data).  Heikki mentioned adding some caching to
 ameliorate this problem, but it sounds like he was worried that the
 impact might still be significant.


 This seems like the heart of the problem. Changing a snapshot from a list
 of xids into one number is easy. Making XidInMVCCSnapshot() work is the
 hard part because there needs to be a translation/lookup from CSN to
 determine if it contains the xid.

 That turns CSN into a reference to a cached snapshot, or a reference by
 which a snapshot can be derived on demand.


I see the problem. Currently, once we set hint bits we don't have to visit
the CLOG anymore. With CSN snapshots that is not so: we still have to
translate the XID into a CSN in order to compare it with the snapshot's CSN.
In the version of the CSN patch in this thread we still have XMIN and XMAX
in the snapshot. AFAICS, with CSN snapshots XMIN and XMAX are not strictly
required to express a snapshot; they were kept as an optimization. That
restricts usage of the XID => CSN map to the given range of XIDs. However,
with long-running transactions the [XMIN; XMAX] range could be very wide and
we could end up using the XID => CSN map heavily across a wide range of XIDs.

As I can see from the Huawei PGCon talk, a Dense Map in shared memory is
proposed for the XID => CSN transformation. Having a large enough Dense Map,
we can do most XID => CSN transformations with a single shared memory
access. The PGCon talk gives us pgbench results. However, pgbench doesn't
run any long transactions. With a long-running transaction we could run out
of the Dense Map for a significant fraction of XID => CSN transformations.
Dilip, did you test your patch with long transactions?

I'm also thinking about a different option for optimizing this. When we set
hint bits we could also replace XMIN/XMAX with the CSN. In this case we
wouldn't need to do the XID => CSN transformation for this tuple anymore.
However, we have 32-bit XIDs for now. We could also have 32-bit CSNs, but
that would double our troubles with wraparound: we would have two counters
that could wrap around. That could bring us back to thoughts about 64-bit
XIDs.
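To make the trade-off concrete, here is a minimal sketch of what the
visibility check becomes under CSN snapshots. The names and the toy array
standing in for the XID => CSN SLRU are illustrative only, not the patch's
actual code.

```c
#include <stdint.h>

typedef uint64_t CSN;
#define INVALID_CSN ((CSN) 0)   /* transaction not committed yet */

/* Toy stand-in for the XID => CSN map (a CLOG-like SLRU in the patch). */
static CSN xid_csn_map[32];

static CSN
TransactionIdGetCSN(uint32_t xid)
{
    return xid_csn_map[xid];
}

/* Under CSN snapshots, "xid is still in progress for this snapshot"
 * reduces to one comparison -- but it needs the commit CSN, so even a
 * hinted-committed tuple requires a map lookup, unlike today's
 * hint-bit fast path that avoids the CLOG entirely. */
static int
XidInMVCCSnapshot(uint32_t xid, CSN snapshot_csn)
{
    CSN commit_csn = TransactionIdGetCSN(xid);

    return commit_csn == INVALID_CSN || commit_csn >= snapshot_csn;
}
```

This is why the width of the [XMIN; XMAX] range matters: every XID inside
it pays for that lookup, however cheap the comparison itself is.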

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Selectivity estimation for intarray with @@

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 6:49 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/21/2015 03:44 PM, Alexander Korotkov wrote:

 While Uriy is on vacation, I've revised this patch a bit.


 I whacked this around quite a bit, and I think it's in a committable state
 now. But if you could run whatever tests you were using before on this, to
 make sure it still produces the same estimates, that would be great. I
 didn't change the estimates it should produce, only the code structure.

 One thing that bothers me slightly with this patch is the way it peeks
 into the Most-Common-Elements arrays, which is produced by the built-in
 type analyze function. If we ever change what statistics are collected for
 arrays, or the way they are stored, this might break. In matchsel, why
 don't we just call the built-in estimator function for each element that we
 need to probe, and not look into the statistics ourselves at all? I
 actually experimented with that, and it did slash much of the code, and it
 would be more future-proof. However, it was also a lot slower for queries
 that contain multiple values. That's understandable: the built-in estimator
 will fetch the statistics tuple, parse the arrays, etc. separately for each
 value in the query_int, while this patch will do it only once for the whole
 query, and perform a simple binary search for each value. So overall, I
 think this is OK as it is. But if we find that we need to use the MCE list
 in this fashion in more places in the future, it might be worthwhile to add
 some support code for this in the backend to allow extracting the stats
 once, and doing multiple lightweight estimations using the extracted
 stats.


Yeah, I see. We could end up with something like this. But probably we
would need something more general for extensions which want to play with
statistics.
For instance, pg_trgm could estimate selectivity for the text % text
operator, but in order to provide that it needs trigram statistics. Today
that could be implemented by defining a separate datatype, but it's a
kludge. Probably, we would end up with custom additional statistics for
datatypes.
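The once-per-query MCE probe Heikki describes can be sketched like this
(the function and parameter names are assumptions for the example; the real
code works with pg_statistic's values/numbers arrays):

```c
#include <stddef.h>

/* One binary search per probed element over the sorted
 * most-common-elements list; elements absent from the list fall back to
 * a default frequency.  This cheap inner step is what replaces a full
 * per-value call into the built-in estimator. */
static double
mce_element_selec(const int *mce_values, const double *mce_freqs,
                  size_t nmce, int elem, double default_freq)
{
    size_t      lo = 0;
    size_t      hi = nmce;

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (mce_values[mid] < elem)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < nmce && mce_values[lo] == elem)
        return mce_freqs[lo];
    return default_freq;
}
```

Fetching and parsing the statistics tuple once and then running this per
value is what makes the multi-value case fast.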


 Some things I fixed/changed:

 * I didn't like that transformOperator() function, which looked up the
 function's name. I replaced it with separate wrapper functions for each
 operator, so that the built-in operator's OID can be hardcoded into each.

 * I refactored the matchsel function heavily. I think it's more readable
 now.

 * I got rid of the Int4Freq array. It didn't seem significantly easier to
 work with than the separate values/numbers arrays, so I just used those
 directly.

 * Also use the matchsel estimator for ~~ (the commutator of @@)


In this version of the patch it's not checked whether the variable is
actually an int[] and not a query_int. See the following test case.

# create table test2 as (select '1'::query_int val from
generate_series(1,100));
# analyze test2;

# explain select * from test2 where '{1}'::int[] @@ val;
ERROR:  unrecognized int query item type: 0

I've added this check.

* Also use the estimators for the obsolete @ and ~ operators. Not that I
 care much about those as they are obsolete, but seems strange not to, as
 it's a trivial matter of setting the right estimator function.

 * I added an ANALYZE in the regression test. It still won't systematically
 test all the cost estimation code, and there's nothing to check that the
 estimates make sense, but at least more of the code will now run.


You also forgot to include intarray--1.0--1.1.sql in the patch. I've added
it as well.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


intarray-sel-4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-07-22 Thread Alexander Korotkov
On Wed, Jul 22, 2015 at 8:48 AM, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Mon, Jul 20, 2015 at 9:18 PM, Alexander Korotkov
 a.korot...@postgrespro.ru wrote:
  attached patch allows pg_rewind to work when target timeline was
 switched.
  Actually, this patch fixes TODO from pg_rewind comments.
 
/*
 * Trace the history backwards, until we hit the target timeline.
 *
 * TODO: This assumes that there are no timeline switches on the target
 * cluster after the fork.
 */
 
  This patch allows pg_rewind to handle data directory synchronization in
  a much more general way. For instance, a user can return a promoted
  standby to the old master.

 Yes. That would be useful.

  In this patch the target timeline history is exposed as a global
  variable. An index into the target timeline history is used in function
  interfaces instead of specifying the TLI directly. Thus,
  SimpleXLogPageRead() can easily start reading XLOGs from the next
  timeline when the current timeline ends.

 The implementation looks rather neat by having a first look at it, but
 the attached patch fails to compile:
 pg_rewind.c:488:4: error: use of undeclared identifier 'targetEntry'
 targetEntry = targetHistory[i];
 ^
 Nice to see as well that this has been added to the CF of September.


Uh, sorry. Fixed version is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-2.patch
Description: Binary data



Re: [HACKERS] Selectivity estimation for intarray with @@

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 5:14 PM, Teodor Sigaev teo...@sigaev.ru wrote:

 Some notices about code:

 1 Near function transformOperator() there is wrong operator name @


Fixed.


 2 int_query (and intquery) should be replaced to query_int to be
 consistent with actual type name. At least where it's used as separate
 lexeme.


Fixed.


 3 For historical reasons @@ doesn't commutate with itself, it has a
 commutator ~~. Patch assumes that @@ is self-commutator, but ~~ will use
 'contsel' as a restrict estimation. Suppose, we need to declare ~~ as
 deprecated and introduce commutating @@ operator.


I think deprecating ~~ is a subject for a separate patch. I fixed the
patch's behavior in a different way: @@ and ~~ are now handled by the same
function. The function handles "var @@ const" and "const ~~ var", but
doesn't handle "const @@ var" and "var ~~ const". It determines the case by
the type of the variable: it should be int[].

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


intarray_sel-3.patch
Description: Binary data



Re: [HACKERS] Fillfactor for GIN indexes

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 7:20 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/21/2015 04:14 PM, Alexander Korotkov wrote:

 On Tue, Jul 21, 2015 at 3:52 PM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

  On 07/21/2015 02:56 PM, Alexander Korotkov wrote:

  Probably, but currently we are in quite an illogical situation. We have
 a default fillfactor = 90 for GiST, where it has no use cases at all and
 is effectively just a waste of space.


 Why is it useless for GiST?



 It's because all GiST pages are produced by page splits. So, just after
 CREATE INDEX, GiST pages aren't tightly packed on average. Actually, they
 could be tightly packed by incredible coincidence, but for large indexes
 it's quite a safe assumption that they are not. With GiST we don't get a
 storm of page splits after index creation with fillfactor = 100. So, why
 should we reserve additional space with fillfactor = 90?


 Aha, I see. Yeah, that's pretty useless. Ideally, we would make the GiST
 build algorithm smarter so that it would pack the pages more tightly. I
 have no idea how to do that, however.

 Anyway, the fact that fillfactor is useless for GiST is more of an
 argument for removing it from GiST, than for adding it to GIN.


Yes it is. I just think that fillfactor is more useful for GIN than for
GiST now. However, it probably isn't worth it for either of them.
One of our customers complains that freshly built GIN indexes get bloated
very fast. I've asked them for a test case. Using that test case I will try
to see if fillfactor for GIN could help in practice. For now we can mark it
as Returned with feedback, where the feedback is "real-life use cases
needed".
Removing fillfactor from GiST is not easy now: once it's exposed to users,
removing it completely would break compatibility. I would propose to change
the default fillfactor for GiST from 90 to 100.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-07-17 Thread Alexander Korotkov
On Fri, Jul 17, 2015 at 6:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Peter Geoghegan p...@heroku.com writes:
  I've heard that clock_gettime() with CLOCK_REALTIME_COARSE, or with
  CLOCK_MONOTONIC_COARSE can have significantly lower overhead than
  gettimeofday().

 It can, but it also has *much* lower precision, typically 1ms or so.


I've written a simple benchmark of QueryPerformanceCounter() for Windows.
The source code follows.

#include <stdio.h>
#include <windows.h>
#include <Winbase.h>

int main(int argc, char* argv[])
{
LARGE_INTEGER start, freq, current;
long count = 0;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);
current = start;
while (current.QuadPart < start.QuadPart + freq.QuadPart)
{
QueryPerformanceCounter(&current);
count++;
}
printf("QueryPerformanceCounter() per second: %ld\n", count);
return 0;
}

On my virtual machine it runs 1532746 QueryPerformanceCounter() calls per
second. In contrast, my MacBook can natively run 26260236 gettimeofday()
calls per second.
So, the performance of PostgreSQL's instr_time.h can vary by more than an
order of magnitude. It's possible that we could find systems where
measurements of time are even much slower.
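For reference, a gettimeofday() counterpart of the Windows loop above could
look like the following POSIX-only sketch (the duration is made a parameter
so short runs are possible; this is my reconstruction, not the exact code
used on the MacBook):

```c
#include <stdio.h>
#include <sys/time.h>

/* Count how many gettimeofday() calls fit into the given number of
 * microseconds -- the same busy-loop shape as the Windows benchmark. */
static long
count_gettimeofday_calls(long duration_usec)
{
    struct timeval  start, current;
    long            count = 0;

    gettimeofday(&start, NULL);
    current = start;
    while ((current.tv_sec - start.tv_sec) * 1000000L +
           (current.tv_usec - start.tv_usec) < duration_usec)
    {
        gettimeofday(&current, NULL);
        count++;
    }
    return count;
}
```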
In general, there could be some systems where accurate measurement of time
intervals is impossible or slow. That means we should provide them some
different solution, like sampling. But does that mean we should force the
majority of systems to use sampling, which is both slower and less accurate
for them? Could we end up with different options for the user?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Proposal for CSN based snapshots

2015-07-24 Thread Alexander Korotkov
On Wed, Aug 27, 2014 at 10:42 AM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 08/27/2014 09:40 AM, Heikki Linnakangas wrote:

 On 08/27/2014 08:23 AM, Jeff Davis wrote:

 On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:

 Yeah. This patch in the current state is likely much much slower than
 unpatched master, except in extreme cases where you have thousands of
 connections and short transactions so that without the patch, you spend
 most of the time acquiring snapshots.


 What else are you looking to accomplish with this patch during this
 'fest? Bug finding? Design review? Performance testing?


 Design review, mostly. I know the performance still sucks. Although if
 you can foresee some performance problems, aside from the extra CSNLOG
 lookups, it would be good to know.


 I think for this commitfest, I've gotten as much review of this patch that
 I can hope for. Marking as Returned with Feedback. But of course, feel
 free to continue reviewing and commenting ;-).


What is the current state of this patch? Does the community want CSN-based
snapshots?
The last email in the lists was in August 2014. Huawei talked about their
further research on this idea at PGCon and promised to publish their patch
as open source.
http://www.pgcon.org/2015/schedule/events/810.en.html
However, it isn't published yet, and we don't know how long we would have
to wait for it.
Now, our company has resources to work on CSN-based snapshots for 9.6. If
Huawei does not publish their patch, we can pick up this work from the last
community version of this patch.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] Use pg_rewind when target timeline was switched

2015-07-20 Thread Alexander Korotkov
Hackers,

the attached patch allows pg_rewind to work when the target timeline was
switched. Actually, this patch fixes a TODO from the pg_rewind comments.

  /*
   * Trace the history backwards, until we hit the target timeline.
   *
   * TODO: This assumes that there are no timeline switches on the target
   * cluster after the fork.
   */

This patch allows pg_rewind to handle data directory synchronization in a
much more general way. For instance, a user can return a promoted standby
to the old master.

In this patch the target timeline history is exposed as a global variable.
An index into the target timeline history is used in function interfaces
instead of specifying the TLI directly. Thus, SimpleXLogPageRead() can
easily start reading XLOGs from the next timeline when the current timeline
ends.
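The core lookup can be sketched as follows. The types, field names and
function name are illustrative assumptions; in the patch the history comes
from parsing the target's timeline history file.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* One entry of the exposed target timeline history: which WAL range
 * belongs to which timeline. */
typedef struct TimeLineHistoryEntry
{
    uint32_t    tli;
    XLogRecPtr  begin;      /* inclusive */
    XLogRecPtr  end;        /* exclusive; 0 means "open-ended" */
} TimeLineHistoryEntry;

/* Return the index in the history whose range covers the given point.
 * A reader like SimpleXLogPageRead() can then step to index + 1 when it
 * reads past the end of the current timeline. */
static int
historyIndexOfPoint(XLogRecPtr ptr,
                    const TimeLineHistoryEntry *history, int nentries)
{
    int         i;

    for (i = 0; i < nentries; i++)
    {
        if (ptr >= history[i].begin &&
            (history[i].end == 0 || ptr < history[i].end))
            return i;
    }
    return -1;              /* point predates the recorded history */
}
```

Passing the history index around instead of a bare TLI is what lets the
page-read callback advance timelines transparently.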

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg_rewind_target_switch.patch
Description: Binary data



Re: [HACKERS] pg_trgm version 1.2

2015-07-20 Thread Alexander Korotkov
On Wed, Jul 15, 2015 at 12:31 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Tue, Jul 7, 2015 at 6:33 AM, Alexander Korotkov 
 a.korot...@postgrespro.ru wrote:



  See Tom Lane's comment about downgrade scripts. I think just removing it
  is the right solution.


 The new patch removes the downgrade path and the ability to install the
 old version.

 (If anyone wants an easy downgrade path for testing, they can keep using
 the prior patch--no functional changes)

 It also added a comment before the trigramsMatchGraph call.

 I retained the palloc and the loop to promote the ternary array to a
 binary array.  While I also think it is tempting to get rid of that by
 abusing the type system and would do it that way in my own standalone code,
 it seems contrary to way the project usually does things.  And I couldn't
 measure a difference in performance.


Ok!


 



  Let's consider the '^(?!.*def).*abc' regular expression as an example. It
  matches strings which contain 'abc' and don't contain 'def'.
 # SELECT 'abc' ~ '^(?!.*def).*abc';
  ?column?
 --
  t
 (1 row)

 # SELECT 'def abc' ~ '^(?!.*def).*abc';
  ?column?
 --
  f
 (1 row)

 # SELECT 'abc def' ~ '^(?!.*def).*abc';
  ?column?
 --
  f
 (1 row)

 Theoretically, our trigram regex processing could support negative
 matching of 'def' trigram, i.e. trigramsMatchGraph(abc = true, def = false)
 = true but trigramsMatchGraph(abc = true, def = true) = false. Actually, it
 doesn't because trigramsMatchGraph() implements a monotonic function. I
 just think it should be stated explicitly.


 Do you think it is likely to change to stop being monotonic and so support
 the (def=GIN_TRUE) = false case?

 ^(?!.*def)   seems like a profoundly niche situation.  (Although one that
 I might actually start using myself now that I know it isn't just a
 Perl-ism).

 It doesn't make any difference to this patch, other than perhaps how to
 word the comments.


Yes, it was just about the comments.

I also ran a few tests on a real-life dataset: a set of DBLP paper titles.
As expected, there is a huge speedup when the pattern is long enough.

# SELECT count(*) FROM dblp_titles WHERE s ILIKE '%Multidimensional Data%';
 count
---
   318
(1 row)

Time: 29,114 ms

# SELECT count(*) FROM dblp_titles WHERE s ~* '.*Multidimensional Data.*';
 count
---
   318
(1 row)

Time: 26,273 ms

# SELECT count(*) FROM dblp_titles WHERE s ILIKE '%Multidimensional Data%';
 count
---
   318
(1 row)

Time: 249,725 ms

# SELECT count(*) FROM dblp_titles WHERE s ~* '.*Multidimensional Data.*';
 count
---
   318
(1 row)

Time: 218,627 ms

For me, patch is in the pretty good shape. I'm going to mark it Ready for
committer.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-21 Thread Alexander Korotkov
 not as useful as btree
fillfactor, but it still could be useful in some cases...
Probably, we could leave 100 as default fillfactor, but provide an option.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 2:49 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 Hmm. That's slightly different from the test case I used, in that the
 update is changing the indexed value, which means that the new value goes
 to different posting tree than the old one. I'm not sure if that makes any
 difference. You're also updating more rows, 1/50 vs. 1/100.

 This case is closer to my earlier one:

 postgres=# create table foo (id int4, i int4, t text) with (fillfactor=90);
 CREATE TABLE
 postgres=# insert into foo select g, 1 from  generate_series(1, 100) g;
 INSERT 0 100
 postgres=# create index btree_foo_id on foo (id); -- to force non-HOT
 updates
 CREATE INDEX
 postgres=# create index gin_i on foo using gin (i) with (fastupdate=off);
 CREATE INDEX
 postgres=# \di+
List of relations
  Schema | Name | Type  | Owner  | Table |  Size   | Description
 +--+---++---+-+-
  public | btree_foo_id | index | heikki | foo   | 21 MB   |
  public | gin_i| index | heikki | foo   | 1072 kB |
 (2 rows)

 postgres=# update foo set id=-id where id % 100 = 0;
 UPDATE 1
 postgres=# \di+
List of relations
  Schema | Name | Type  | Owner  | Table |  Size   | Description
 +--+---++---+-+-
  public | btree_foo_id | index | heikki | foo   | 22 MB   |
  public | gin_i| index | heikki | foo   | 1080 kB |
 (2 rows)

 I'm not sure what's making the difference to your test case. Could be
 simply that your case happens to leave less free space on each page,
 because of the different data.


Yeah, it's likely because of different data.

 Based on this, I think we should just drop this patch. It's not useful in
 practice.


 I wouldn't say it's not useful at all. It's for sure not as useful as
 btree
 fillfactor, but it still could be useful in some cases...
 Probably, we could leave 100 as default fillfactor, but provide an option.


 Doesn't seem worth the trouble to me...


Probably, but currently we are in quite an illogical situation. We have a
default fillfactor = 90 for GiST, where it has no use cases at all and is
effectively just a waste of space. On the other hand, we're rejecting
fillfactor for GIN, where it could have at least some use cases... Should
we have fillfactor for neither GiST nor GIN?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-21 Thread Alexander Korotkov
On Tue, Jul 21, 2015 at 3:52 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/21/2015 02:56 PM, Alexander Korotkov wrote:

  Probably, but currently we are in quite an illogical situation. We have
  a default fillfactor = 90 for GiST, where it has no use cases at all and
  is effectively just a waste of space.


 Why is it useless for GiST?


It's because all GiST pages are produced by page splits. So, just after
CREATE INDEX, GiST pages aren't tightly packed on average. Actually, they
could be tightly packed by incredible coincidence, but for large indexes
it's quite a safe assumption that they are not. With GiST we don't get a
storm of page splits after index creation with fillfactor = 100. So, why
should we reserve additional space with fillfactor = 90?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Selectivity estimation for intarray with @@

2015-07-21 Thread Alexander Korotkov
Hi!

While Uriy is on vacation, I've revised this patch a bit.

On Tue, Jul 14, 2015 at 8:55 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 Hi Uriy,

 This patch looks pretty good.

 The first line of intarray--1.1.sql mis-identifies itself as /*
 contrib/intarray/intarray--1.0.sql */


Fixed.


 The real file intarray--1.0.sql file probably should not be included in
 the final patch, but I like having it for testing.


I've removed intarray--1.0.sql since it shouldn't be in final commit.


 It applies and builds cleanly over the alter operator patch (and now the
 commit as well), passes make check of the contrib module both with and
 without cassert.

 I could succesfully upgrade from version 1.0 to 1.1 without having to drop
 the gin__int_ops indexes in the process.

 I could do pg_upgrade from 9.2 and 9.4 to 9.6devel with large indexes in
 place, and then upgrade the extension to 1.1, and it worked without having
 to rebuild the index.

 It does what it says, and I think we want this.


Good.


 There were some cases where the estimates were not very good, but that
 seems to be limitation of what pg_stats makes available, not of this
 patch.  Specifically if the elements listed in the query text are not part
 of most_common_elems (or worse yet, most_common_elems is null) then it is
 left to guess with no real data, and those guesses can be pretty bad.  It
 is not this patches job to fix that, however.



 It assumes all the stats are independent and so doesn't account for
 correlation between members.  This is also how the core selectivity
 estimates work between columns, so I can't really complain about that.  It
  is easy to trick it with things like @@ '(!300 & 300)'::query_int, but I
 don't think that is necessary to fix that.


Analyzing all the dependencies inside a query is an NP-complete task. We
could try to work out simple cases, but the postgres optimizer currently
doesn't care about it.

# explain select * from test where val = 'val' and val != 'val';
 QUERY PLAN
-
 Seq Scan on test  (cost=0.00..39831.00 rows=45 width=8)
   Filter: ((val <> 'val'::text) AND (val = 'val'::text))
(2 rows)

I think we could do the same inside intquery until we figure out some
better solution for postgres optimizer in general.
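For illustration, the standard independence-based combination over a query
tree — which is exactly what makes '!300 & 300' mis-estimated — can be
sketched as follows. The node layout and names are made up for the example;
the patch's actual clause handling differs.

```c
/* Recursive selectivity of a boolean query tree under the independence
 * assumption: AND multiplies, OR adds minus the overlap, NOT inverts.
 * '!300 & 300' therefore gets p * (1 - p) instead of the true 0. */
typedef enum { Q_VAL, Q_AND, Q_OR, Q_NOT } QKind;

typedef struct QNode
{
    QKind               kind;
    double              selec;  /* per-element estimate, Q_VAL leaves only */
    const struct QNode *left;
    const struct QNode *right;  /* unused for Q_NOT */
} QNode;

static double
query_selec(const QNode *q)
{
    switch (q->kind)
    {
        case Q_VAL:
            return q->selec;
        case Q_AND:
            return query_selec(q->left) * query_selec(q->right);
        case Q_OR:
        {
            double      l = query_selec(q->left);
            double      r = query_selec(q->right);

            return l + r - l * r;
        }
        case Q_NOT:
            return 1.0 - query_selec(q->left);
    }
    return 0.0;
}
```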


 I have not been creative enough to come up with queries for which this
 improvement in selectivity estimate causes the execution plan to change in
 important ways.  I'm sure the serious users of this module would have such
 cases, however.

 I haven't tested gist__int_ops as I can't get those indexes to build in a
 feasible amount of time.  But the selectivity estimates are independent of
 any actual index, so testing those doesn't seem to be critical.

 There is no documentation change, which makes sense as this internal stuff
 which isn't documented to start with.


Yes. For instance, tsquery makes a very similar selectivity estimation to
intquery, but it's assumed to be internal and isn't mentioned in the
documentation.

There are no regression test changes.  Not sure about that, it does seem
 like regression tests would be desirable.


It would be nice to cover selectivity estimation with regression tests, but
AFAICS we haven't done it for any selectivity estimation functions yet.
The problem is that selectivity estimation is quite a complex process and
its result depends on random sampling during analyze, floating-point
operations and so on. We could make a test like "with a very high level of
confidence, the estimated number of rows here should be in [10;100]". But
it's hard to fit such assertions into our current regression test
infrastructure.

I haven't gone through the C code.


I also did some coding style and comments modifications.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


intarray_sel-2.patch
Description: Binary data



Re: [HACKERS] clearing opfuncid vs. parallel query

2015-10-23 Thread Alexander Korotkov
On Fri, Oct 23, 2015 at 12:31 PM, YUriy Zhuravlev <
u.zhurav...@postgrespro.ru> wrote:

> On Thursday 22 October 2015 09:26:46 David Fetter wrote:
> > On Thu, Oct 22, 2015 at 07:15:35PM +0300, YUriy Zhuravlev wrote:
> > > Hello.
> > > Currently using nodeToString and stringToNode you can not pass a
> > > full plan. In this regard, what is the plan to fix it? Or in the
> > > under task parallel query does not have such a problem?
> > >
> > > > This turns out not to be straightforward to code, because we don't
> > > > have a generic plan tree walker,
> > >
> > > I have an inner development. I am using python analyzing header
> > > files and generates a universal walker (parser, paths ,executer and
> > > etc trees), as well as the serializer and deserializer to jsonb.
> > > Maybe I should publish this code?
> >
> > Please do.
> Tom Lane and Robert Haas are very unhappy with a python. Is there any
> reason?
>

Requiring python with pycparser as a build dependency is a serious
cataclysm. For instance, how many buildfarm members would survive it?
That's why Tom and Robert are looking for ways to avoid it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-10-29 Thread Alexander Korotkov
On Thu, Sep 24, 2015 at 6:36 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Sep 24, 2015 at 6:32 PM, Andres Freund <and...@anarazel.de> wrote:
>
>> On 2015-09-15 20:16:10 +0300, YUriy Zhuravlev wrote:
>> > We will be tested.
>>
>> Did you have a chance to run some benchmarks?
>>
>
> Yes, we now have 60 physical cores intel server and we're running
> benchmarks on it.
>

We reached a consensus with Andres that we should commit the CAS version
first and look at other optimizations afterwards.
A refactored version of the atomic state patch is attached. The changes are
the following:
1) Macros are used to access refcount and usagecount.
2) likely/unlikely were removed. I think introducing likely/unlikely should
be a separate patch since it touches portability. Also, I didn't see any
performance effect from it.
3) LockBufHdr returns the state after taking the lock. Even without using
atomic increments, this can still save some loops by skipping a separate
atomic read of the value.
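In outline, the CAS-based pin being discussed looks like this C11-atomics
sketch. The bit layout and macro names are toy assumptions for the example;
the actual patch packs refcount, usage count and flags into the buffer
header state differently.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy state layout: low 16 bits = refcount, next 16 bits = usage count. */
#define BUF_REFCOUNT_ONE            1u
#define BUF_REFCOUNT_MASK           0xFFFFu
#define BUF_STATE_GET_REFCOUNT(s)   ((s) & BUF_REFCOUNT_MASK)

/* CAS loop replacing the buffer-header spinlock for pinning: read the
 * state once, then retry the compare-and-swap until no one raced us. */
static uint32_t
PinBuffer_cas(_Atomic uint32_t *state)
{
    uint32_t    oldstate = atomic_load(state);
    uint32_t    newstate;

    do
    {
        newstate = oldstate + BUF_REFCOUNT_ONE;
    } while (!atomic_compare_exchange_weak(state, &oldstate, newstate));

    return newstate;    /* like LockBufHdr here: hand back the fresh state */
}
```

Returning the post-CAS state is what lets callers inspect refcount and
flags without issuing another atomic read.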

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pinunpin-cas.patch
Description: Binary data



Re: [HACKERS] Some questions about the array.

2015-11-11 Thread Alexander Korotkov
On Mon, Nov 9, 2015 at 8:23 PM, Pavel Stehule <pavel.steh...@gmail.com>
wrote:

> 2015-11-09 17:55 GMT+01:00 Alexander Korotkov <a.korot...@postgrespro.ru>:
>
>> On Mon, Nov 9, 2015 at 4:53 PM, Pavel Stehule <pavel.steh...@gmail.com>
>> wrote:
>>
>>> 2015-11-09 14:44 GMT+01:00 YUriy Zhuravlev <u.zhurav...@postgrespro.ru>:
>>>
>>>> On Monday 09 November 2015 13:50:20 Pavel Stehule wrote:
>>>> > New symbols increase a complexity of our code and our documentation.
>>>> >
>>>> > If some functionality can be implemented via functions without
>>>> performance
>>>> > impacts, we should not to create new operators or syntax - mainly for
>>>> > corner use cases.
>>>> >
>>>> > Regards
>>>> >
>>>> > Pavel
>>>>
>>>> Ok we can use {:} instead [:] for zero array access.
>>>> The function is the solution half.
>>>>
>>>
>>> It isn't solution. The any syntax/behave change have to have stronger
>>> motivation. We had so talk about it 20 years ago :(
>>>
>>
>> Assuming array[~n] already has a meaning, could we try some new syntax
>> that doesn't have a current meaning? I'm not yet sure what exactly it
>> could be...
>>
>
> Using this syntax can introduce compatibility issues -
> http://www.postgresql.org/docs/9.1/static/sql-createoperator.html
>

I actually meant some other syntax which doesn't introduce compatibility
issues. For instance, array{n} doesn't have a meaning in the current
syntax, AFAICS.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Some questions about the array.

2015-11-09 Thread Alexander Korotkov
On Mon, Nov 9, 2015 at 4:53 PM, Pavel Stehule <pavel.steh...@gmail.com>
wrote:

> 2015-11-09 14:44 GMT+01:00 YUriy Zhuravlev <u.zhurav...@postgrespro.ru>:
>
>> On Monday 09 November 2015 13:50:20 Pavel Stehule wrote:
>> > New symbols increase a complexity of our code and our documentation.
>> >
>> > If some functionality can be implemented via functions without
>> performance
>> > impacts, we should not to create new operators or syntax - mainly for
>> > corner use cases.
>> >
>> > Regards
>> >
>> > Pavel
>>
>> Ok we can use {:} instead [:] for zero array access.
>> The function is the solution half.
>>
>
> It isn't solution. The any syntax/behave change have to have stronger
> motivation. We had so talk about it 20 years ago :(
>

Assuming array[~n] already has a meaning, could we try some new syntax that
doesn't have a current meaning? I'm not yet sure what exactly it could be...

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-11-03 Thread Alexander Korotkov
On Tue, Nov 3, 2015 at 2:36 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> I'm kind of inclined to just let the verifiers read the catalogs for
> >> themselves.  AFAICS, a loop around the results of SearchSysCacheList
> >> is not going to be significantly more code than what this patch does,
> >> and it avoids presuming that we know what the verifiers will wish to
> >> look at.
>
> > Hmm, so this amounts to saying the verifier can only run after the
> > catalog rows are written.  Is that okay?
>
> Why not?  Surely we are not interested in performance-optimizing for
> the case of a detectably incorrect CREATE OP CLASS command.
>
> I don't actually care that much about running the verifiers during
> CREATE/etc OP CLASS at all.  It's at least as important to be able
> to run them across the built-in opclasses, considering all the chances
> for human error in constructing those things.  Even in the ALTER OP CLASS
> case, the verifier would likely need to look at existing catalog rows as
> well as the new ones.  So basically, I see zero value in exposing CREATE/
> ALTER OP CLASS's internal working representation to the verifiers.
>

I'm OK with validating the opclass directly against the system catalog,
i.e. looping over SearchSysCacheList results. Teodor told me something
similar in person.
I'll also try to rearrange the header according to the comments upthread.
I'll also try to rearrange header according to the comments upthread.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-10-30 Thread Alexander Korotkov
On Thu, Oct 29, 2015 at 8:18 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Sep 24, 2015 at 6:36 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> On Thu, Sep 24, 2015 at 6:32 PM, Andres Freund <and...@anarazel.de>
>> wrote:
>>
>>> On 2015-09-15 20:16:10 +0300, YUriy Zhuravlev wrote:
>>> > We will be tested.
>>>
>>> Did you have a chance to run some benchmarks?
>>>
>>
>> Yes, we now have a 60-physical-core Intel server, and we're running
>> benchmarks on it.
>>
>
> Andres and I reached a consensus that we should commit the CAS version
> first and look at other optimizations afterwards.
> A refactored version of the atomic-state patch is attached. The changes are
> as follows:
> 1) Macros are now used to access refcount and usagecount.
> 2) likely/unlikely annotations were removed. I think introducing
> likely/unlikely should be a separate patch, since it touches portability.
> Also, I didn't see any performance effect from them.
> 3) LockBufHdr now returns the state after taking the lock. Even without
> atomic increments, this can still save some loops by skipping a read of
> the atomic value.
>

pinunpin-cas-original-fix.patch is just the original patch by Andres Freund
with a fix for the bug that caused a hang.
A performance comparison on a 72-core Intel server is attached. On this
machine we see no regression with the version of the patch from the
previous letter.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
n_clients,master,pinunpin-cas-original-fix,pinunpin-cas
   1,   18860,   19421,   19972
   8,  162503,  166901,  164363
  16,  316429,  321103,  319214
  36,  627215,  641028,  642854
  56,  745108,  774195,  786642
  90,  847150,  995036, 1022657
 100,  802922, 1077732, 1083769
 110,  615070, 1014446, 1034496
 120,  611956, 1074345, 1094610
 130,  571697, 1082626, 1100578
 140,  579909, 1075855, 1092749
 150,  540442, 1070737, 1089283
 160,  545942, 1068991, 1096360
 170,  515444, 1073508, 1089891
 180,  520867, 1076281, 1093987
 190,  488836, 1071283, 1097871
 200,  493795, 1082849, 1103040
 210,  463765,  967185,  988692
 220,  467661,  972180,  993521
 230,  437297,  976865,  980113
 250,  420711,  956151,  980965


pinunpin-cas-original-fix.patch
Description: Binary data



Re: [HACKERS] Summary of Vienna sharding summit, new TODO item

2015-11-07 Thread Alexander Korotkov
On Sat, Nov 7, 2015 at 7:52 PM, Bruce Momjian <br...@momjian.us> wrote:

> On Wed, Nov  4, 2015 at 07:58:52AM -0500, Bruce Momjian wrote:
> > [BCC to pgsql-cluster-hack...@postgresql.org]
> >
> > I have summarized the Vienna cluster summit meeting from last week by
> > adding italicized answers to the agenda questions, even though we didn't
> > follow the agenda ;-) :
> >
> >   https://wiki.postgresql.org/wiki/PG-EU_2015_Cluster_Summit
> >
> > I would like to propose the following TODO item:
> >
> >   *  Enhance foreign data wrappers, parallelism, partitioning,
> >   and perhaps add a global snapshot/transaction manager to allow
> >   creation of a proof-of-concept built-in sharding solution
> >
> >   Ideally these enhancements and new facilities will be available to
> >   external sharding solutions as well.
> >
> > Is this acceptable?
>
> Added to TODO.


Great, thank you, Bruce!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Access method extendability

2015-11-02 Thread Alexander Korotkov
Hi!

Thank you for review!

On Mon, Sep 7, 2015 at 6:41 PM, Teodor Sigaev <teo...@sigaev.ru> wrote:

> Some notices:
>
> 1) create-am.3.patch.gz
>   As I understand, you didn't add schema name to access method. Why?
> Suppose, if we implement SQL-like interface for AM screation/dropping then
> we should provide a schema qualification for them
>

Per the subsequent discussion, it's unclear why we need to make access
methods schema-qualified.


> 2) create-am.3.patch.gz get_object_address_am()
> +   switch (list_length(objname))
> ...
> +   case 2:
> +   catalogname = strVal(linitial(objname));
> +   amname = strVal(lthird(objname));
> ^^ seems, it should be
> lsecond()
>

Fixed.


> 3) create-am.3.patch.gz
>  Suppose, RM_GENERIC_ID is part of generic-xlog.3.patch.gz
>

Fixed.

> 4) Makefile(s)
> As I can see, object files are lexicographically ordered
>

Fixed.

> 5) gindesc.c -> genericdesc.c in file header
>

Fixed.


> 6) generic-xlog.3.patch.gz
> I don't like an idea to split START_CRIT_SECTION and END_CRIT_SECTION to
> GenericXLogStart and GenericXLogFinish. User's code could throw a error or
> allocate memory between them and error will become a panic.


Fixed. Now, critical section is only inside GenericXLogFinish.
GenericXLogRegister returns pointer to the page copies which could be
modified. GenericXLogFinish swaps the data between page copy and page
itself. That allow us to avoid critical section in used code.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


generic-xlog.4.patch.gz
Description: GNU Zip compressed data


create-am.4.patch.gz
Description: GNU Zip compressed data


bloom-contrib.4.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-10-14 Thread Alexander Korotkov
On Wed, Sep 30, 2015 at 9:46 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Sat, Sep 19, 2015 at 8:27 AM, Michael Paquier wrote:
> > On Fri, Sep 18, 2015 at 6:25 PM, Michael Paquier wrote:
> >> The refactoring of getTimelineHistory as you propose looks like a good
> >> idea to me, I tried to remove by myself the difference between source
> >> and target in copy_fetch.c and friends but this gets uglier,
> >> particularly because of datadir_source in copy_file_range. Not worth
> >> it.
> >
> > Forgot that:
> > if (ControlFile_target.state != DB_SHUTDOWNED)
> > pg_fatal("target server must be shut down cleanly\n");
> > We may want to allow a target node shutdowned in recovery as well here.
>
> So, attached is a more polished version of this patch, cleaned up of
> its typos with as well other things. I have noticed for example that
> it would be more useful to add the debug information of a timeline
> file fetched from the source or a target server directly in
> getTimelineHistory. I have as well updated a couple of comments in the
> code regarding the fact that we do not necessarily use a master as a
> target node, and mentioned in findCommonAncestorTimeline that we check
> as well the start position of a timeline to cover the case where both
> target and source node forked at the same timeline number but with a
> different WAL fork position.
> I am marking this patch as ready for committer. It would be cool in
> the future to use the recovery test suite to have more advanced
> scenarios tested, but it seems a shame to block this patch because of
> that.
>

Thanks a lot. The patch looks much better now.
I found that it doesn't apply cleanly to current master. A rebased version
is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-6.patch
Description: Binary data



Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-10-14 Thread Alexander Korotkov
On Sat, Sep 19, 2015 at 2:25 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> OK, I see your point and you are right. This additional check allows
> pg_rewind to switch one timeline back and make the scan of blocks
> begin at the real origin of both timelines. I had in mind the case
> where you tried to actually rewind node 2 to node 3 actually which
> would not be possible with your patch, and by thinking more about that
> I think that it could be possible to remove the check I am listing
> below and rely on the checks in the history files, basically what is
> in findCommonAncestorTimeline:
> if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
> ControlFile_source.checkPointCopy.ThisTimeLineID)
> pg_fatal("source and target cluster are on the same
> timeline\n");
> Alexander, what do you think about that? I think that we should be
> able to rewind with for example node 2 as target and node 3 as source,
> and vice-versa as per the example you gave even if both nodes are on
> the same timeline, just that they forked at a different point. Could
> you test this particular case? Using my script, simply be sure to
> switch archive_mode to on/off depending on the node, aka only 3 and 4
> do archiving.
>

I think relying on a different fork point is not safe enough. Imagine a
more complex case.

  1
 / \
2   3
|   |
4   5

At first, nodes 2 and 3 are promoted at the same point, and both get
timeline 2.
Then nodes 4 and 5 are promoted at different points, and both get
timeline 3.
Then we can try to rewind node 4 with node 5 as the source, or vice versa.
In this case we can't detect the collision on timeline 2.

The same collision could happen even when source and target are on
different timeline numbers. However, being on the same timeline number is
suspicious enough to disallow the rewind until we have additional checks.

I would propose the following plan:

   1. Commit this patch without allowing a rewind when target and source
   are on the same timeline.
   2. Add checks that distinguish different timelines which happen to have
   the same number.
   3. Allow a rewind when target and source are on the same timeline.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] PoC: Partial sort

2015-10-16 Thread Alexander Korotkov
On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <p...@heroku.com> wrote:

> On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andr...@proxel.se>
> wrote:
> > Are you planning to work on this patch for 9.6?
>
> FWIW I hope so. It's a nice patch.
>

I'm trying to dust it off. A rebased version of the patch is attached. This
patch isn't passing the regression tests because of plan changes. I'm not
yet sure about those changes: why do they happen, and are they really
regressions? Since I'm not very familiar with the planning of INSERT ON
CONFLICT and RLS, any help is appreciated.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


partial-sort-basic-3.patch
Description: Binary data


regression.diffs
Description: Binary data



Re: [HACKERS] PoC: Partial sort

2015-10-20 Thread Alexander Korotkov
On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov <aekorot...@gmail.com>
wrote:

> On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <p...@heroku.com> wrote:
>
>> On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andr...@proxel.se>
>> wrote:
>> > Are you planning to work on this patch for 9.6?
>>
>> FWIW I hope so. It's a nice patch.
>>
>
> I'm trying to dust it off. A rebased version of the patch is attached.
> This patch isn't passing the regression tests because of plan changes.
> I'm not yet sure about those changes: why do they happen, and are they
> really regressions? Since I'm not very familiar with the planning of
> INSERT ON CONFLICT and RLS, any help is appreciated.
>

The planner regression is fixed in the attached version of the patch. It
appears that get_cheapest_fractional_path_for_pathkeys() behaved
incorrectly when no ordering is required.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


partial-sort-basic-4.patch
Description: Binary data



Re: [HACKERS] point_ops for GiST

2015-10-12 Thread Alexander Korotkov
Hi, Stefan!

On Sun, Oct 11, 2015 at 10:00 PM, Stefan Keller <sfkel...@gmail.com> wrote:

> Pls. don't misunderstand my questions: They are directed to get an
> even more useful spatial data handling of PostgreSQL. I'm working with
> PostGIS since years and are interested in any work regarding spatial
> types...
>
> Can anyone report use cases or applications of these built-in geometric
> types?
>
> Would'nt it be even more useful to concentrate to PostGIS
> geometry/geography types and extend BRIN to these types?
>

Note that PostGIS is a different project maintained by a separate team.
PostGIS has its own priorities, development plan, etc.
PostgreSQL has to be self-consistent. In particular, it should have
reference implementations of operator classes which extensions can use as
a pattern. This is why it's important to maintain the built-in geometric
types.

In short: once we implement it for the built-in geometric types, you can
ask the PostGIS team to do the same for their geometry/geography types.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] point_ops for GiST

2015-10-12 Thread Alexander Korotkov
On Mon, Oct 12, 2015 at 12:47 PM, Emre Hasegeli <e...@hasegeli.com> wrote:

> > This was already fixed for GiST.
> > See following discussion
> >
> http://www.postgresql.org/message-id/capphfdvgticgniaj88vchzhboxjobuhjlm6c09q_op_u9eo...@mail.gmail.com
> > and commit
> >
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3c29b196b0ce46662cb9bb7a1f91079fbacbcabb
> > "Consistent" method of GiST influences only search and can't lead to
> corrupt
> > indexes. However, "same" method can lead to corrupt indexes.
> > However, this is not the reason to not backpatch the changes and preserve
> > buggy behaviour; this is the reason to recommend reindexing to users.
> And it
> > was already backpatched.
>
> Fixing it on the opclass is not an option for BRIN.  We designed BRIN
> opclasses extensible using extra SQL level support functions and
> operators.  It is possible to support point datatype using box vs
> point operators.  Doing so would lead to wrong results, because point
> operators use FP macros, but box_contain_pt() doesn't.
>

You can still work around this problem in the opclass. For instance, you
can assign a different strategy number for this case and call another
support function instead of the overlap operator in
brin_inclusion_consistent. For sure, this would be a kludge.


> GiST opclass could be more clean and extensible, if we wouldn't have
> those macros.
>

In my opinion it would be good to remove the FP macros; I see absolutely no
sense in them. But since that would break compatibility, it would be quite
hard.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] point_ops for GiST

2015-10-12 Thread Alexander Korotkov
Hi, Alvaro!

On Tue, May 12, 2015 at 9:13 PM, Alvaro Herrera <alvhe...@2ndquadrant.com>
wrote:

> Robert Haas wrote:
> > 2009/12/30 Teodor Sigaev <teo...@sigaev.ru>:
> > > Sync with current CVS
> >
> > I have reviewed this patch and it looks good to me.  The only
> > substantive question I have is why gist_point_consistent() uses a
> > different coding pattern for the box case than it does for the polygon
> > and circle cases?  It's not obvious to me on the face of it why these
> > aren't consistent.
>
> Emre Hasegeli just pointed out to me that this patch introduced
> box_contain_pt() and in doing so used straight C comparison (<= etc)
> instead of FPlt() and friends.  I would think that that's a bug and
> needs to be changed -- but certainly not backpatched, because gist
> indexes would/might become corrupt.
>

This was already fixed for GiST.
See following discussion
http://www.postgresql.org/message-id/capphfdvgticgniaj88vchzhboxjobuhjlm6c09q_op_u9eo...@mail.gmail.com
and
commit
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3c29b196b0ce46662cb9bb7a1f91079fbacbcabb
GiST's "consistent" method influences only searches and can't lead to
corrupt indexes. The "same" method, however, can lead to corrupt indexes.
Still, that is not a reason to skip backpatching the changes and preserve
the buggy behaviour; it is a reason to recommend that users reindex. And
the fix was already backpatched.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Some questions about the array.

2015-10-09 Thread Alexander Korotkov
On Fri, Oct 9, 2015 at 6:27 PM, Andrew Dunstan <and...@dunslane.net> wrote:

> On 10/09/2015 08:02 AM, YUriy Zhuravlev wrote:
>
>> We were some of the issues associated with the behavior of arrays.
>> 1. We would like to implement arrays negative indices (from the end) like
>> in
>> Python or Ruby: arr[-2]  or arr[1: -1]
>> but as an array can be indexed in the negative area so it probably can
>> not be
>> done.
>> 2. We would like to add the ability be omitted boundaries in the slice.
>> Example: arr[2:] or arr[:2]. But there was a problem with the update of an
>> empty array:
>> arr[1:][1:] = {1,2,3,4,5,6} can be interpreted as
>> arr[1:3][1:2] or arr[1:2] [1:3] or [1:1], [1:6]
>>
>> What is the history of the emergence of such arrays? Maybe something can
>> be
>> improved?
>>
>> P.S. I would like List datatype as in Python.  Is there any fundamental
>> objections? Or we just did not have the time and enthusiasm before?
>> The current implementation I would call vectors or matrices but not
>> arrays.
>> IMHO
>>
>>
>>
> The name array is now far too baked in to change it.
>
> jsonb and json arrays have many of the characteristics you seem to want.
> They are always 0-based and negative indexes count from the end. They also
> don't have to be regular, unlike our native arrays.
>

jsonb and json arrays support a very limited set of types. Casting other
datatypes to/from text is an option, but it is both awkward and not
space-efficient.

Omitted boundaries in slices look nice to me. Regarding the problem with
empty arrays, the current behaviour of empty-array updates doesn't look
consistent to me.
When updating a non-empty array, its boundaries are not extended: if one
updates a non-empty array outside its boundaries, one gets "ERROR:  array
subscript out of range".
If we extrapolate this logic to empty arrays, then this error should be
thrown on any update of an empty array. Despite this, we allow any update
of an empty array.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-10 Thread Alexander Korotkov
On Fri, Jul 10, 2015 at 4:54 AM, Michael Paquier michael.paqu...@gmail.com
wrote:



 On Thu, Jul 9, 2015 at 10:33 PM, Alexander Korotkov wrote:
  [...]

 + /* Caclculate max data size on page according to fillfactor */
 s/Caclculate/Calculate

 When creating a simple gin index, I am seeing that GinGetMaxDataSize gets
 called with ginEntryInsert:
   * frame #0: 0x00010a49d72e
 postgres`GinGetMaxDataSize(index=0x000114168378, isBuild='\x01') + 14
 at gindatapage.c:436
 frame #1: 0x00010a49ecbe
 postgres`createPostingTree(index=0x000114168378,
 items=0x000114a9c038, nitems=35772, buildStats=0x7fff55787350) +
 302 at gindatapage.c:1754
 frame #2: 0x00010a493423
 postgres`buildFreshLeafTuple(ginstate=0x7fff55784d90, attnum=1,
 key=2105441, category='\0', items=0x000114a9c038, nitem=35772,
 buildStats=0x7fff55787350) + 339 at gininsert.c:158
 frame #3: 0x00010a492f84
 postgres`ginEntryInsert(ginstate=0x7fff55784d90, attnum=1, key=2105441,
 category='\0', items=0x000114a9c038, nitem=35772,
 buildStats=0x7fff55787350) + 916 at gininsert.c:228

 And as far as I can see GinGetMaxDataSize uses isBuild:
 + int
 + GinGetMaxDataSize(Relation index, bool isBuild)
 + {
 + int fillfactor, result;
 +
 + /* Grab option value which affects only index build */
 + if (!isBuild)
 However isBuild is not set in this code path, see
 http://www.postgresql.org/message-id/cab7npqsc4vq9mhkqm_yvafcteho-iuy8skbxydnmgnai1xn...@mail.gmail.com
 where I reported the problem. So this code is somewhat broken, no? I am
 attaching to this email the patch in question that should be applied on top
 of Alexander's latest version.


I didn't get what the problem is. Was this stacktrace taken during an index
build? If so, GinGetMaxDataSize really should get isBuild='\x01'. It is set
by the following line:

maxdatasize = GinGetMaxDataSize(index, buildStats ? true: false);


 + #define GIN_MIN_FILLFACTOR20
 + #define GIN_DEFAULT_FILLFACTOR90
 I am still worrying about using a default fillfactor at 90, as we did a
 lot of promotion about compression of GIN indexes in 9.4. Feel free to
 ignore this comment if you think that 90 is a good default. The difference
 is visibly in order of MBs even for large indexes still, this is changing
 the default of 9.4 and 9.5.


On the other hand, it's unclear why we should have a default fillfactor
distinct from btree's.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] pg_trgm version 1.2

2015-07-07 Thread Alexander Korotkov
On Tue, Jun 30, 2015 at 11:28 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov 
 a.korot...@postgrespro.ru wrote:

 On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 This patch implements version 1.2 of contrib module pg_trgm.

 This supports the triconsistent function, introduced in version 9.4 of
 the server, to make it faster to implement indexed queries where some keys
 are common and some are rare.


 Thank you for the patch! Lack of pg_trgm triconsistent support was
 significant miss after fast scan implementation.


 I've included the paths to both upgrade and downgrade between 1.1 and
 1.2, although after doing so you must close and restart the session before
 you can be sure the change has taken effect. There is no change to the
 on-disk index structure


 pg_trgm--1.1.sql andpg_trgm--1.1--1.2.sql are useful for debug, but do
 you expect them in final commit? As I can see in other contribs we have
 only last version and upgrade scripts.



 I had thought that pg_trgm--1.1.sql was needed for pg_upgrade to work, but
 I see that that is not the case.

 I did see another downgrade path for different module, but on closer
 inspection it was one that I wrote while trying to analyze that module, and
 then never removed.  I have no objection to removing pg_trgm--1.2--1.1.sql
 before the commit, but I don't see what we gain by doing so.  If a
 downgrade is feasible and has been tested, why not include it?


See Tom Lane's comment about downgrade scripts. I think just removing it is
the right solution.


 You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in
 core) from 4 to some higher value (which it probably should be anyway, but
 there will always be a case where it needs to be higher than you can afford
 it to be, so a real solution is needed).


 Actually, it depends on how long it takes to calculate consistent
 function. The cheaper consistent function is, the higher MAX_MAYBE_ENTRIES
 could be. Since all functions in PostgreSQL may define its cost, GIN could
 calculate MAX_MAYBE_ENTRIES from the cost of consistent function.


 The consistent function gets called on every candidate tuple, so if it is
 expensive then it is also all the more worthwhile to reduce the set of
 candidate tuples.  Perhaps MAX_MAYBE_ENTRIES could be calculated from the
 log of the maximum of the predictNumberResult entries? Anyway, a subject
 for a different day


Yes, that's also a point. However, we optimize not only the cost of
consistent calls but also the cost of fetching candidate item pointers,
which involves both CPU and I/O.

There may also be some gains in the similarity and regex cases, but I
 didn't really analyze those for performance.



 I've thought about how to document this change.  Looking to other
 example of other contrib modules with multiple versions, I decided that we
 don't document them, other than in the release notes.


 The same patch applies to 9.4 code with a minor conflict in the
 Makefile, and gives benefits there as well.


 Some other notes about the patch:
 * You allocate boolcheck array and don't use it.


 That was a bug (at least notionally).  trigramsMatchGraph was supposed to
 be getting boolcheck, not check, passed in to it.

 It may not have been a bug in practise, because GIN_MAYBE and GIN_TRUE
 both test as true when cast to booleans. But it seems wrong to rely on
 that.  Or was it intended to work this way?

 I'm surprised the compiler suite doesn't emit some kind of warning on that.


I'm not sure. It's attractive to get rid of the useless memory allocation
and copying.
You could also change trigramsMatchGraph() to take a GinTernaryValue *
argument. Casting bool * to GinTernaryValue * probably looks better than
the inverse cast.




 * Check coding style and formatting, in particular check[i]==GIN_TRUE
 should be check[i] == GIN_TRUE.


 Sorry about that, fixed.  I also changed it in other places to check[i] !=
 GIN_FALSE, rather than checking against both GIN_TRUE and GIN_MAYBE.  At
 first I was concerned we might decide to add a 4th option to the type which
 would render  != GIN_FALSE the wrong way to test for it.  But since it is
 called GinTernaryValue, I think we wouldn't add a fourth thing to it.  But
 perhaps the more verbose form is clearer to some people.


Thanks!


 * I think some comments needed in gin_trgm_triconsistent() about
 trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph()
 that way because trigramsMatchGraph() implements monotonous boolean
 function.


 I have a function-level comment that in all cases, GIN_TRUE is at least as
 favorable to inclusion of a tuple as GIN_MAYBE.  Should I reiterate that
 comment before trigramsMatchGraph() as well?  Calling it a monotonic
 boolean function is precise and concise, but probably less understandable
 to people reading the code.  At least, if I saw that, I would need to go do
 some reading before I knew what it meant.


Let's

Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-07-07 Thread Alexander Korotkov
 (offset 71 lines).
patching file src/include/nodes/plannodes.h
Hunk #1 succeeded at 586 (offset 43 lines).
patching file src/include/nodes/relation.h
Hunk #1 succeeded at 1045 (offset 14 lines).
Hunk #2 succeeded at 1422 (offset 14 lines).
patching file src/include/optimizer/cost.h
Hunk #1 succeeded at 114 (offset 1 line).
patching file src/include/optimizer/pathnode.h
Hunk #1 succeeded at 91 (offset 2 lines).
Hunk #2 succeeded at 104 (offset 2 lines).
Hunk #3 succeeded at 119 (offset 2 lines).
patching file src/include/optimizer/planmain.h
Hunk #1 succeeded at 124 (offset 2 lines).
patching file src/test/regress/expected/equivclass.out
patching file src/test/regress/expected/join.out
Hunk #1 succeeded at 2656 (offset 42 lines).
Hunk #2 succeeded at 3428 (offset 90 lines).
Hunk #3 succeeded at 4506 (offset 90 lines).
patching file src/test/regress/expected/rowsecurity.out
Hunk #1 succeeded at 274 (offset 26 lines).
patching file src/test/regress/expected/select_views.out
Hunk #1 succeeded at 1411 (offset 46 lines).
Hunk #2 succeeded at 1432 (offset 46 lines).
Hunk #3 succeeded at 1466 (offset 46 lines).
Hunk #4 succeeded at 1497 (offset 46 lines).
patching file src/test/regress/expected/select_views_1.out
Hunk #1 succeeded at 1411 (offset 46 lines).
Hunk #2 succeeded at 1432 (offset 46 lines).
Hunk #3 succeeded at 1466 (offset 46 lines).
Hunk #4 succeeded at 1497 (offset 46 lines).
patching file src/test/regress/sql/join.sql
Hunk #1 succeeded at 1344 (offset 37 lines).

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fillfactor for GIN indexes

2015-07-09 Thread Alexander Korotkov
Hi!

On Wed, Jul 8, 2015 at 10:27 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 In dataPlaceToPageLeaf-function:

  if (append)
 {
 /*
  * Even when appending, trying to append more items than
 will fit is
  * not completely free, because we will merge the new
 items and old
  * items into an array below. In the best case, every new
 item fits in
  * a single byte, and we can use all the free space on
 the old page as
  * well as the new page. For simplicity, ignore segment
 overhead etc.
  */
 maxitems = Min(maxitems, freespace +
 GinDataPageMaxDataSize);
 }


 Hmm. So after splitting the page, there is freespace +
 GinDataPageMaxDataSize bytes available on both pages together. freespace
 has been adjusted with the fillfactor, but GinDataPageMaxDataSize is not.
 So this overshoots, because when leafRepackItems actually distributes the
 segments on the pages, it will fill both pages only up to the fillfactor.
 This is an upper bound, so that's harmless, it only leads to some
 unnecessary work in dealing with the item lists. But I think that should be:

 maxitems = Min(maxitems, freespace + leaf-maxdatasize);


Good catch! Fixed.



  else
 {
 /*
  * Calculate a conservative estimate of how many new
 items we can fit
  * on the two pages after splitting.
  *
  * We can use any remaining free space on the old page to
 store full
  * segments, as well as the new page. Each full-sized
 segment can hold
  * at least MinTuplesPerSegment items
  */
 int nnewsegments;

 nnewsegments = freespace / GinPostingListSegmentMaxSize;
 nnewsegments += GinDataPageMaxDataSize /
 GinPostingListSegmentMaxSize;
 maxitems = Min(maxitems, nnewsegments *
 MinTuplesPerSegment);
 }


 This branch makes the same mistake, but this is calculating a lower bound.
 It's important that maxitems is not set to higher value than what actually
 fits on the page, otherwise you can get an ERROR later in the function,
 when it turns out that not all the items actually fit on the page. The
 saving grace here is that this branch is never taken when building a new
 index, because index build inserts all the TIDs in order, but that seems
 pretty fragile. Should use leaf-maxdatasize instead of
 GinDataPageMaxDataSize here too.


Fixed.


 But that can lead to funny things, if fillfactor is small, and BLCKSZ is
 small too. With the minimums, BLCKSZ=1024 and fillfactor=0.2, the above
 formula will set nnewsegments to zero. That's not going to end up well. The
 problem is that maxdatasize becomes smaller than
 GinPostingListSegmentMaxSize, which is a problem. I think
 GinGetMaxDataSize() needs to make sure that the returned value is always >=
 GinPostingListSegmentMaxSize.


Fixed.


 Now that we have a fillfactor, shouldn't we replace the 75% heuristic
 later in that function, when inserting to the rightmost page rather than
 building it from scratch? In B-tree, the fillfactor is applied to that case
 too.


Sounds reasonable. Now it works like btree.
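For reference, the corrected estimates discussed above can be modeled roughly as follows (a hypothetical Python sketch with made-up constants, not the actual C code; `maxdatasize` stands for the fillfactor-adjusted leaf->maxdatasize):

```python
def max_data_size(blcksz, fillfactor, segment_max_size):
    """Fillfactor-adjusted usable space on a leaf data page, clamped so it
    never drops below one full posting-list segment (hypothetical model)."""
    raw = int(blcksz * fillfactor)
    return max(raw, segment_max_size)

def maxitems_append(maxitems, freespace, maxdatasize):
    # Upper bound for the append case: remaining free space on the old
    # page plus one new page, both already fillfactor-adjusted.
    return min(maxitems, freespace + maxdatasize)

def maxitems_split(maxitems, freespace, maxdatasize,
                   segment_max_size, min_tuples_per_segment):
    # Conservative lower bound for the non-append split case: count how
    # many full segments fit, each holding at least min_tuples_per_segment.
    nnewsegments = freespace // segment_max_size
    nnewsegments += maxdatasize // segment_max_size
    return min(maxitems, nnewsegments * min_tuples_per_segment)

# With BLCKSZ=1024 and fillfactor=0.2 the clamp keeps at least one segment,
# so nnewsegments can never end up zero:
assert max_data_size(1024, 0.2, segment_max_size=256) == 256
```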

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gin_fillfactor_6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension kNN support

2015-07-09 Thread Alexander Korotkov
Hi!

On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich stas.kelv...@gmail.com wrote:

 Patch is pretty ready, last issue was about changed extension interface,
 so there should be migration script and version bump.
 Attaching a version with all migration stuff.


I can't see cube--1.0--1.1.sql in the patch. Did you forget to include it?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-07-10 Thread Alexander Korotkov
On Fri, Jun 26, 2015 at 6:39 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jun 25, 2015 at 9:23 AM, Peter Eisentraut pete...@gmx.net wrote:
  On 6/22/15 1:37 PM, Robert Haas wrote:
  Currently, the only time we report a process as waiting is when it is
  waiting for a heavyweight lock.  I'd like to make that somewhat more
  fine-grained, by reporting the type of heavyweight lock it's awaiting
  (relation, relation extension, transaction, etc.).  Also, I'd like to
  report when we're waiting for a lwlock, and report either the specific
  fixed lwlock for which we are waiting, or else the type of lock (lock
  manager lock, buffer content lock, etc.) for locks of which there is
  more than one.  I'm less sure about this next part, but I think we
  might also want to report ourselves as waiting when we are doing an OS
  read or an OS write, because it's pretty common for people to think
  that a PostgreSQL bug is to blame when in fact it's the operating
  system that isn't servicing our I/O requests very quickly.
 
  Could that also cover waiting on network?

 Possibly.  My approach requires that the number of wait states be kept
 relatively small, ideally fitting in a single byte.  And it also
 requires that we insert pgstat_report_waiting() calls around the thing
 that is notionally blocking.  So, if there are a small number of
 places in the code where we do network I/O, we could stick those calls
 around those places, and this would work just fine.  But if a foreign
 data wrapper, or any other piece of code, does network I/O - or any
 other blocking operation - without calling pgstat_report_waiting(), we
 just won't know about it.


The idea of fitting wait information into a single byte, avoiding both
locking and atomic operations, is attractive.
But how far can we go with it?
Could a DBA draw any conclusion from querying pg_stat_activity once or twice?
In order to draw a conclusion about system load, one has to run a daemon or
background worker which continuously samples current wait events.
Sampling the current wait event at a high rate also imposes overhead on the
system, just as locking or atomic operations do.
Checking whether a backend is stuck isn't easy either. If you don't expose
how long the current wait event has been continuing, it's hard to distinguish
being stuck on a particular lock from high concurrency on that lock type.

I can propose the following:

1) Expose more information about the current wait to the user. For instance,
having the duration of the current wait event, the user can determine whether
a backend is stuck on a particular event without sampling.
2) Accumulate per-backend statistics about each wait event type: the number
of occurrences and the total duration. With these statistics the user can
identify system bottlenecks, again without sampling.

Number #2 will be provided as a separate patch.
Number #1 requires a different concurrency model. Ildus will extract it from
the waits monitoring patch shortly.
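A toy model of proposal #2 (hypothetical Python, just to show the bookkeeping, not an actual patch): each backend keeps, per wait event type, an occurrence counter and a total duration, and proposal #1's "how long has the current wait lasted" falls out of the same state:

```python
import time
from collections import defaultdict

class WaitStats:
    """Per-backend accumulator: occurrences and total duration per event type."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.totals = defaultdict(float)   # seconds
        self._current = None               # (event_type, start_time)

    def begin_wait(self, event_type):
        self._current = (event_type, time.monotonic())

    def end_wait(self):
        event_type, start = self._current
        self.counts[event_type] += 1
        self.totals[event_type] += time.monotonic() - start
        self._current = None

    def current_wait_duration(self):
        # Proposal #1: expose how long the current wait has lasted, so a
        # stuck backend is distinguishable from contention without sampling.
        if self._current is None:
            return None
        return time.monotonic() - self._current[1]

stats = WaitStats()
stats.begin_wait("LWLock:buffer_content")
stats.end_wait()
assert stats.counts["LWLock:buffer_content"] == 1
```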

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-27 Thread Alexander Korotkov
On Wed, Aug 26, 2015 at 7:20 PM, Alexander Korotkov 
a.korot...@postgrespro.ru wrote:

 On Wed, Aug 26, 2015 at 6:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
   OK. So, as we mentioned before, if we need to expose some of the AM
   parameters at the SQL level, then we need to write special functions
  which would
   call amhandler and expose them.
   Did we come to an agreement on this solution?

 I think we were agreed that we should write functions to expose whatever
 needs to be visible at SQL level.  I'm not sure that we had a consensus
 on exactly which things need to be visible.

 One thought here is that we might not want to just blindly duplicate
 the existing pg_am behavior anyway.  For example, the main use of the
 amstrategies column was to allow validation of pg_amop.amopstrategy
 entries --- but in 4 out of the 6 existing AMs, knowledge of the AM alone
 isn't sufficient information to determine the valid set of strategy
 numbers anyway.  So inventing a pg_amstrategies(am oid) function seems
 like it would be repeating a previous failure.  Not quite sure what to
 do instead, though.  We could imagine something like pg_amstrategies(am
 oid, opclass oid), but I don't see how to implement it without asking
 opclasses to provide a validation function, which maybe is a change we
 don't want to take on here.


 Could we add another function to the access method interface which would
 validate an opclass?
 The AM could validate not only strategies this way, but also support
 functions. For instance, in GIN we now require an opclass to specify at least
 one of consistent and triconsistent. ISTM it would be nice to let the access
 method check such conditions. Also, we would be able to check opclass
 correctness on creation. Currently one can create an opclass with missing
 support functions, which doesn't work.
 At the SQL level we can create a function which validates an opclass using
 this new method. This function can be used in regression tests.


Should I try to implement such new access method function, say 'amvalidate'?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-26 Thread Alexander Korotkov
On Wed, Aug 26, 2015 at 6:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov a.korot...@postgrespro.ru writes:
  OK. So, as we mentioned before, if we need to expose some of the AM
  parameters at the SQL level, then we need to write special functions
 which would
  call amhandler and expose them.
  Did we come to an agreement on this solution?

 I think we were agreed that we should write functions to expose whatever
 needs to be visible at SQL level.  I'm not sure that we had a consensus
 on exactly which things need to be visible.

 One thought here is that we might not want to just blindly duplicate
 the existing pg_am behavior anyway.  For example, the main use of the
 amstrategies column was to allow validation of pg_amop.amopstrategy
 entries --- but in 4 out of the 6 existing AMs, knowledge of the AM alone
 isn't sufficient information to determine the valid set of strategy
 numbers anyway.  So inventing a pg_amstrategies(am oid) function seems
 like it would be repeating a previous failure.  Not quite sure what to
 do instead, though.  We could imagine something like pg_amstrategies(am
 oid, opclass oid), but I don't see how to implement it without asking
 opclasses to provide a validation function, which maybe is a change we
 don't want to take on here.


Could we add another function to the access method interface which would
validate an opclass?
The AM could validate not only strategies this way, but also support
functions. For instance, in GIN we now require an opclass to specify at least
one of consistent and triconsistent. ISTM it would be nice to let the access
method check such conditions. Also, we would be able to check opclass
correctness on creation. Currently one can create an opclass with missing
support functions, which doesn't work.
At the SQL level we can create a function which validates an opclass using
this new method. This function can be used in regression tests.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-08-26 Thread Alexander Korotkov
On Tue, Aug 25, 2015 at 7:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Jim Nasby jim.na...@bluetreble.com writes:
  On 8/25/15 10:56 AM, Tom Lane wrote:
  I'm good with this as long as all the things that get stored in pg_am
  are things that pg_class.relam can legitimately reference.  If somebody
  proposed adding an access method kind that was not a relation access
  method, I'd probably push back on whether that should be in pg_am or
  someplace else.

  Would fields in pg_am be overloaded then?

 No, because the proposal was to reduce pg_am to just amname, amkind
 (which would be something like 'i' or 's'), and amhandler.  Everything
 specific to a particular type of access method would be shoved down to
 the level of the C APIs.


OK. So, as we mentioned before, if we need to expose some of the AM
parameters at the SQL level, then we need to write special functions which
would call amhandler and expose them.
Did we come to an agreement on this solution?

 From a SQL standpoint it'd be
  much nicer to have child tables, though that could potentially be faked
  with views.

 I've looked into having actual child tables in the system catalogs, and
 I'm afraid that the pain-to-reward ratio doesn't look very good.


Agreed. Teaching syscache about inheritance would be overengineering for this
problem.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Alexander Korotkov
On Mon, Aug 31, 2015 at 9:48 PM, Mason S <masonli...@gmail.com> wrote:

>
>>  We are also a bit disappointed by Huawei's position on the CSN patch; we
>> hoped to use it for our XTM.
>>
>
> Disappointed in what way? Moving to some sort of CSN approach seems to
> open things up for different future ideas. In the short term, it would mean
> replacing potentially large snapshots and longer visibility checks. In the
> long term, perhaps CSN could help simplify the design of multi-master
> replication schemes.
>

We are disappointed because at their PGCon talk Huawei announced that they
would publish their CSN patch and work further in this direction together
with the community. However, it still hasn't been published, despite all the
promises. Nobody from Huawei answers the CSN thread on -hackers.
So, I think we got nothing from Huawei except teasing and should rely only
on ourselves. That is disappointing.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Access method extendability

2015-08-31 Thread Alexander Korotkov
Hackers,

here is the next revision of the patches providing access method
extendability. It is now based on another patch, which reworks the access
method interface.
http://www.postgresql.org/message-id/capphfdsxwzmojm6dx+tjnpyk27kt4o7ri6x_4oswcbyu1rm...@mail.gmail.com

Besides the access method interface, the major change is the generic xlog
interface. The generic xlog interface is now more user-friendly: generic xlog
compares the initial and changed versions of a page by itself. The only thing
it can't do is find data moves inside a page, because that would have too
much overhead. So, in order to get compact WAL records, one should use
GenericXLogMemmove(dst, src, size) to move data inside a page. If this
function isn't used, the WAL records will just be less compact.

The general pattern of generic WAL usage is as follows.

1) Start using generic WAL: specify relation

GenericXLogStart(index);

2) Register buffers

GenericXLogRegister(0, buffer1, false);
GenericXLogRegister(1, buffer2, true);

The first argument is a slot number, the second is the buffer, and the third
is a flag indicating a new buffer.

3) Do changes in the pages. Use GenericXLogMemmove() if needed.

4) Finish using GenericXLogFinish(), or abort using GenericXLogAbort(). In
the case of abort, the initial state of the pages will be reverted.

Generic xlog takes care of the critical section, unlogged relations, setting
the LSN, and marking buffers dirty. User code stays simple and clear.
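The "compares the initial and changed versions of a page by itself" part can be modeled in a few lines (a hypothetical Python sketch of the idea, not the patch's C code): record only the byte runs that differ, and replay by patching the original image.

```python
def page_delta(old: bytes, new: bytes):
    """Compute a compact delta as (offset, replacement bytes) runs."""
    assert len(old) == len(new)
    delta, i = [], 0
    while i < len(old):
        if old[i] != new[i]:
            j = i
            while j < len(old) and old[j] != new[j]:
                j += 1                     # extend the differing run
            delta.append((i, new[i:j]))
            i = j
        else:
            i += 1
    return delta

def apply_delta(old: bytes, delta):
    """Replay: patch the original page image with the recorded runs."""
    page = bytearray(old)
    for off, data in delta:
        page[off:off + len(data)] = data
    return bytes(page)

old = b"AAAABBBBCCCC"
new = b"AAAAXXBBCCCC"
assert page_delta(old, new) == [(4, b"XX")]
assert apply_delta(old, page_delta(old, new)) == new
```

This also shows why a data move inside the page defeats a byte-wise diff: shifting data by even one byte makes almost every following byte differ, which is why the patch offers GenericXLogMemmove() for that case.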

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


create-am.2.patch.gz
Description: GNU Zip compressed data


generic-xlog.2.patch.gz
Description: GNU Zip compressed data


bloom-contrib.2.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Commitfest remaining "Needs Review" items

2015-08-31 Thread Alexander Korotkov
On Mon, Aug 31, 2015 at 4:31 PM, Andres Freund <and...@anarazel.de> wrote:

> On 2015-08-31 16:22:54 +0300, Alexander Korotkov wrote:
> > Is it correct to switch 2015-09 commitfest to inprogress now?
>
> Yea, isn't it only starting the 15th?


AFAICS, at the last developer meeting it was decided to start commitfests on
the 1st.
https://wiki.postgresql.org/wiki/PgCon_2015_Developer_Meeting#9.6_Schedule


> Can we add an option to display
> days in the CF app?
>

+1 for making these dates more explicit

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Commitfest remaining "Needs Review" items

2015-08-31 Thread Alexander Korotkov
On Mon, Aug 31, 2015 at 3:16 PM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Mon, Aug 31, 2015 at 9:13 PM, Magnus Hagander wrote:
> > Anyway - that CF is closed, but we seem to not have *any* CF open at this
> > point. Should we not make 2015-11 open?
>
> Yeah, switched.


Is it correct to switch the 2015-09 commitfest to "in progress" now?
I thought we should still accept patches into 2015-09, since it's 2015-08-31
now.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PROPOSAL] Effective storage of duplicates in B-tree index.

2015-09-01 Thread Alexander Korotkov
Hi, Tomas!

On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra <tomas.von...@2ndquadrant.com>
wrote:

> On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
>
>> I'm going to begin work on effective storage of duplicate keys in B-tree
>> index.
>> The main idea is to implement posting lists and posting trees for B-tree
>> index pages as it's already done for GIN.
>>
>> In a nutshell, effective storing of duplicates in GIN is organised as
>> follows.
>> Index stores single index tuple for each unique key. That index tuple
>> points to a posting list which contains pointers to heap tuples (TIDs). If
>> too many rows have the same key, multiple pages are allocated for the
>> TIDs, and these constitute a so-called posting tree.
>> You can find wonderful detailed descriptions in gin readme
>> <
>> https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README
>> >
>> and articles <http://www.cybertec.at/gin-just-an-index-type/>.
>> It also makes possible to apply compression algorithm to posting
>> list/tree and significantly decrease index size. Read more in
>> presentation (part 1)
>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
>>
>> Now new B-tree index tuple must be inserted for each table row that we
>> index.
>> It can possibly cause page split. Because of MVCC even unique index
>> could contain duplicates.
>> Storing duplicates in posting list/tree helps to avoid superfluous splits.
>>
>> So it seems to be very useful improvement. Of course it requires a lot
>> of changes in B-tree implementation, so I need approval from community.
>>
>
> In general, index size is often a serious issue - cases where indexes need
> more space than tables are not quite uncommon in my experience. So I think
> the efforts to lower space requirements for indexes are good.
>
> But if we introduce posting lists into btree indexes, how different are
> they from GIN? It seems to me that if I create a GIN index (using
> btree_gin), I do get mostly the same thing you propose, no?
>

Yes, in general GIN is a btree with effective duplicate handling plus support
for splitting single datums into multiple keys.
This proposal is mostly about porting duplicate handling from GIN to btree.
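The space saving can be illustrated with a toy model (hypothetical Python, just the idea of one index tuple per distinct key plus a TID list, not the real page layout):

```python
from collections import defaultdict

def naive_leaf(entries):
    """One (key, tid) index tuple per heap row, as in today's btree."""
    return [(key, tid) for key, tid in entries]

def posting_list_leaf(entries):
    """One index tuple per distinct key, pointing to a posting list of TIDs,
    as in GIN."""
    lists = defaultdict(list)
    for key, tid in entries:
        lists[key].append(tid)
    return {key: sorted(tids) for key, tids in lists.items()}

# 100 duplicates of "red" plus one "blue":
rows = [("red", (1, i)) for i in range(100)] + [("blue", (2, 1))]
assert len(naive_leaf(rows)) == 101          # 101 index tuples
assert len(posting_list_leaf(rows)) == 2     # 2 keys; duplicates share a list
```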

Sure, there are differences - GIN indexes don't handle UNIQUE indexes,


The difference between btree_gin and btree is not only the UNIQUE feature.
1) There is no gingettuple in GIN: GIN supports only bitmap scans. And it's
not feasible to add gingettuple to GIN, at least not with the same semantics
as in btree.
2) GIN doesn't support multicolumn indexes the way btree does. A multicolumn
GIN is more like a set of separate single-column GINs: it doesn't have
composite keys.
3) btree_gin can't handle range searches effectively. "a < x < b" would be
handled as "a < x" intersected with "x < b", which is extremely inefficient.
It is possible to fix, but there is no clear proposal yet for how to fit this
case into the GIN interface.


> but the compression can only be effective when there are duplicate rows.
> So either the index is not UNIQUE (so the b-tree feature is not needed), or
> there are many updates.
>

From my observations, users can use btree_gin only in some cases. They like
the compression, but mostly can't use btree_gin because of #1.

Which brings me to the other benefit of btree indexes - they are designed
> for high concurrency. How much is this going to be affected by introducing
> the posting lists?
>

I'd note that the current handling of duplicates in PostgreSQL is a hack over
the original btree. It is designed that way in PostgreSQL's btree access
method, not in btree in general.
Posting lists shouldn't change concurrency much. Currently, in btree you
have to lock one page exclusively when you're inserting a new value.
When a posting list is small and fits on one page, you have to do a similar
thing: exclusively lock one page to insert the new value.
When you have a posting tree, you have to take an exclusive lock on one page
of the posting tree.

One could say that concurrency would become worse because the index would
become smaller and the number of pages would become smaller too. Since the
number of pages is smaller, backends are more likely to contend for the same
page. But this argument can be used against any compression and for any
bloat.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-07 Thread Alexander Korotkov
On Thu, Aug 20, 2015 at 9:57 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Wed, Jul 22, 2015 at 4:28 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> On Wed, Jul 22, 2015 at 8:48 AM, Michael Paquier <
>> michael.paqu...@gmail.com> wrote
>>
>>> On Mon, Jul 20, 2015 at 9:18 PM, Alexander Korotkov
>>> <a.korot...@postgrespro.ru> wrote:
>>> > attached patch allows pg_rewind to work when target timeline was
>>> switched.
>>> > Actually, this patch fixes TODO from pg_rewind comments.
>>> >
>>> >   /*
>>> >* Trace the history backwards, until we hit the target timeline.
>>> >*
>>> >* TODO: This assumes that there are no timeline switches on the
>>> target
>>> >* cluster after the fork.
>>> >*/
>>> >
>>> > This patch allows pg_rewind to handle data directory synchronization
>>> > in a much more general way.
>>> > For instance, the user can return a promoted standby to the old master.
>>
>>
> +   /*
> +* Since incomplete segments are copied into next
> timelines, find the
> +* latest timeline holding the required segment.
> +*/
> +   while (private->tliIndex < targetNentries - 1 &&
> +  targetHistory[private->tliIndex].end <
> targetSegEnd)
> +   {
> +   private->tliIndex++;
> +   tli_index++;
> +   }
> It seems to me that the patch is able to handle timeline switches onwards,
> but not backwards and this is what would be required to return a promoted
> standby, that got switched to let's say timeline 2 when promoted, to an old
> master, that is still on timeline 1. This code actually fails when scanning
> for the last checkpoint before WAL forked as it will be on the timeline 1
> of the old master. Imagine for example that the WAL has forked at
> 0/30X which is saved in segment 00020003 (say 2/0/3) on
> the promoted standby, and that the last checkpoint record is on 0/20X,
> which is part of 00010002 (1/0/2). I think that we should
> scan 2/0/3 (not the partial segment 1/0/3), and then 1/0/2 when looking for
> the last checkpoint record. Hence the startup index TLI should be set to
> the highest timeline and should be decremented depending on what is in the
> history file.
> The code above looks correct to me when scanning the WAL history onwards
> though, which is what is done when extracting the page map, but not
> backwards when we try to find the last common checkpoint record. This code
> actually fails trying to open 2/0/2 that does not exist in the promoted
> standby's pg_xlog in my test case.
>
> Attached is a small script I have used to reproduce the failure.
>

Right, thanks! It should be fixed in the attached version of the patch.
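The onwards scan quoted above (the while loop over targetHistory) can be modeled like this (a hypothetical Python sketch of the segment-to-timeline selection, not the patch code):

```python
def tli_for_segment(history, tli_index, target_seg_end):
    """Walk the timeline history onwards to find the latest timeline that
    holds the required segment (incomplete segments are copied into the
    next timeline, so a later timeline wins)."""
    while (tli_index < len(history) - 1 and
           history[tli_index]["end"] < target_seg_end):
        tli_index += 1
    return tli_index

# Hypothetical history: timeline 1 ends at 0x300; timeline 2 is current
# (no end yet, modeled as infinity).
history = [{"tli": 1, "end": 0x300},
           {"tli": 2, "end": float("inf")}]

# A segment extending to 0x400 must be read from timeline 2 (segment
# 2/0/3 rather than the partial 1/0/3); one ending by 0x200 is on tli 1.
assert history[tli_for_segment(history, 0, 0x400)]["tli"] == 2
assert history[tli_for_segment(history, 0, 0x200)]["tli"] == 1
```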

I think that the documentation needs a brush up as well to outline the fact
> that pg_rewind would be able to put back as well a standby in a cluster,
> after for example an operator mistake when promoting a node that should not
> be.
>

Yes. But from the current pg_rewind documentation I can't figure out that it
can't do so. However, we can add an explicit statement claiming that it can.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-3.patch
Description: Binary data



Re: [HACKERS] WIP: Access method extendability

2015-09-07 Thread Alexander Korotkov
On Tue, Sep 1, 2015 at 6:50 PM, Teodor Sigaev <teo...@sigaev.ru> wrote:

> The general pattern of generic WAL usage is as follows.
>>
>> 1) Start using generic WAL: specify relation
>>
>
> M-m, what about extensions which want to use WAL but whose WAL records
> aren't connected to any relation? For example, a transaction manager or a
> kind of FDW.
>
>
>> GenericXLogStart(index);
>>
>> 2) Register buffers
>>
>> GenericXLogRegister(0, buffer1, false);
>> GenericXLogRegister(1, buffer2, true);
>>
>> first argument is a slot number, second is the buffer, third is flag
>> indicating
>> new buffer
>>
>
> Why do we need a slot number? to replace already registered buffer?


For further simplification, the slot number could be omitted. In the attached
patches, generic xlog replay applies changes in the same order in which
GenericXLogRegister() was called.
The patches were rebased against the latest version of the AM interface
rework patch.
http://www.postgresql.org/message-id/CAPpHfduGY=KZSBPZN5+USTXev-9M2PAUp3Yi=syfdo2n244...@mail.gmail.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


create-am.3.patch.gz
Description: GNU Zip compressed data


generic-xlog.3.patch.gz
Description: GNU Zip compressed data


bloom-contrib.3.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] WIP: Rework access method interface

2015-09-07 Thread Alexander Korotkov
On Mon, Sep 7, 2015 at 10:02 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:

> On 2015-09-07 20:56, Alexander Korotkov wrote:
>
>> On Mon, Sep 7, 2015 at 9:17 PM, Petr Jelinek <p...@2ndquadrant.com
>>
>> However I don't like the naming differences between validate_opclass
>> and amvalidate. If you expect that the current amvalidate will only
>> be used for opclass validation then it should be renamed accordingly.
>>
>>
>> I'm not yet sure if we need separate validation of opfamilies.
>>
>>
> Well either the amvalidate or the validate_opclass should be renamed IMHO,
> depending on which way the checking goes (one interface for everything with
> generic name or multiple interfaces for multiple validations).


Yes, I agree with you about the naming.
I'm not sure about separate validation of opfamilies, independent of its
naming. I'd like to get any arguments/advice about it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Waits monitoring

2015-09-07 Thread Alexander Korotkov
On Mon, Sep 7, 2015 at 6:28 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> On Sun, Sep 6, 2015 at 5:58 PM, Andres Freund <and...@anarazel.de> wrote:
> >
> > On 2015-09-04 23:44:21 +0100, Simon Riggs wrote:
> > > I see the need for both current wait information and for cumulative
> > > historical detail.
> > >
> > > I'm willing to wait before reviewing this, but not for more than 1
> more CF.
> > >
> > > Andres, please decide whether we should punt to next CF now, based upon
> > > other developments. Thanks
> >
> > I think we can do some of the work concurrently - the whole lwlock
> > infrastructure piece is rather independent and what currently most of
> > the arguments are about. I agree that the actual interface will need to
> > be coordinated.
> >
> > Ildus, could you please review Amit & Robert's patch?
> >
>
> Are you talking about the patch where I have fixed a few issues in Robert's
> patch [1], or the original patch [3] written by me?
>

AFAICS, Andres meant [3].

[1] -
> http://www.postgresql.org/message-id/caa4ek1kdec1tm5ya9gkv85vtn4qqsrxzkjru-tu70g_tl1x...@mail.gmail.com
> [2] -
> http://www.postgresql.org/message-id/3f71da37-a17b-4961-9908-016e6323e...@postgrespro.ru
> [3] -
> http://www.postgresql.org/message-id/caa4ek1kt2e6xhvigisr5o1ac9nfo0j2wtb8n0ggd1_jklde...@mail.gmail.com
>

For me, the major issues with [3] are:
1) We fit all information about the current wait event into a single byte.
2) We put this byte into PgBackendStatus.
Both of these issues are essential to the design of [3].

Issue #1 means that we can actually expose to the user only the event type
(as long as there are fewer than 256 types), with no parameters [4].
Issue #2 means that we can monitor only backends, not auxiliary processes
[5].

Both of these issues seem to be serious limitations to me. It makes me
think we either need a different design or need to additionally expose the
current wait event in some other way. Anyway, I think attracting more
attention to this problem is good. It would be very nice if Simon or Andres
could review this.

[4] -
http://www.postgresql.org/message-id/CAPpHfdtdF8LyR0zBA_tzAwYq00GFZyVbh_XfFAABRQQ=mbn...@mail.gmail.com
[5] - http://www.postgresql.org/message-id/55c3306b.5010...@postgrespro.ru

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-09-07 Thread Alexander Korotkov
On Mon, Sep 7, 2015 at 9:17 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:

> On 2015-09-04 16:26, Alexander Korotkov wrote:
>
>>
>> The attached patch implements this. It doesn't pretend to be a fully
>> correct implementation, but it should be enough as a proof of concept.
>> In this patch the access method exposes another function: amvalidate. It
>> takes a data structure representing an opclass and throws an error if it
>> finds it invalid.
>> This method is used on new opclass definition (alter operator family
>> etc. are not yet implemented, but planned). Also, there is an SQL function
>> validate_opclass(oid) which is used in regression tests.
>> Any thoughts?
>>
>>
> This is starting to look good
>

Thanks!


> However I don't like the naming differences between validate_opclass and
> amvalidate. If you expect that the current amvalidate will only be used for
> opclass validation then it should be renamed accordingly.
>

I'm not yet sure if we need separate validation of opfamilies.

Also GetIndexAmRoutine should check the return type of the amhandler.


Will be fixed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-08 Thread Alexander Korotkov
On Tue, Sep 8, 2015 at 10:28 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Tue, Sep 8, 2015 at 1:14 AM, Alexander Korotkov wrote:
> > On Thu, Aug 20, 2015 at 9:57 AM, Michael Paquier wrote:
> >> The code above looks correct to me when scanning the WAL history onwards
> >> though, which is what is done when extracting the page map, but not
> >> backwards when we try to find the last common checkpoint record. This
> code
> >> actually fails trying to open 2/0/2 that does not exist in the promoted
> >> standby's pg_xlog in my test case.
> >>
> >> Attached is a small script I have used to reproduce the failure.
> >
> >
> > Right, thanks! It should be fixed in the attached version of patch.
>
> So, instead of a code review, I have been torturing your patch and did
> advanced tests on it with the attached script, that creates a cluster
> as follows:
>  master (5432)
>   /\
>  1 (5433)   2 (5434)
>  |
>  3 (5435)
> Once cluster is created, nodes are promoted in a certain order giving
> them different timeline jump properties:
> - master, stays on tli 1
> - standby 1, tli 1->2
> - standby 2, tli 1->3
> - standby 3, tli 1->2->4
> And data is inserted on each of them to make WAL fork properly.
> Finally the script tries to rewind one node using another node as
> source, and then tries to link this target node back to the source
> node via streaming replication.
>
> I have tested all the one-one permutations possible in the structure
> above (see commented portions at the bottom of my script), and all of
> them worked. I have to say that from the testing prospective this
> patch looks in great shape, and will greatly improve the use cases of
> pg_rewind!
>

Great! Many thanks for your testing!

> I am planning to do as well a detailed code review rather soon.

Good, thanks.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Waits monitoring

2015-09-09 Thread Alexander Korotkov
On Tue, Sep 8, 2015 at 8:01 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Mon, Sep 7, 2015 at 7:33 AM, Ildus Kurbangaliev
> <i.kurbangal...@postgrespro.ru> wrote:
> >> > Ildus, could you please review Amit & Robert's patch?
> >> [1] -
> >>
> http://www.postgresql.org/message-id/caa4ek1kdec1tm5ya9gkv85vtn4qqsrxzkjru-tu70g_tl1x...@mail.gmail.com
> > About [1] and [2]. They are slightly conflicting, but if [1] is
> > going to be committed I can easily use it in [2]. And it will simplify
> > my patch, so I have no objections here. And the same time [3] can be
> > significantly simplified and improved on top of [1] and [2].
>
> Great - let's try to deal with [1] first, then.
>
> Does anyone wish to object to me committing that part?
>

No objections from me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-09 Thread Alexander Korotkov
On Wed, Sep 9, 2015 at 9:01 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

>
>
> On Wed, Sep 9, 2015 at 3:27 AM, Alexander Korotkov
> <a.korot...@postgrespro.ru> wrote:
>
>> On Tue, Sep 8, 2015 at 10:28 AM, Michael Paquier
>> <michael.paqu...@gmail.com>wrote:
>>
>>> I am planning to do as well a detailed code review rather soon.
>>>
>>
>> Good, thanks.
>>
>
> When testing a bit more complex structures, it occurred to me that we may
> want as well to use as a source node a standby. For example based on the
> case of my cluster above:
>  master (5432)
>   /  \
>  1 (5433)   2 (5434)
>  |
>  3 (5435)
> If I try for example to rebuild the cluster as follows there will be
> failures:
> 1) Rewind with source = 3, target = 2
> 2) Start 3 and 2
> 3) Shutdown 2
> 3) Rewind source = 2, target = 1, failure with:
> source data directory must be shut down cleanly
>
> It seems to me that we should allow a node that has been shutdowned in
> recovery to be used as a source for rewind as well, as follows:
> -   if (datadir_source && ControlFile_source.state != DB_SHUTDOWNED)
> +   if (datadir_source &&
> +   ControlFile_source.state != DB_SHUTDOWNED &&
> +   ControlFile_source.state != DB_SHUTDOWNED_IN_RECOVERY)
> pg_fatal("source data directory must be shut down
> cleanly\n");
> At least your patch justifies in my eyes such a change.
>

The source is not required to be out of recovery when we specify it by
connection string. This is why I think DB_SHUTDOWNED_IN_RECOVERY is OK when
we specify the source by data directory. Included in the attached patch.

> /*
> + * Find minimum from two xlog pointers assuming invalid pointer is
> greatest
> + * possible pointer.
> + */
> +static XLogRecPtr
> +xlPtrMin(XLogRecPtr a, XLogRecPtr b)
> +{
> +   if (XLogRecPtrIsInvalid(a))
> +   return b;
> +   else if (XLogRecPtrIsInvalid(b))
> +   return a;
> +   else
> +   return Min(a, b);
> +}
> This is true as timeline.h tells so, still I think that it would be better
> to mention that this is this assumption is held in this header file, or
> simply that timeline history entries at the top have their end position set
> as InvalidXLogRecPtr which is a synonym of infinity.
>

Agree. Comment is adjusted.


> The code building the target history file is a duplicate of what is done
> with the source. Perhaps we could refactor that as a single routine in
> pg_rewind.c?
>

Do you mean the duplication between rewind_parseTimeLineHistory()
and readTimeLineHistory()? We could move readTimeLineHistory() into common
code with some refactoring. That is a subject for a separate patch, though.

> Except that, the patch looks pretty neat to me. I was wondering as well:
> what tests did you run up to now with this patch? I am attaching an updated
> version of my test script I used for some more complex scenarios. Feel free
> to play with it.
>

I was running manual tests like a noob :) When you attached your bash
script, I switched to it.

I also added an additional check in findCommonAncestorTimeline(). Two standbys
could be independently promoted and get the same new timeline ID. Now it's
checked that timelines we assume to be the same have equal begin positions.
The begin positions could still match by coincidence, but the same risk
exists in the original pg_rewind.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Mon, Sep 14, 2015 at 2:12 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Mon, Sep 14, 2015 at 2:25 PM, Alexander Korotkov <aekorot...@gmail.com>
> wrote:
>
>> On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>
>>> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <
>>> i.kurbangal...@postgrespro.ru> wrote:
>>> >
>>> > On 08/05/2015 09:33 PM, Robert Haas wrote:
>>> >>
>>> >>
>>> >> You're missing the point.  Those multi-byte fields have additional
>>> >> synchronization requirements, as I explained in some detail in my
>>> >> previous email. You can't just wave that away.
>>> >
>>> > I see that now. Thank you for the point.
>>> >
>>> > I've looked deeper and I found PgBackendStatus to be not a suitable
>>> > place for keeping information about low level waits. Really,
>>> PgBackendStatus
>>> > is used to track high level information about backend. This is why
>>> auxiliary
>>> > processes don't have PgBackendStatus, because they don't have such
>>> information
>>> > to expose. But when we come to the low level wait events then auxiliary
>>> > processes are as useful for monitoring as backends are. WAL writer,
>>> > checkpointer, bgwriter etc are using LWLocks as well. This is
>>> certainly unclear
>>> > why they can't be monitored.
>>> >
>>>
>>> I think the chances of background processes stuck in LWLock is quite less
>>> as compare to backends as they do the activities periodically.  As an
>>> example
>>> WALWriter will take WALWriteLock to write the WAL, but actually there
>>> will never
>>> be any much contention for WALWriter. In synchronous_commit = on, the
>>> backends themselves write the WAL so WALWriter won't do much in that
>>> case and for synchronous_commit = off, backends won't write the WAL so
>>> WALWriter won't face any contention unless some buffers have to be
>>> written
>>> by bgwriter or checkpoint for which WAL is not flushed which I don't
>>> think
>>> would lead to any contention.
>>>
>>
>> Hmm, synchronous_commit is per session variable: some transactions could
>> run with synchronous_commit = on, but some with synchronous_commit = off.
>> This is very popular feature of PostgreSQL: achieve better performance by
>> making non-critical transaction asynchronous while leaving critical
>> transactions synchronous. Thus, contention for WALWriteLock between
>> backends and WALWriter could be real.
>>
>>
> I think it is difficult to say that can lead to contention due to periodic
> nature of WALWriter, but I don't deny that there is chance for
> background processes to have contention.
>

We don't know in advance whether there could be contention. This is why we
need monitoring.


> I am not denying from the fact that there could be some contention in rare
>>> scenarios for background processes, but I think tracking them is not as
>>> important as tracking the LWLocks for backends.
>>>
>>
>> I would be more careful in calling some of scenarios rare. As DBMS
>> developers we should do our best to evade contention for LWLocks: any
>> contention, not only between backends and background processes.  One may
>> assume that high LWLock contention is rare scenario in general. Once we're
>> here we doesn't think so, though.
>> You claims that there couldn't be contention for WALWriteLock between
>> backends and WALWriter. This is unclear for me: I think it could be.
>>
>
> I think there would be more things where background processes could wait
> than LWLocks and I think they are important to track, but could be done
> separately
> from tracking them for pg_stat_activity.  Example, we have a
> pg_stat_bgwriter
> view, can't we think of tracking bgwriter/checkpointer wait information in
> that
> view and similarly for other background processes we can track in other
> views
> if any related view exists or create a new one to track for all background
> processes.
>
>
>> Nobody opposes tracking wait events for backends and tracking them for
>> background processes. I think we need to track both in order to provide
>> full picture to DBA.
>>
>>
> Sure, that is good to do, but can't we do it separately in another patch.
> I think in this patch lets just work for wait_events for backends.
>

Yes, but I think we should have a design of tracking wait event for

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Mon, Sep 14, 2015 at 2:25 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Mon, Sep 14, 2015 at 3:02 PM, Alexander Korotkov <aekorot...@gmail.com>
> wrote:
>
>> On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>
>>> On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov <
>>> aekorot...@gmail.com> wrote:
>>> >
>>> > On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <
>>> i.kurbangal...@postgrespro.ru> wrote:
>>> >>
>>> >>
>>> >> I've looked deeper and I found PgBackendStatus to be not a suitable
>>> >> place for keeping information about low level waits. Really,
>>> PgBackendStatus
>>> >> is used to track high level information about backend. This is why
>>> auxiliary
>>> >> processes don't have PgBackendStatus, because they don't have such
>>> information
>>> >> to expose. But when we come to the low level wait events then
>>> auxiliary
>>> >> processes are as useful for monitoring as backends are. WAL writer,
>>> >> checkpointer, bgwriter etc are using LWLocks as well. This is
>>> certainly unclear
>>> >> why they can't be monitored.
>>> >>
>>> >> This is why I think we shoudn't place wait event into
>>> PgBackendStatus. It
>>> >> could be placed into PGPROC or even separate data structure with
>>> different
>>> >> concurrency model which would be most suitable for monitoring.
>>> >
>>> >
>>> > +1 for tracking wait events not only for backends
>>> >
>>> > Ildus, could you do following?
>>> > 1) Extract LWLocks refactoring into separate patch.
>>> > 2) Make a patch with storing current wait event information in PGPROC.
>>> >
>>>
>>> Now as Robert has committed standardization of lwlock names in
>>> commit - aa65de04, let us try to summarize and work on remaining parts
>>> of the patch. So I think now the next set of work is as follows:
>>>
>>> 1. Modify the tranche mechanism so that information about LWLocks
>>> can be tracked easily.  For this already there is some discussion, ideas
>>> and initial patch is floated in this thread and there doesn't seem to be
>>> much
>>> conflicts, so we can write the patch for it.  I am planning to write or
>>> modify
>>> the existing patch unless you, IIdus or anyone has objections or want to
>>> write it, please let me know to avoid duplication of work.
>>>
>>> 2. Track wait_event in pg_stat_activity.  Now here the main point where
>>> we doesn't seem to be in complete agreement is that shall we keep it
>>> as one byte in PgBackendStatus or use two or more bytes to store
>>> wait_event information and still there is another point made by you to
>>> store it in PGProc rather than in PgBackendStatus so that we can display
>>> information of background processes as well.
>>>
>>> Now as a matter of consensus, I think Tom has raised a fair point [1]
>>> against
>>> storing this information in PGProc and I feel that it is relatively less
>>> important to have such information about background processes and the
>>> reason for same is explained upthread [2].  About having more than
>>> one-byte
>>> to store information about various wait_events, I think now we will not
>>> have
>>> more than 100 or so events to track, do you really think that anytime in
>>> forseeable
>>> future we will have more than 256 important events which we would like
>>> to track?
>>>
>>> So I think about this lets first try to build the consensus and then
>>> attempt to
>>> write or modify the patch.
>>>
>>>
>>> [1] - http://www.postgresql.org/message-id/4067.1439561...@sss.pgh.pa.us
>>> [2] -
>>> http://www.postgresql.org/message-id/CAA4eK1JRQGTgRMRao6H65rA=fhifucnqhx7ongy58um6rn1...@mail.gmail.com
>>>
>>
>> In order to build the consensus we need the roadmap for waits monitoring.
>> Would single byte in PgBackendStatus be the only way for tracking wait
>> events? Could we have pluggable infrastructure in waits monitoring: for
>> instance, hooks for wait event begin and end?
>> Limit of 256 wait events is probably OK. But we have no room for any
>> additional parameters of wait event. For instance, if you notice high
>> contention for buffer content lock th

Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-16 Thread Alexander Korotkov
On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
> > On Wed, Sep 9, 2015 at 9:01 AM, Michael Paquier wrote:
> >> The code building the target history file is a duplicate of what is done
> >> with the source. Perhaps we could refactor that as a single routine in
> >> pg_rewind.c?
> >
> >
> > Do you mean duplication between rewind_parseTimeLineHistory() and
> > readTimeLineHistory(). We could put readTimeLineHistory() into common
> with
> > some refactoring. This is the subject of separate patch, though.
>
> Well, there is that :) But I was referring to this part (beware of
> useless newlines in your code btw):
> +   if (targettli == 1)
> {
> -   TimeLineHistoryEntry *entry = [i];
> +   targetHistory = (TimeLineHistoryEntry *)
> pg_malloc(sizeof(TimeLineHistoryEntry));
> +   targetHistory->tli = targettli;
> +   targetHistory->begin = targetHistory->end =
> InvalidXLogRecPtr;
> +   targetNentries = 1;
> +
> +   }
> Your patch generates a timeline history array for the target node, and
> HEAD does exactly the same for the source node, so IMO your patch is
> duplicating code, the only difference being that fetchFile is used for
> the source history file and slurpFile is used for the target history
> file. Up to now the code duplication caused by the difference that the
> target is always a local instance, and the source may be either local
> or remote has created the interface that we have now, but I think that
> we should refactor fetch.c such as the routines it contains do not
> rely anymore on datadir_source, and use instead a datadir argument. If
> datadir is NULL, the remote code path is used. If it is not NULL,
> local code path is used. This way we can make the target node use
> fetchFile only, and remove the duplication your patch is adding, as
> well as future ones like that. Thoughts?
>

OK, I get it. I tried to get rid of the code duplication in the attached patch.
There is a getTimelineHistory() function which takes the control file as an
argument and fetches the history from the appropriate path.


> >> Except that, the patch looks pretty neat to me. I was wondering as well:
> >> what tests did you run up to now with this patch? I am attaching an
> updated
> >> version of my test script I used for some more complex scenarios. Feel
> free
> >> to play with it.
> >
> > I was running manual tests like a noob :) When you attached you bash
> script,
> > I've switched to it.
>
> [product_placement]The patch to improve the recovery test suite
> submitted in this CF would have eased the need of bash-ing test cases
> here, and we could have test cases directly included in your
> patch.[/product_placement]
>
> > A also added additional check in findCommonAncestorTimeline(). Two
> standbys
> > could be independently promoted and get the same new timeline ID. Now,
> it's
> > checked that timelines, that we assume to be the same, have equal begins.
> > Begins could still match by coincidence. But the same risk exists in
> > original pg_rewind, though.
>
> Really? pg_rewind blocks attempts to rewind two nodes that are already
> on the same timeline before checking if their timeline history map at
> some point or not:
> /*
>  * If both clusters are already on the same timeline, there's
> nothing to
>  * do.
>  */
> if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
> ControlFile_source.checkPointCopy.ThisTimeLineID)
> pg_fatal("source and target cluster are on the same
> timeline\n");
> And this seems really justified to me. Imagine that you have one
> master, with two standbys linked to it. If both standbys are promoted
> to the same timeline, you could easily replug them to the master, but
> I fail to see the point to be able to replug one promoted standby with
> the other in this case: those nodes have segment and history files
> that map with each other, an operator could easily mess up things in
> such a configuration.
>

Imagine the following server configuration.
  1
 / \
2   3
|
4

Initially, node 1 is the master, and nodes 2 and 3 are standbys of node 1.
Node 4 is a cascaded standby of node 3.
Then nodes 2 and 3 are promoted; they both get timeline number 2. Then
node 4 is promoted and gets timeline number 3.
Then we could try to rewind node 4 using node 2 as the source. How could
pg_rewind know that timeline number 2 is not the same timeline on those nodes?
We can only check that those timelines forked from timeline 1 at the
same place, but a coincidence is still possible.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] statistics for array types

2015-09-16 Thread Alexander Korotkov
On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:

> On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
>
>> Hi,
>>
>> On 08/11/2015 04:38 PM, Jeff Janes wrote:
>>
>>> When reviewing some recent patches, I decided the statistics gathered
>>>  for arrays had some pre-existing shortcomings.
>>>
>>> The main one is that when the arrays contain rare elements there is
>>> no histogram to fall back upon when the MCE array is empty, the way
>>> there is for scalar stats.  So it has to punt completely and resort
>>> to saying that it is 0.5% selectivity without recourse to any data at
>>> all.
>>>
>>> The rationale for applying the threshold before things are eligible
>>> for inclusion in the MCE array seems to be that this puts some
>>> theoretical bound on the amount of error we are likely to have in
>>> that element.  But I think it is better to exceed that theoretical
>>> bound than it is to have no data at all.
>>>
>>> The attached patch forces there to be at least one element in MCE,
>>> keeping the one element with the highest predicted frequency if the
>>> MCE would otherwise be empty.  Then any other element queried for is
>>> assumed to be no more common than this most common element.
>>>
>>
>> We only really need the frequency, right? So do we really need to keep
>> the actual MCV element? I.e. most_common_elem_freqs does not have the
>> same number of values as most_common_elems anyway:
>>
>>   A list of the frequencies of the most common element values, i.e., the
>>   fraction of rows containing at least one instance of the given value.
>>   Two or three additional values follow the per-element frequencies;
>>   these are the minimum and maximum of the preceding per-element
>>   frequencies, and optionally the frequency of null elements.
>>   (Null when most_common_elems is.)
>>
>> So we might modify it so that it's always defined - either it tracks the
>> same values as today (when most_common_elems is defined), or the
>> frequency of the most common element (when most_common_elems is NULL).
>>
>
> I had also considered that.  It requires more changes to make it happen,
> and it seems to create a more complex contract on what those columns mean,
> but without giving a corresponding benefit.
>
>
>>
>> This way we can keep the current theoretical error-bound on the MCE
>> frequencies, and if that's not possible we can have at least the new
>> value without confusing existing code.
>
>
> But if the frequency of the most common element was grossly wrongly, then
> whatever value we stick in there is still going to be grossly wrong.
> Removing the value associated with it isn't going to stop it from being
> wrong.  When we do query with the (incorrectly thought) first most common
> element, either it will find and use the wrong value from slot 1, or it
> will find nothing and fall back on the same wrong value from slot 3.
>

Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we
collect no MCEs. Moreover, I think we should store it even when num_mcelem
>= track_len and we haven't cut any of the MCEs we found. In this case we can
get a more precise estimate for a rare element using the knowledge that all
MCEs which exceed the threshold are present (assuming their frequencies could
be much higher than the threshold).

When there are no MCEs, we should use the assumption that there are no
elements more frequent than cutoff_freq / nonnull_cnt. Using lower values
wouldn't be statistically correct.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-16 Thread Alexander Korotkov
On Mon, Sep 14, 2015 at 3:03 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Mon, Sep 14, 2015 at 5:32 AM, Alexander Korotkov
> <aekorot...@gmail.com> wrote:
> > In order to build the consensus we need the roadmap for waits monitoring.
> > Would single byte in PgBackendStatus be the only way for tracking wait
> > events? Could we have pluggable infrastructure in waits monitoring: for
> > instance, hooks for wait event begin and end?
>
> No, it's not the only way of doing it.  I proposed doing that way
> because it's simple and cheap, but I'm not hell-bent on it.  My basic
> concern here is about the cost of this.  I think that the most data we
> can report without some kind of synchronization protocol is one 4-byte
> integer.  If we want to report anything more than that, we're going to
> need something like the st_changecount protocol, or a lock, and that's
> going to add very significantly - and in my view unacceptably - to the
> cost.  I care very much about having this facility be something that
> we can use in lots of places, even extremely frequent operations like
> buffer reads and contended lwlock acquisition.
>

Yes, the major question is cost. But I think we should validate our
thoughts by experiments, given that there is more than one possible
synchronization protocol. Ildus posted an implementation of the
double-buffering approach that showed quite low cost.

> I think that there may be some *kinds of waits* for which it's
> practical to report additional detail.  For example, suppose that when
> a heavyweight lock wait first happens, we just report the lock type
> (relation, tuple, etc.) but then when the deadlock detector expires,
> if we're still waiting, we report the entire lock tag.  Well, that's
> going to happen infrequently enough, and is expensive enough anyway,
> that the cost doesn't matter.  But if, every time we read a disk
> block, we take a lock (or bump a changecount and do a write barrier),
> dump the whole block tag in there, release the lock (or do another
> write barrier and bump the changecount again) that sounds kind of
> expensive to me.  Maybe we can prove that it doesn't matter on any
> workload, but I doubt it.  We're fighting for every cycle in some of
> these code paths, and there's good evidence that we're burning too
> many of them compared to competing products already.
>

Yes, but some competing products also provide comprehensive waits
monitoring. That makes me think it should be possible for us as well.

I am not a big fan of hooks as a way of resolving disagreements about
> the design.  We may find that there are places where it's useful to
> have hooks so that different extensions can do different things, and
> that is fine.  But we shouldn't use that as a way of punting the
> difficult questions.  There isn't enough common understanding here of
> what we're all trying to get done and why we're trying to do it in
> particular ways rather than in other ways to jump to the conclusion
> that a hook is the right answer.  I'd prefer to have a nice, built-in
> system that everyone agrees represents a good set of trade-offs than
> an extensible system.
>

I think the reason for hooks could be not only disagreements about design,
but platform-dependent issues too.
The next step after we have a view with current wait events will be gathering
statistics on them. We can contrast at least two approaches here:
1) Periodic sampling of current wait events.
2) Measuring each wait event's duration. We could collect statistics for a
short period locally and update the shared memory structure periodically
(using some synchronization protocol).

In the previous attempt to gather lwlock statistics, you predicted that
sampling could have a significant overhead [1]. In contrast, on many
systems time measurements are cheap. We have implemented both approaches,
and it shows that sampling every 1 millisecond produces higher overhead
than individual duration measurements for each wait event. We can share
another version of waits monitoring based on sampling to make these results
reproducible for everybody. However, cheap time measurements are not available
on every platform. For instance, ISTM that on Windows time measurements
are too expensive [2].

That makes me think that we need a pluggable solution, at least for
statistics: direct measurement of event durations for the majority of systems,
and sampling for the others as the least harm.

> I think it's reasonable to consider reporting this data in the PGPROC
> using a 4-byte integer rather than reporting it through a singe byte
> in the backend status structure.  I believe that addresses the
> concerns about reporting from auxiliary processes, and it also allows
> a little more data to be reported.  For anything in excess of that, I
> think we should think rather harder.  Most likely, such addition

[HACKERS] LW_SHARED_MASK macro

2015-09-17 Thread Alexander Korotkov
Hackers,

While exploring lwlock.c, I found the following macro to be strange.

#define LW_SHARED_MASK ((uint32)(1 << 23))

This macro is used to extract the number of shared locks from the state.

ereport(LOG,
(errhidestmt(true),
errhidecontext(true),
errmsg("%d: %s(%s): excl %u shared %u haswaiters %u waiters %u rOK %d",
MyProcPid,
where, MainLWLockNames[id],
!!(state & LW_VAL_EXCLUSIVE),
state & LW_SHARED_MASK,
!!(state & LW_FLAG_HAS_WAITERS),
pg_atomic_read_u32(&lock->nwaiters),
!!(state & LW_FLAG_RELEASE_OK))));


Should it be ((uint32) ((1 << 24)-1)) instead?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


lw_shared_mask.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <
> i.kurbangal...@postgrespro.ru> wrote:
> >
> > On 08/05/2015 09:33 PM, Robert Haas wrote:
> >>
> >>
> >> You're missing the point.  Those multi-byte fields have additional
> >> synchronization requirements, as I explained in some detail in my
> >> previous email. You can't just wave that away.
> >
> > I see that now. Thank you for the point.
> >
> > I've looked deeper and I found PgBackendStatus to be not a suitable
> > place for keeping information about low level waits. Really,
> PgBackendStatus
> > is used to track high level information about backend. This is why
> auxiliary
> > processes don't have PgBackendStatus, because they don't have such
> information
> > to expose. But when we come to the low level wait events then auxiliary
> > processes are as useful for monitoring as backends are. WAL writer,
> > checkpointer, bgwriter etc are using LWLocks as well. This is certainly
> unclear
> > why they can't be monitored.
> >
>
> I think the chances of background processes stuck in LWLock is quite less
> as compare to backends as they do the activities periodically.  As an
> example
> WALWriter will take WALWriteLock to write the WAL, but actually there will
> never
> be any much contention for WALWriter. In synchronous_commit = on, the
> backends themselves write the WAL so WALWriter won't do much in that
> case and for synchronous_commit = off, backends won't write the WAL so
> WALWriter won't face any contention unless some buffers have to be written
> by bgwriter or checkpoint for which WAL is not flushed which I don't think
> would lead to any contention.
>

Hmm, synchronous_commit is a per-session variable: some transactions could
run with synchronous_commit = on, but some with synchronous_commit = off.
This is a very popular feature of PostgreSQL: achieve better performance by
making non-critical transactions asynchronous while leaving critical
transactions synchronous. Thus, contention for WALWriteLock between
backends and WALWriter could be real.

I am not denying from the fact that there could be some contention in rare
> scenarios for background processes, but I think tracking them is not as
> important as tracking the LWLocks for backends.
>

I would be more careful about calling some of these scenarios rare. As DBMS
developers we should do our best to avoid contention for LWLocks: any
contention, not only between backends and background processes. One may
assume that high LWLock contention is a rare scenario in general; now that
we're here, we don't think so, though.
You claim that there couldn't be contention for WALWriteLock between
backends and WALWriter. This is unclear to me: I think there could be. Nobody
opposes tracking wait events for backends and tracking them for background
processes. I think we need to track both in order to provide the full picture
to the DBA.

Also as we are planning to track the wait_event information in
> pg_stat_activity
> along with other backends information, it will not make sense to include
> information about backend processes in this variable as pg_stat_activity
> just displays information of backend processes.
>

I'm not objecting to tracking only backend information in
pg_stat_activity. But I think we should also have some other way of tracking
wait events for background processes. We should think this through before
extending pg_stat_activity, to avoid design issues later.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov <aekorot...@gmail.com>
> wrote:
> >
> > On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <
> i.kurbangal...@postgrespro.ru> wrote:
> >>
> >>
> >> I've looked deeper and I found PgBackendStatus to be not a suitable
> >> place for keeping information about low level waits. Really,
> PgBackendStatus
> >> is used to track high level information about backend. This is why
> auxiliary
> >> processes don't have PgBackendStatus, because they don't have such
> information
> >> to expose. But when we come to the low level wait events then auxiliary
> >> processes are as useful for monitoring as backends are. WAL writer,
> >> checkpointer, bgwriter etc are using LWLocks as well. This is certainly
> unclear
> >> why they can't be monitored.
> >>
> >> This is why I think we shoudn't place wait event into PgBackendStatus.
> It
> >> could be placed into PGPROC or even separate data structure with
> different
> >> concurrency model which would be most suitable for monitoring.
> >
> >
> > +1 for tracking wait events not only for backends
> >
> > Ildus, could you do following?
> > 1) Extract LWLocks refactoring into separate patch.
> > 2) Make a patch with storing current wait event information in PGPROC.
> >
>
> Now as Robert has committed standardization of lwlock names in
> commit - aa65de04, let us try to summarize and work on remaining parts
> of the patch. So I think now the next set of work is as follows:
>
> 1. Modify the tranche mechanism so that information about LWLocks
> can be tracked easily.  For this already there is some discussion, ideas
> and initial patch is floated in this thread and there doesn't seem to be
> much
> conflicts, so we can write the patch for it.  I am planning to write or
> modify
> the existing patch unless you, IIdus or anyone has objections or want to
> write it, please let me know to avoid duplication of work.
>
> 2. Track wait_event in pg_stat_activity.  Now here the main point where
> we doesn't seem to be in complete agreement is that shall we keep it
> as one byte in PgBackendStatus or use two or more bytes to store
> wait_event information and still there is another point made by you to
> store it in PGProc rather than in PgBackendStatus so that we can display
> information of background processes as well.
>
> Now as a matter of consensus, I think Tom has raised a fair point [1]
> against
> storing this information in PGProc and I feel that it is relatively less
> important to have such information about background processes and the
> reason for same is explained upthread [2].  About having more than one-byte
> to store information about various wait_events, I think now we will not
> have
> more than 100 or so events to track, do you really think that anytime in
> forseeable
> future we will have more than 256 important events which we would like to
> track?
>
> So I think about this lets first try to build the consensus and then
> attempt to
> write or modify the patch.
>
>
> [1] - http://www.postgresql.org/message-id/4067.1439561...@sss.pgh.pa.us
> [2] -
> http://www.postgresql.org/message-id/CAA4eK1JRQGTgRMRao6H65rA=fhifucnqhx7ongy58um6rn1...@mail.gmail.com
>

In order to build consensus we need a roadmap for waits monitoring.
Would a single byte in PgBackendStatus be the only way of tracking wait
events? Could we have a pluggable infrastructure for waits monitoring: for
instance, hooks for wait event begin and end?
A limit of 256 wait events is probably OK. But we have no room for any
additional parameters of a wait event. For instance, if you notice high
contention on the buffer content lock, then it would be nice to figure out:
how many blocks are involved, and which ones? We need to track additional
parameters in order to answer this question.
Comments about tracking only backends are in [1] and [2].
PGPROC is not the only way to track events with parameters for every
process. We could try to extend PgBackendStatus or build a separate
data structure.

However, this depends on our general roadmap. If pg_stat_activity is
just the default, while full-featured waits monitoring could be a plugin,
then it's probably OK.
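To make the pluggable idea concrete, a hook-based design could look roughly like the sketch below. All names here (wait_event_begin_hook and so on) are assumptions invented for illustration, not an existing PostgreSQL API; the extra parameter stands for event-specific detail such as a block number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical hook points for pluggable waits monitoring.  None of these
 * names exist in PostgreSQL; they only illustrate the proposed shape.
 */
typedef void (*wait_event_begin_hook_type) (uint32_t class_id,
                                            uint32_t event_id,
                                            uint32_t extra);
typedef void (*wait_event_end_hook_type) (uint32_t class_id,
                                          uint32_t event_id);

static wait_event_begin_hook_type wait_event_begin_hook = NULL;
static wait_event_end_hook_type wait_event_end_hook = NULL;

/* Core code would call these around every potentially blocking operation. */
static void
report_wait_begin(uint32_t class_id, uint32_t event_id, uint32_t extra)
{
    if (wait_event_begin_hook)
        wait_event_begin_hook(class_id, event_id, extra);
}

static void
report_wait_end(uint32_t class_id, uint32_t event_id)
{
    if (wait_event_end_hook)
        wait_event_end_hook(class_id, event_id);
}

/* A toy monitoring "plugin" that just counts the events it sees. */
static int observed_events = 0;

static void
count_begin(uint32_t class_id, uint32_t event_id, uint32_t extra)
{
    (void) class_id; (void) event_id; (void) extra;
    observed_events++;
}

static void
count_end(uint32_t class_id, uint32_t event_id)
{
    (void) class_id; (void) event_id;
    observed_events++;
}

/* Simulate one monitored wait; returns how many hook calls were observed. */
int
simulate_monitored_wait(void)
{
    wait_event_begin_hook = count_begin;
    wait_event_end_hook = count_end;

    report_wait_begin(1 /* e.g. LWLock class */, 42 /* event id */, 7);
    /* ...the actual blocking wait would happen here... */
    report_wait_end(1, 42);

    return observed_events;
}
```

With hooks like these, the one-byte pg_stat_activity column could remain the in-core default, while an extension installs richer tracking (with parameters, for all process types) without core changes.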

[1] -
http://www.postgresql.org/message-id/capphfdsnju40qud2sqjs_iu+z-jm-b_f39foc7eydjvza5e...@mail.gmail.com
[2] -
http://www.postgresql.org/message-id/9a99c2a7-1760-419f-bdc9-a2cf99ecd...@simply.name

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-09-20 Thread Alexander Korotkov
On Sun, Sep 20, 2015 at 5:02 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:

> On 2015-09-18 14:58, Alexander Korotkov wrote:
>
>> On Wed, Sep 16, 2015 at 8:44 PM, Teodor Sigaev <teo...@sigaev.ru
>> <mailto:teo...@sigaev.ru>> wrote:
>>
>> validate_opclass was renamed to amvalidate.
>>
>>
>> It seems to me, that amvalidate method of AM should get as argument
>> only Oid of operator family. Layout and meaning of amproc/amop
>> fields are differ for different AM and there isn't an AM which
>> implements all possible features.
>>
>> After, further personal discussion with Teodor, we decided that
>> amvalidate is out of scope for this patch.
>> It's not evident what should we validate in amvalidate and which way. I
>> think if we need amvalidate it should be subject of separate patch.
>> The attached patch exposes index access method parameters to SQL using
>> following fucntions:
>>   * get_am_param_oid(oid, text)
>>   * get_am_param_int(oid, text)
>>   * get_am_param_bool(oid, text)
>>
>>
> Hmm, we might want these functons in any case (although I think just one
> function which would return all am params would be better).
>
> But why is it not evident? We do the validations in regression tests, even
> if we just copy those then it's enough for a start
>

The reason is that, so far, those validations have been used only in
regression tests. They weren't applied to user-defined operator classes. A
user might define an invalid opclass and then alter it into a valid one, or
upgrade an opclass in two steps where the intermediate step is invalid. This
is why I think validating opclasses in the CREATE/ALTER commands is not an
obvious choice: it changes user-visible behavior and compatibility.
At the same time, implementing a new API function just for regression tests
doesn't seem worth it to me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-09-20 Thread Alexander Korotkov
On Sun, Sep 20, 2015 at 5:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Petr Jelinek <p...@2ndquadrant.com> writes:
> > On 2015-09-18 14:58, Alexander Korotkov wrote:
> >> After, further personal discussion with Teodor, we decided that
> >> amvalidate is out of scope for this patch.
> >> It's not evident what should we validate in amvalidate and which way. I
> >> think if we need amvalidate it should be subject of separate patch.
>
> > But why is it not evident? We do the validations in regression tests,
> > even if we just copy those then it's enough for a start.
>
> I think the main reason this question is in-scope for this patch is
> precisely the problem of what do we do about the regression tests.
>
> I'm not in favor of exposing some SQL-level functions whose sole purpose
> is to support those regression test queries, because while those queries
> are very useful for detecting errors in handmade opclasses, they're hacks,
> and always have been.  They don't work well (or at all, really) for
> anything more than btree/hash cases.  It'd be better to expose amvalidate
> functions, even if we don't yet have full infrastructure for them.
>

I'm OK with continuing work on amvalidate if we can build consensus on
its design.
Could you please give some feedback on the amvalidate version of the patch?
http://www.postgresql.org/message-id/capphfds8zywenz9vw6te5rzxbol1vu_wsw181veq+mu+v1d...@mail.gmail.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-24 Thread Alexander Korotkov
On Thu, Sep 24, 2015 at 6:32 PM, Andres Freund <and...@anarazel.de> wrote:

> On 2015-09-15 20:16:10 +0300, YUriy Zhuravlev wrote:
> > We will be tested.
>
> Did you have a chance to run some benchmarks?
>

Yes, we now have a 60-physical-core Intel server, and we're running
benchmarks on it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: Rework access method interface

2015-09-25 Thread Alexander Korotkov
On Fri, Sep 25, 2015 at 6:25 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Petr Jelinek <p...@2ndquadrant.com> writes:
> > On 2015-09-25 16:11, Teodor Sigaev wrote:
> >> In attach a bit modified patch based on 4-th version, and if there are
> >> no strong objections, I will commit it. Waiting this patch stops
> >> Alexander's development on CREATE ACCESS METHOD and he needs to move
> >> forward.
>
> > The code itself looks ok to me, but before we can think about committing
> > this the documentation has to be updated.
>
> Yes.  Also, for a major change like this, it would be a good thing if
> the documentation got a review from a native English speaker.  I will
> volunteer to handle it if someone else does the first draft.
>

Great! I'll write this documentation.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-18 Thread Alexander Korotkov
On Wed, Sep 16, 2015 at 7:47 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier <
> michael.paqu...@gmail.com> wrote:
>
>> On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
>
> > A also added additional check in findCommonAncestorTimeline(). Two
>> standbys
>> > could be independently promoted and get the same new timeline ID. Now,
>> it's
>> > checked that timelines, that we assume to be the same, have equal
>> begins.
>> > Begins could still match by coincidence. But the same risk exists in
>> > original pg_rewind, though.
>>
>> Really? pg_rewind blocks attempts to rewind two nodes that are already
>> on the same timeline before checking if their timeline history map at
>> some point or not:
>> /*
>>  * If both clusters are already on the same timeline, there's
>> nothing to
>>  * do.
>>  */
>> if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
>> ControlFile_source.checkPointCopy.ThisTimeLineID)
>> pg_fatal("source and target cluster are on the same
>> timeline\n");
>> And this seems really justified to me. Imagine that you have one
>> master, with two standbys linked to it. If both standbys are promoted
>> to the same timeline, you could easily replug them to the master, but
>> I fail to see the point to be able to replug one promoted standby with
>> the other in this case: those nodes have segment and history files
>> that map with each other, an operator could easily mess up things in
>> such a configuration.
>>
>
> Imagine following configuration of server.
>   1
>  / \
> 2   3
> |
> 4
>
> Initially, node 1 is master, nodes 2 and 3 are standbys for node 1. node 4
> is cascaded standby for node 3.
> Then node 2 and node 3 are promoted. They both got timeline number 2. Then
> node 3 is promoted and gets timeline number 3.
> Then we could try to rewind node 4 with node 2 as source. How pg_rewind
> could know that timeline number 2 for those nodes is not the same?
> We can only check that those timelines are forked from timeline 1 at the
> same place. But coincidence is still possible.
>

BTW, an option would be to generate a system_identifier for each new
timeline, by analogy with what initdb does for the whole WAL.
Having such system identifiers, we could distinguish different timelines
that have been assigned the same ID.
Any thoughts?
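For illustration, a per-timeline identifier could be derived the way initdb derives the cluster-wide system identifier, from the current time and process id. The exact bit layout below is an assumption for the sketch, not a proposal detail.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Sketch: derive a 64-bit identifier for a newly created timeline, in the
 * spirit of how initdb computes the cluster-wide system identifier.
 * The exact bit layout here is illustrative only.
 */
uint64_t
generate_timeline_identifier(void)
{
    struct timeval tv;
    uint64_t id;

    gettimeofday(&tv, NULL);
    id = (uint64_t) tv.tv_sec << 32;    /* seconds in the high half */
    id |= (uint64_t) tv.tv_usec << 12;  /* microseconds in the middle */
    id |= (uint64_t) getpid() & 0xFFF;  /* low bits from the pid */
    return id;
}
```

Two standbys promoted independently would then get the same timeline number but (almost certainly) different identifiers, so pg_rewind could detect the mismatch instead of comparing timeline begin points.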

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-12-08 Thread Alexander Korotkov
On Tue, Dec 8, 2015 at 1:04 PM, Andres Freund <and...@anarazel.de> wrote:

> On 2015-12-08 12:53:49 +0300, Alexander Korotkov wrote:
> > ​This is why atomic increment *could be* cheaper than loop over CAS and,
> it
> > worth having experiments. ​Another idea is that we can put arbitrary
> logic
> > between lwarx and stwcx. Thus, we can implement PinBuffer using single
> loop
> > of lwarx and stwcx which could be better than loop of CAS.
>
> You can't really put that much between an ll/sc - the hardware is only
> able to track a very limited number of cacheline references.
>

I have some doubts about this, but I didn't find a place where it's
explicitly documented. In the case of LWLockAttemptLock there isn't very
much between the lwarx/stwcx: 4 instructions, while CAS has 2 instructions.
Could you please share a link to the docs, if any?


> > 3) lwlock-increment.patch – LWLockAttemptLock change state using atomic
> > increment operation instead of loop of CAS. This patch does it for
> > LWLockAttemptLock like pinunpin-increment.patch does for PinBuffer.
> > Actually, this patch is not directly related to buffer manager. However,
> > it's nice to test loop of CAS vs atomic increment in different places.
>
> Yea, that's a worthwhile improvement. Actually it's how the first
> versions of the lwlock patches worked - unfortunately I couldn't see big
> differences on hardware I had available at the time.
>
> There's some more trickyness required than what you have in your patch
> (afaics at least). The problem is that when you 'optimistically'
> increment by LW_VAL_SHARED and notice that there actually was another
> locker, you possibly, until you've 'fixed' the state, are blocking new
> exclusive lockers from acquiring the locks.  So you additionally need to
> do special handling in these cases, and check the queue more.
>

Agreed. This patch needs to be carefully verified. Current experiments just
show that this is a promising direction for improvement. I'll come back with
a better version of this patch.

Also, after testing on large machines I have another observation to share.
Currently, LWLock doesn't guarantee that an exclusive lock will ever be
acquired (assuming each shared lock duration is finite). That's because when
there is no exclusive lock, new shared locks aren't queued and the LWLock
state is changed directly. Thus, a process which tries to acquire an
exclusive lock has to wait for a gap in the shared locks. But with high
concurrency on the shared lock, such a gap could occur very rarely, or never.

We did see this on a big Intel machine in practice. pgbench -S takes the
shared ProcArrayLock very frequently. Once some number of connections is
reached, new connections hang on acquiring the exclusive ProcArrayLock. I
think we could do some workaround for this problem. For instance, when an
exclusive lock waiter has waited past some timeout, it could set a special
bit which prevents others from taking new shared locks.
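For reference, the single-atomic-increment shared-lock acquisition discussed in this thread can be sketched as below. This is a simplified standalone model, not lwlock.c: the constants, the state layout, and the omission of all wait-queue handling are assumptions, and the undo window in the conflict path is exactly where the extra care Andres mentioned is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Standalone model of the "single atomic increment" shared-lock path.
 * The constants and state layout are simplified assumptions.
 */
#define LW_VAL_SHARED     1u
#define LW_VAL_EXCLUSIVE  (1u << 24)   /* marks an exclusive holder */

typedef struct LWLockSketch
{
    _Atomic uint32_t state;
} LWLockSketch;

/* Try to take the lock in shared mode with one fetch-add; true on success. */
bool
try_shared_lock(LWLockSketch *lock)
{
    uint32_t oldstate = atomic_fetch_add(&lock->state, LW_VAL_SHARED);

    if (oldstate & LW_VAL_EXCLUSIVE)
    {
        /*
         * An exclusive holder is present: undo our optimistic increment.
         * Until the subtraction lands, this transient increment can
         * confuse new exclusive lockers, so a real implementation needs
         * extra handling and queue rechecks here.
         */
        atomic_fetch_sub(&lock->state, LW_VAL_SHARED);
        return false;
    }
    return true;
}
```

The fast path is one fetch-add instead of a CAS loop, which is the hoped-for win under heavy shared-lock traffic; the slow path is where the correctness work lives.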

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-12-09 Thread Alexander Korotkov
On Tue, Dec 8, 2015 at 6:00 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> On Tue, Dec 8, 2015 at 3:56 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>>
>> ​Agree. This patch need to be carefully verified. Current experiments
>> just show that it is promising direction for improvement. I'll come with
>> better version of this patch.
>>
>> Also, after testing on large machines I have another observation to
>> share. For now, LWLock doesn't guarantee that exclusive lock would be ever
>> acquired (assuming each shared lock duration is finite). It because when
>> there is no exclusive lock, new shared locks aren't queued and LWLock state
>> is changed directly. Thus, process which tries to acquire exclusive lock
>> have to wait for gap in shared locks.
>>
>
> I think this has the potential to starve exclusive lockers in worst case.
>
>
>> But with high concurrency for shared lock that could happen very rare,
>> say never.
>>
>> We did see this on big Intel machine in practice. pgbench -S gets shared
>> ProcArrayLock very frequently. Since some number of connections is
>> achieved, new connections hangs on getting exclusive ProcArrayLock. I think
>> we could do some workaround for this problem. For instance, when exclusive
>> lock waiter have some timeout it could set some special bit which prevents
>> others to get new shared locks.
>>
>>
> I think timeout based solution would lead to giving priority to
> exclusive lock waiters (assume a case where each of exclusive
> lock waiter timesout one after another) and make shared lockers
> wait and a timer based solution might turn out to be costly for
> general cases where wait is not so long.
>

Since all LWLock waiters are ordered in the queue, we can let only the first
waiter set this bit.
Anyway, once the bit is set, shared lockers would be added to the queue.
They would get the lock in queue order.


> Another way could be to
> check if the Exclusive locker needs to go for repeated wait for a
> couple of times, then we can set such a bit.
>

I'm not sure what you mean by repeated wait. Do you mean the exclusive
locker was woken up twice by timeout? Because right now, without a timeout,
the exclusive locker wouldn't be woken up until all shared locks are released.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] Proposal: custom compression methods

2015-12-13 Thread Alexander Korotkov
Hackers,

I'd like to propose a new feature: "Custom compression methods".

*Motivation*

Currently, when a datum doesn't fit the page, PostgreSQL tries to compress
it using the PGLZ algorithm. Compression of particular attributes can be
turned on/off by tuning the storage parameter of the column. Also, there is
a heuristic that a datum is not compressible when its first KB is not
compressible. I can see the following reasons for improving this situation.

 * The heuristic used for detecting compressible data may not be optimal.
We have already hit this situation with jsonb.
 * For some data distributions there could be more effective compression
methods than PGLZ. For example:
   * For natural languages we could use predefined dictionaries, which
would allow us to compress even relatively short strings (which are not
long enough for PGLZ to train its dictionary).
   * For jsonb/hstore we could implement compression methods which have a
dictionary of keys. This could be either a static predefined dictionary or
a dynamically extended dictionary with some storage.
   * For jsonb and other container types we could implement compression
methods which would allow extraction of particular fields without
decompressing the full value.

Therefore, it would be nice to make compression methods pluggable.

*Design*

Compression methods would be stored in a new pg_compress system catalog
table with the following structure:

compname        name
comptype        oid
compcompfunc    regproc
compdecompfunc  regproc

Compression methods could be created by a "CREATE COMPRESSION METHOD"
command and deleted by a "DROP COMPRESSION METHOD" command.

CREATE COMPRESSION METHOD compname [FOR TYPE comptype_name]
WITH COMPRESS FUNCTION compcompfunc_name
 DECOMPRESS FUNCTION compdecompfunc_name;
DROP COMPRESSION METHOD compname;

The signatures of compcompfunc and compdecompfunc would be similar to
pglz_compress and pglz_decompress, except for the compression strategy.
There is only one compression strategy in use for PGLZ
(PGLZ_strategy_default). Thus, I'm not sure it would be useful to provide
multiple strategies for compression methods.

extern int32 compcompfunc(const char *source, int32 slen, char *dest);
extern int32 compdecompfunc(const char *source, int32 slen, char *dest,
int32 rawsize);
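As an illustration of how a custom compression method would plug into these signatures, here is a toy run-length codec. Only the function signatures mirror the proposal; the codec itself, and int32_t standing in for the PostgreSQL int32 typedef, are assumptions for a standalone sketch. Following the pglz convention, the compressor returns -1 when the output would not be smaller than the input.

```c
#include <stdint.h>
#include <string.h>

/*
 * Toy compression method following the proposed function signatures.
 * The run-length scheme is invented purely for illustration.
 */
int32_t
rle_compress(const char *source, int32_t slen, char *dest)
{
    int32_t in = 0, out = 0;

    while (in < slen)
    {
        char    c = source[in];
        int32_t run = 1;

        /* Count a run of identical bytes, capped so it fits in one byte. */
        while (in + run < slen && source[in + run] == c && run < 255)
            run++;

        dest[out++] = (char) run;
        dest[out++] = c;
        in += run;

        if (out >= slen)
            return -1;          /* not compressible: caller stores it raw */
    }
    return out;
}

int32_t
rle_decompress(const char *source, int32_t slen, char *dest, int32_t rawsize)
{
    int32_t in = 0, out = 0;

    while (in + 1 < slen && out < rawsize)
    {
        int32_t run = (unsigned char) source[in++];
        char    c = source[in++];

        memset(dest + out, c, (size_t) run);
        out += run;
    }
    return out;                 /* equals rawsize on success */
}
```

A real method registered via CREATE COMPRESSION METHOD would wrap functions shaped like these, with toast_compress_datum dispatching on the column's configured method.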

A compression method could be type-agnostic (comptype = 0) or type-specific
(comptype != 0). The default compression method is PGLZ.

The compression method of a column would be stored in the pg_attribute
table. Dependencies between columns and compression methods would be tracked
in pg_depend, preventing a compression method which is currently in use from
being dropped. The compression method of an attribute could be altered with
an ALTER TABLE command.

ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD
compname;

Since mixing different compression methods in the same attribute would be
hard to manage (especially dependency tracking), altering an attribute's
compression method would require a table rewrite.

*Implementation details*

Catalog changes, new commands, dependency tracking etc. are mostly
mechanical work with no fundamental problems. The hardest part seems to be
providing seamless integration of custom compression methods into the
existing code.

It doesn't seem hard to add an extra compression-method parameter to
toast_compress_datum. However, PG_DETOAST_DATUM has to call the custom
decompress function with knowledge of the datum only. That means we should
somehow embed knowledge of the compression method into the datum itself.
One solution could be putting the compression method OID right after the
varlena header. Storing this on disk would cause storage overhead and break
backward compatibility; instead, we could add this OID right after reading
the datum from the page. This could be the weakest point in the whole
proposal, and I'd be very glad to hear better ideas.

P.S. I'd like to thank Petr Korobeinikov <pkorobeini...@gmail.com>, who
started work on this patch and sent me a draft of the proposal in Russian.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-12-11 Thread Alexander Korotkov
On Thu, Dec 10, 2015 at 9:26 AM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Wed, Dec 9, 2015 at 2:17 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> On Tue, Dec 8, 2015 at 6:00 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>
>>> On Tue, Dec 8, 2015 at 3:56 PM, Alexander Korotkov <
>>> a.korot...@postgrespro.ru> wrote:
>>>>
>>>> ​Agree. This patch need to be carefully verified. Current experiments
>>>> just show that it is promising direction for improvement. I'll come with
>>>> better version of this patch.
>>>>
>>>> Also, after testing on large machines I have another observation to
>>>> share. For now, LWLock doesn't guarantee that exclusive lock would be ever
>>>> acquired (assuming each shared lock duration is finite). It because when
>>>> there is no exclusive lock, new shared locks aren't queued and LWLock state
>>>> is changed directly. Thus, process which tries to acquire exclusive lock
>>>> have to wait for gap in shared locks.
>>>>
>>>
>>> I think this has the potential to starve exclusive lockers in worst case.
>>>
>>>
>>>> But with high concurrency for shared lock that could happen very rare,
>>>> say never.
>>>>
>>>> We did see this on big Intel machine in practice. pgbench -S gets
>>>> shared ProcArrayLock very frequently. Since some number of connections is
>>>> achieved, new connections hangs on getting exclusive ProcArrayLock. I think
>>>> we could do some workaround for this problem. For instance, when exclusive
>>>> lock waiter have some timeout it could set some special bit which prevents
>>>> others to get new shared locks.
>>>>
>>>>
>>> I think timeout based solution would lead to giving priority to
>>> exclusive lock waiters (assume a case where each of exclusive
>>> lock waiter timesout one after another) and make shared lockers
>>> wait and a timer based solution might turn out to be costly for
>>> general cases where wait is not so long.
>>>
>>
>> ​Since all lwlock waiters are ordered in the queue, we can let only first
>> waiter to set this bit.​
>>
>
> Thats okay, but still every time an Exclusive locker woke up, the
> threshold time for its wait might be already over and it will set the
> bit.  In theory, that looks okay, but as compare to current algorithm
> it will make more shared lockers to be added into wait queue.
>
>
>> Anyway, once bit is set, shared lockers would be added to the queue. They
>> would get the lock in queue order.
>>
>>
>
> Ye thats right, but I think in general the solution to this problem
> should be don't let any Exclusive locker to starve and still allow
> as many shared lockers as possible.  I think here it is important
> how we define starving, should it be based on time or something
> else?  I find timer based solution somewhat less suitable, but may
> be it is okay, if there is no other better way.
>

Yes, we probably should find something better.

Another way could be to
>>> check if the Exclusive locker needs to go for repeated wait for a
>>> couple of times, then we can set such a bit.
>>>
>>
>> ​I'm not sure what do you mean by repeated wait. Do you mean exclusive
>> locker was waked twice up by timeout?
>>
>
> I mean to say once the Exclusive locker is woken up, it again
> re-tries to acquire the lock as it does today, but if it finds that the
> number of retries is greater than certain threshold (let us say 10),
> then we sit the bit.
>

Yes, there is a retry loop in the LWLockAcquire function. A retry happens
when a waiter is woken up but someone else steals the lock before it. A lock
waiter is only woken up by the lock releaser when the lock becomes free.
But with high concurrency on the shared lock, it almost never becomes free,
so the exclusive locker would never be woken up. I'm pretty sure this is
what happens on the big Intel machine while we run the benchmark. So relying
on the number of retries wouldn't work in this case.
I'll run tests to verify whether retries happen in our case.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] Isolation of table creation

2015-12-11 Thread Alexander Korotkov
Hackers,

I discovered an interesting issue with PostgreSQL transaction isolation.
When a transaction is at the repeatable read isolation level, I can't see a
table which was created after the transaction obtained its snapshot, but I
can run DML statements against this table. See the example below.

Session 1
# begin transaction isolation level repeatable read;
BEGIN
# \dt
No relations found.
Session 2
# create table tmp (i int not null);
CREATE TABLE
# insert into tmp values (1);
INSERT 0 1
Session 1
# \dt
No relations found.
# select * from tmp;
 i
---
(0 rows)

# insert into tmp values (2);
INSERT 0 1
# select * from tmp;
 i
---
 2
(1 row)
# commit;
COMMIT

Is it a bug?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Optimizer questions

2016-01-05 Thread Alexander Korotkov
On Wed, Jan 6, 2016 at 12:08 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> konstantin knizhnik <k.knizh...@postgrespro.ru> writes:
> > 1. The cost compared in grouping_planner doesn't take in account price
> of get_authorized_users - it is not changed when I am altering function
> cost. Is it correct behavior?
>
> The general problem of accounting for tlist eval cost is not handled very
> well now, but especially not with respect to the idea that different paths
> might have different tlist costs.  I'm working on an upper-planner rewrite
> which should make this better, or at least make it practical to make it
> better.
>

Hmm... Besides costing, it would be nice to postpone the calculation of
expensive tlist functions until after LIMIT is applied.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] statistics for array types

2015-12-18 Thread Alexander Korotkov
On Wed, Sep 16, 2015 at 8:01 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra <
>> tomas.von...@2ndquadrant.com> wrote:
>>
>>> Hi,
>>>
>>> On 08/11/2015 04:38 PM, Jeff Janes wrote:
>>>
>>>> When reviewing some recent patches, I decided the statistics gathered
>>>>  for arrays had some pre-existing shortcomings.
>>>>
>>>> The main one is that when the arrays contain rare elements there is
>>>> no histogram to fall back upon when the MCE array is empty, the way
>>>> there is for scalar stats.  So it has to punt completely and resort
>>>> to saying that it is 0.5% selectivity without recourse to any data at
>>>> all.
>>>>
>>>> The rationale for applying the threshold before things are eligible
>>>> for inclusion in the MCE array seems to be that this puts some
>>>> theoretical bound on the amount of error we are likely to have in
>>>> that element.  But I think it is better to exceed that theoretical
>>>> bound than it is to have no data at all.
>>>>
>>>> The attached patch forces there to be at least one element in MCE,
>>>> keeping the one element with the highest predicted frequency if the
>>>> MCE would otherwise be empty.  Then any other element queried for is
>>>> assumed to be no more common than this most common element.
>>>>
>>>
>>> We only really need the frequency, right? So do we really need to keep
>>> the actual MCV element? I.e. most_common_elem_freqs does not have the
>>> same number of values as most_common_elems anyway:
>>>
>>>   A list of the frequencies of the most common element values, i.e., the
>>>   fraction of rows containing at least one instance of the given value.
>>>   Two or three additional values follow the per-element frequencies;
>>>   these are the minimum and maximum of the preceding per-element
>>>   frequencies, and optionally the frequency of null elements.
>>>   (Null when most_common_elems is.)
>>>
>>> So we might modify it so that it's always defined - either it tracks the
>>> same values as today (when most_common_elems is defined), or the
>>> frequency of the most common element (when most_common_elems is NULL).
>>>
>>
>> I had also considered that.  It requires more changes to make it happen,
>> and it seems to create a more complex contract on what those columns mean,
>> but without giving a corresponding benefit.
>>
>>
>>>
>>> This way we can keep the current theoretical error-bound on the MCE
>>> frequencies, and if that's not possible we can have at least the new
>>> value without confusing existing code.
>>
>>
>> But if the frequency of the most common element was grossly wrongly, then
>> whatever value we stick in there is still going to be grossly wrong.
>> Removing the value associated with it isn't going to stop it from being
>> wrong.  When we do query with the (incorrectly thought) first most common
>> element, either it will find and use the wrong value from slot 1, or it
>> will find nothing and fall back on the same wrong value from slot 3.
>>
>
> Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we
> collect no MCEs. Moreover, I think we should store it even when num_mcelem
> >= track_len and we haven't cut any MCEs we found. In this case we can get a
> more precise estimate for a rare element using the knowledge that all MCEs
> which exceed the threshold are present (assuming their frequencies could be
> much higher than the threshold).
>
> When there are no MCEs, we should use the assumption that there are no
> elements more frequent than cutoff_freq / nonnull_cnt. Using lower values
> wouldn't be statistically correct.
>

The patch implementing my idea above is attached. In your example it
gives the following result.

# explain (analyze) select * from foobar where foo @> '{567}';
  QUERY PLAN
---
 Seq Scan on foobar  (cost=0.00..2387.00 rows=30 width=61) (actual
time=28.691..28.691 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 10
 Planning time: 0.044 ms
 Execution time: 28.707 ms
(5 rows)

In this particular example it gives a less accurate estimate. However, I
believe it would be a safer estimate in general.
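The proposed rule can be sketched as follows (a Python illustration with hypothetical names; the actual logic lives in the array selectivity estimator in C):

```python
def element_selectivity(elem, mce, mce_freqs, minfreq):
    """Estimate the fraction of rows whose array contains elem.

    mce / mce_freqs: the most common elements and their frequencies.
    minfreq: with the proposed change, when no MCEs were collected this
    holds cutoff_freq / nonnull_cnt -- the threshold an element had to
    exceed to be collected -- so any element absent from the MCE list
    is assumed to be no more frequent than that.
    """
    if elem in mce:
        return mce_freqs[mce.index(elem)]
    # elem wasn't collected as an MCE, so its frequency can't have
    # exceeded the cutoff threshold
    return minfreq

# An element absent from an empty MCE list falls back to the cutoff
print(element_selectivity(567, [], [], 0.0003))  # 0.0003
```

The key point is that the fallback is a statistically justified upper bound, not an arbitrary default.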

I've faced difficulties with storing an empty mcelements array:
update_attstats turns an empty array into NULL, and get_attstatsslot
throws an error when trying to read a NULL array. I've changed
get_attstatsslot so that it returns an empty array when it encounters
NULL. I'm not sure about this solution.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


array_typanalyze_0_mce.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commit fest status for 2015-11

2015-12-24 Thread Alexander Korotkov
On Thu, Dec 24, 2015 at 6:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Michael Paquier <michael.paqu...@gmail.com> writes:
> > And the CF is now closed. Here is the final score:
> > Committed: 30.
> > Moved to next CF: 42.
> > Rejected: 9.
> > Returned with Feedback: 22.
> > Total: 103.
>
> Many thanks to Michael for doing the CF management work this time!
>

+1
Michael, thank you for the great job.


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Patch: pg_trgm: gin index scan performance for similarity search

2015-12-24 Thread Alexander Korotkov
Hi, Christophe!

On Thu, Dec 24, 2015 at 6:28 PM, Fornaroli Christophe <cforn...@gmail.com>
wrote:

> This code uses this upper bound for the similarity: ntrue / (nkeys -
> ntrue). But if there are ntrue trigrams in common, we know that the indexed
> string is at least ntrue trigrams long. We can then use a more aggressive
> upper bound: ntrue / (ntrue + nkeys - ntrue) or ntrue / nkeys. Attached is
> a patch that changes this.
>

Good catch, thank you! The estimate in pg_trgm was not optimal.
I think it would be good to add a comment that explicitly states why we
use this upper bound.
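To make the reasoning such a comment could capture explicit: trigram similarity is shared / (len1 + len2 - shared), and since the indexed string must contain every trigram it shares with the query, len2 >= ntrue, which yields the tighter bound. A small Python sketch with hypothetical numbers:

```python
def similarity(len1, len2, shared):
    # trigram similarity: shared trigrams over the size of the union
    return shared / (len1 + len2 - shared)

def old_upper_bound(ntrue, nkeys):
    # previous bound: assumes nothing about the indexed string's length
    return ntrue / (nkeys - ntrue)

def new_upper_bound(ntrue, nkeys):
    # the indexed string has len2 >= ntrue, so
    # similarity <= ntrue / (nkeys + ntrue - ntrue) = ntrue / nkeys
    return ntrue / nkeys

# e.g. a 9-trigram query sharing 3 trigrams with a 5-trigram string:
# actual similarity is 3/11, below the new bound 3/9, which in turn
# is tighter than the old bound 3/6
```

A tighter upper bound lets the index scan skip more non-matching items without ever rejecting a true match.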

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2015-12-19 Thread Alexander Korotkov
On Fri, Dec 18, 2015 at 10:53 PM, Artur Zakirov <a.zaki...@postgrespro.ru>
wrote:

> On 18.12.2015 22:43, Artur Zakirov wrote:
>
>> Hello.
>>
>> PostgreSQL has a contrib module named pg_trgm. It is used for fuzzy
>> text search. It provides some functions and operators for determining the
>> similarity of the given texts using trigram matching.
>>
>> Sorry, I have forgotten to mark previous message with [PROPOSAL].
>

I think you shouldn't mark the thread as [PROPOSAL] since you're providing
a full patch.


> I registered the patch in commitfest:
> https://commitfest.postgresql.org/8/448/


Great.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-12-01 Thread Alexander Korotkov
On Sat, Nov 28, 2015 at 2:27 PM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Fri, Nov 27, 2015 at 11:42 PM, Teodor Sigaev <teo...@sigaev.ru> wrote:
> > Seems the patch is ready to commit, but it needs some documentation.
>
> Of what kind? The documentation of pg_rewind is rather explicit on the
> subject and looks fine as-is, and that's what Alexander and I agreed
> on upthread. If there is something forgotten though, this may be the
> fact that we do not mention in 9.5's documentation that pg_rewind can
> *not* handle timeline switches.
>

However, we can add some explicit statements about the new pg_rewind
capabilities. Please check the attached patch for those statements.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-7.patch
Description: Binary data



Re: [HACKERS] Why we don't have checksums on clog files

2016-06-07 Thread Alexander Korotkov
On Tue, Jun 7, 2016 at 5:41 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> On Mon, Jun 6, 2016 at 8:43 PM, Alex Ignatov <a.igna...@postgrespro.ru>
> wrote:
>
>> Why don't we have checksums on clog files?
>>
>> We have a checksum on pg_control, optional checksumming on data files,
>> and some form of checksumming on WAL. But why don't we have any
>> checksumming on clogs? Corruption of clogs leads to transaction
>> visibility problems and database consistency violations.
>>
>> Can anybody explain this situation with clogs?
>>
>>
> I think it would be better if users had an option to checksum clog
> pages.  However, I think that will need a change in the page (CLOG-page)
> format, which might not be trivial to accomplish.
>

I think we should think not only about CLOG, but about all SLRU pages.
For data pages we have to reinitialize the database cluster to enable
checksums. I think we can also require reinitialization to enable checksums
of SLRU pages. Thus, the major problem would be that the number of items
fitting on a page would depend on whether checksums are enabled.  However,
it shouldn't be too hard.
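To illustrate the dependency, here is a back-of-the-envelope Python sketch (the 8-byte checksum field is purely hypothetical, not a layout proposal). CLOG stores two status bits per transaction, so any bytes reserved for a checksum reduce the number of transactions a page can hold:

```python
BLCKSZ = 8192          # default PostgreSQL block size
XACTS_PER_BYTE = 4     # CLOG uses 2 status bits per transaction

def clog_xacts_per_page(checksum_bytes=0):
    # bytes left for transaction status after reserving a
    # (hypothetical) per-page checksum field
    return (BLCKSZ - checksum_bytes) * XACTS_PER_BYTE

print(clog_xacts_per_page(0))  # 32768 transactions per page today
print(clog_xacts_per_page(8))  # 32736 with an 8-byte checksum reserved
```

Since a transaction's location in CLOG is computed from its XID and the per-page capacity, changing that capacity is exactly why the cluster would need reinitialization.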

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-01-29 Thread Alexander Korotkov
On Tue, Dec 29, 2015 at 11:56 AM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Wed, Dec 16, 2015 at 12:26 AM, Robert Haas <robertmh...@gmail.com>
> wrote:
> >
> >
> > In terms of this project overall, NumLWLocks() now knows about only
> > four categories of stuff: fixed lwlocks, backend locks (proc.c),
> > replication slot locks, and locks needed by extensions.  I think it'd
> > probably be fine to move the backend locks into PGPROC directly, and
> > the replication slot locks into ReplicationSlot.
> >
>
> Ildus has written a patch to move backend locks into PGPROC, which
> I am reviewing and will do performance tests on once he responds to
> my review comments. I have written a patch to move replication
> slot locks into ReplicationSlot, which is attached to this mail.
>

This patch looks good to me.
It compiles without warnings and passes regression tests.
I also did some light testing of replication slots to check that it
works correctly.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-01-29 Thread Alexander Korotkov
On Mon, Jan 4, 2016 at 5:58 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Mon, Jan 4, 2016 at 1:20 AM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
> > If we do that way, then user of API needs to maintain the interlock
> > guarantee that the requested number of locks is same as allocations
> > done from that tranche and also if it is not allocating all the
> > LWLocks in one function, it needs to save the base address of the
> > array and then allocate from it by maintaining some counter.
> > I agree that looking up tranche names multiple times is not good
> > if there are a large number of lwlocks, but that is done at startup
> > time and not at operation-level.  Also, if we look currently at
> > the extensions in contrib, then just one of them allocates one
> > LWLock in this fashion; there could be extensions apart from
> > the ones in contrib, but I'm not sure they require a large number of
> > LWLocks, so I feel it is better to give an API which is more
> > convenient for the user to use.
>
> Well, I agree with you that we should provide the most convenient API
> possible, but I don't agree that yours is more convenient than the one
> I proposed.  I think it is less convenient.  In most cases, if the
> user asks for a large number of LWLocks, they aren't going to be each
> for a different purpose - they will all be for the same purpose, like
> one per buffer or one per hash partition.  The case where the user
> wants to allocate 8 lwlocks from an extension, each for a different
> purpose, and spread those allocations across a bunch of different
> functions probably does not exist.  *Maybe* there is somebody
> allocating lwlocks from an extension for unrelated purposes, but I'd
> be willing to bet that, if so, all of those allocations happen one
> right after the other in a single function, because anything else
> would be completely nuts.
>

I'm not sure if we should return LWLockPadded*.

+LWLockPadded *
> +GetLWLockAddinTranche(const char *tranche_name)


It would be nice to isolate extensions from knowledge about padding. Could
we one day change it from LWLockPadded * to LWLockMinimallyPadded * or
anything else?

+LWLock **
> +GetLWLockAddinTranche(const char *tranche_name)


Could we use this signature?


> >> > Also I have retained NUM_USER_DEFINED_LWLOCKS in
> >> > MainLWLockArray so that if any extensions want to assign a LWLock
> >> > after startup, it can be used from this pool.  The tranche for such
> >> > locks
> >> > will be reported as main.
> >>
> >> I don't like this.  I think we should get rid of
> >> NUM_USER_DEFINED_LWLOCKS altogether.  It's just an accident waiting to
> >> happen, and I don't think there are enough people using LWLocks from
> >> extensions that we'll annoy very many people by breaking backward
> >> compatibility here.  If we are going to care about backward
> >> compatibility, then I think the old-style lwlocks should go in their
> >> own tranche, not main.  But personally, I think it'd be better to just
> >> force a hard break and make people switch to using the new APIs.
> >
> > One point to think about before removing it altogether is that the new
> > APIs provide a way to allocate LWLocks at startup time, and if any user
> > wants to allocate a new LWLock after start-up, it will not be possible
> > with the proposed APIs.  Do you think for such cases we should ask
> > the user to use it the way we have exposed, or do you have something
> > else in mind?
>
> Allocating a new lock after startup mostly doesn't work, because there
> will be at most 3 available and sometimes less.  And even if it does
> work, it often isn't very useful because you probably need some other
> shared memory space as well - otherwise, what is the lock protecting?
> And that space might not be available either.
>
> I'd be interested in knowing whether there are cases where useful
> extensions can be loaded apart from shared_preload_libraries because
> of NUM_USER_DEFINED_LWLOCKS and our tendency to allocate a little bit
> of extra shared memory, but my suspicion is that it rarely works out
> and is too flaky to be useful to anybody.


+1 for dropping the old API.
I don't think it's useful to have LWLocks without having shared memory.

There is another thing I'd like extensions to be able to do. It would be
nice if extensions could use dynamic shared memory instead of static. Then
extensions could use shared memory without being in
shared_preload_libraries. But if an extension registers a DSM segment,
there is no way to tell other backends to use it for that extension. Also,
DSM would be deallocated when all backends detach from it. This is out of
scope for this patch, though.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-01-29 Thread Alexander Korotkov
On Thu, Jan 21, 2016 at 12:37 AM, Alvaro Herrera <alvhe...@2ndquadrant.com>
wrote:

> So far as I can tell, there are three patches in flight here:
>
> * replication slot IO lwlocks
> * ability of extensions to request tranches dynamically
> * PGPROC
>
> The first one hasn't been reviewed at all, but the other two have seen a
> bit of discussion and evolution.  Is anyone doing any more reviewing?


I'd like to add another one: a fixed tranche id for each SLRU.
And I'm going to review the others.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


slru_tranche_ids_v1.patch
Description: Binary data


