[HACKERS] transam README small fix

2016-03-01 Thread Stas Kelvich
Hi,

The transaction function call sequence description in transam/README is slightly 
outdated: SELECT is now handled by PortalRunSelect instead of ProcessQuery. It is 
also hard to tell what the indentation there means — sometimes it means 
“function called by function”, sometimes it doesn't. So I’ve also changed it to 
reflect the actual call nesting.




transam.readme.patch
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] transam README small fix

2016-03-04 Thread Stas Kelvich
Thanks.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 04 Mar 2016, at 22:14, Robert Haas  wrote:
> 
> On Tue, Mar 1, 2016 at 4:31 AM, Stas Kelvich  wrote:
>> Transaction function call sequence description in transam/README is slightly 
>> outdated. Select now handled by PortalRunSelect instead of ProcessQuery. It 
>> is also hard to follow what tabulation there means — sometimes that means 
>> “function called by function”, sometimes it isn't. So I’ve also changed it 
>> to actual call nesting.
> 
> After some study, this looks good to me, so committed.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 2PC support for pglogical

2016-03-10 Thread Stas Kelvich
Hi.

Here is a proof-of-concept version of two-phase commit support for logical 
replication. There are some changes in core postgres and in the pglogical_output and 
pglogical extensions. I’ve used the version from the 2ndQuadrant repo, pglogical 
branch, and rebased it onto current head. Some notes about this patch:

* LWLockAssign() was deleted, so I changed that to the new lock tranche API.

* It seems that the only reliable way to get the GID during replay of commit/rollback 
prepared is to force postgres to write the GID into the corresponding records; otherwise 
we can lose the correspondence between xid and gid if we are replaying data from the 
walsender while the transaction was committed some time ago. So I’ve changed 
postgres to write GIDs not only on prepare, but also on commit/rollback 
prepared. That should probably be done only at the logical WAL level, but for now I just 
want to hear some other opinions on that.

* The abort prepared xlog record also lacks database information. Normally logical 
decoding just cleans the reorder buffer when it sees an abort, but in the 2PC case we 
should send it to the callbacks anyway. So I’ve added that info to abort records.

* Prepare emits an xlog record with a TwoPhaseFileHeader in it, and that structure is 
the same as xl_xact_parsed_commit but with some fields renamed, probably just for 
historical reasons. It is possible to change PREPARE to write ordinary commit records 
with a flag set and then use the same infrastructure to parse them, so that 
DecodePrepare/DecodeCommit could be handled by the same function.





pglogical_twophase.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-03-11 Thread Stas Kelvich

> On 10 Mar 2016, at 20:29, Teodor Sigaev  wrote:
> 
> I would like to suggest rename both functions to array_to_tsvector and 
> tsvector_to_array to have consistent name. Later we could add 
> to_tsvector([regconfig, ], text[]) with morphological processing.
> 
> Thoughts?
> 

Seems reasonable, done.
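
For reference, a quick usage sketch of the renamed functions, with the results I would 
expect under the patch's semantics (lexemes only, positions and weights dropped):

select array_to_tsvector(ARRAY['fat', 'cat', 'rat']);
-- 'cat' 'fat' 'rat'

select tsvector_to_array('fat:2,4 cat:3 rat:5A'::tsvector);
-- {cat,fat,rat}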



tsvector_ops-v6.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-03-11 Thread Stas Kelvich
> 
> On 11 Mar 2016, at 16:13, Stas Kelvich  wrote:
> 
> 
>> On 10 Mar 2016, at 20:29, Teodor Sigaev  wrote:
>> 
>> I would like to suggest rename both functions to array_to_tsvector and 
>> tsvector_to_array to have consistent name. Later we could add 
>> to_tsvector([regconfig, ], text[]) with morphological processing.
>> 
>> Thoughts?
>> 

Hi, thanks for the commit.

I saw errors on Windows; here is the fix:



tsvector.fix.patch
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-03-18 Thread Stas Kelvich

> On 11 Mar 2016, at 19:41, Jesper Pedersen  wrote:
> 

Thanks for the review, Jesper.

> Some comments:
> 
> * The patch needs a rebase against the latest TwoPhaseFileHeader change

Done.

> * Rework the check.sh script into a TAP test case (src/test/recovery), as 
> suggested by Alvaro and Michael down thread

Done. Originally I thought about reducing the number of tests (11 right now), but 
now, after some debugging, I’m more convinced that it is better to include them 
all, as they really do test different code paths.

> * Add documentation for RecoverPreparedFromXLOG

Done.

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



twophase_replay.v2.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] async replication code

2016-03-19 Thread Stas Kelvich

> On 16 Mar 2016, at 15:52, otheus uibk  wrote:
> 
> Greetings,

Hi

> 
> I am new here.  Apologetic question: how can i search the archives of this 
> mailing list? Is there some set of magic words to put in Google? Must I wade 
> through 20 pages of hits? Should I download all the archives and grep? :) 
> 

http://www.postgresql.org/list/

https://wiki.postgresql.org/wiki/Mailing_Lists

> Main question:  I want to understand very precisely the exact algirithm used 
> in PG to do asynchronous streaming replication. Eg, I want to know how the 
> record is written and flushed to the socket and how that happens in time 
> w.r.t WAL-to-disk and response to client. Will someone please give me a head 
> start as to where to begin my code-spelunking? 
> 

The WAL is synced to disk first; everything else happens after that.

http://www.postgresql.org/docs/9.5/static/warm-standby.html#STREAMING-REPLICATION

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-19 Thread Stas Kelvich
On 12 Mar 2016, at 13:19, Michael Paquier  wrote:
> 
> On Fri, Mar 11, 2016 at 9:35 PM, Tom Lane  wrote:
>> IMO this is not committable as-is, and I don't think that it's something
>> that will become committable during this 'fest.  I think we'd be well
>> advised to boot it to the 2016-09 CF and focus our efforts on other stuff
>> that has a better chance of getting finished this month.
> 
> Yeah, I would believe that a good first step would be to discuss
> deeply about that directly at PGCon for folks that will be there and
> interested in the subject. It seems like a good timing to brainstorm
> things F2F at the developer unconference for example, a couple of
> months before the 1st CF of 9.7. We may perhaps (or not) get to
> cleaner picture of what kind of things are wanted in this area.

To give an overview of XTM coupled with postgres_fdw from a user's perspective, I’ve 
packaged the patched postgres with Docker and provided a test case where it is easy to 
spot a violation of the READ COMMITTED isolation level without XTM.

The test fills a database with users across two shards that are connected by postgres_fdw 
and inherit the same table. Then it starts to concurrently transfer money between users 
on different shards:

begin;
update t set v = v - 1 where u=%d; -- this is user from t_fdw1, first shard
update t set v = v + 1 where u=%d; -- this is user from t_fdw2, second shard
commit;

The test also concurrently runs a reader thread that counts all the money in the system:

select sum(v) from t;

In a transactional system we expect that the sum is always constant (zero 
in our case, as we initialize every user with a zero balance).
But we can see that without tsdtm the total amount of money fluctuates around zero.

https://github.com/kelvich/postgres_xtm_docker

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fd.c: flush data problems on osx

2016-03-19 Thread Stas Kelvich

> On 17 Mar 2016, at 20:23, Andres Freund  wrote:
> 
>> 
>> Also there are no default ifdef inside this function, is there any
>> check that will guarantee that pg_flush_data will not end up with
>> empty body on some platform?
> 
> There doesn't need to be - it's purely "advisory", i.e. just an
> optimization.


Ah, okay, then I misinterpreted the purpose of that function, and it shouldn’t be 
forced to sync.

> 
>> One possible solution for that is just fallback to pg_fdatasync in case when 
>> offset = nbytes = 0.
> 
> Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
> the file size. Could you test that?
> 

It looks like OS X mmap raises EINVAL when the length isn’t aligned to the page size, 
even though the manual says it can be of arbitrary length, so I aligned it.
Also, there was a call to mmap with PROT_READ | PROT_WRITE, but when called from 
pre_sync_fname the file descriptor is just O_RDONLY, so I changed the mmap mode to 
PROT_READ — it seems PROT_WRITE wasn’t needed anyway.

All of that reduces the number of warnings by an order of magnitude, but there are 
still some, and I don’t yet understand why they are happening.

> 
> Greetings,
> 
> Andres Freund
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



flushdata.v2.patch
Description: Binary data




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] fd.c: flush data problems on osx

2016-03-20 Thread Stas Kelvich
Hi

The current implementation of pg_flush_data, when called with a zero offset and zero 
nbytes, is supposed to flush the whole file. On OS X it uses mmap with those arguments, 
but according to the man page:

"[EINVAL]   The len argument was negative or zero. Historically, the 
system call would not return an
error if the argument was zero.  See other potential 
additional restrictions in the COMPAT-
IBILITY section below."

so it generates a lot of warnings:

"WARNING:  could not mmap while flushing dirty data: Invalid argument"

The call to pg_flush_data with zero offset and nbytes happens when a replica starts 
from a base backup with fsync=on. A TAP test to reproduce this is attached.

One possible solution is to just fall back to pg_fdatasync when offset = nbytes = 0.

Also, there is no default branch among the ifdefs inside this function; is there any 
check that guarantees pg_flush_data will not end up with an empty body on some 
platform?

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
# Minimal test testing streaming replication
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Initialize master node
my $node_master = get_new_node('master');
$node_master->init(allows_streaming => 1);
$node_master->append_conf(
	'postgresql.conf', qq(
fsync = 'on'
));
$node_master->start;
my $backup_name = 'my_backup';

# Take backup
$node_master->backup($backup_name);

# Create streaming standby linking to master
my $node_standby = get_new_node('standby');
$node_standby->init_from_backup($node_master, $backup_name);
$node_standby->start;


flushdata.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fd.c: flush data problems on osx

2016-03-21 Thread Stas Kelvich

> On 18 Mar 2016, at 14:45, Stas Kelvich  wrote:
>> 
>>> One possible solution for that is just fallback to pg_fdatasync in case 
>>> when offset = nbytes = 0.
>> 
>> Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
>> the file size. Could you test that?
>> 
> 
> It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize 
> while manual says it can be of arbitrary length, so i aligned it.
> Also there were call to mmap with PROT_READ | PROT_WRITE, but when called 
> from pre_sync_fname file descriptor is just O_RDONLY, so i changed mmap mode 
> to PROT_READ — seems that PROT_WRITE wasn’t needed anyway.
> 
> And all of that reduces number of warnings in order of magnitude but there 
> are still some and I don’t yet understand why are they happening.

I’ve spent some more time on this issue and found that the remaining warnings were 
caused by mmap-ing directories — that raises EINVAL on OS X (probably not only 
OS X, but I didn’t try).
So I’ve skipped mmap for directories and now restore happens without warnings. I’ve 
also fixed a wrong error check that was in the previous version of the patch.




flushdata.v3.patch
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fd.c: flush data problems on osx

2016-03-21 Thread Stas Kelvich
On 21 Mar 2016, at 14:53, Andres Freund  wrote:
> Hm. I think we should rather just skip calling pg_flush_data in the
> directory case, that very likely isn't beneficial on any OS.

Seems reasonable, changed.

> I think we still need to fix the mmap() implementation to support the
> offset = 0, nbytes = 0 case (via fseek(SEEK_END).

It is already in this diff; I added it a few messages ago.



flushdata.v4.patch
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] we have added support for box type in SP-GiST index

2016-03-21 Thread Stas Kelvich

> On 21 Mar 2016, at 22:38, Kevin Grittner  wrote:
> 
> On Mon, Mar 21, 2016 at 11:57 AM, Teodor Sigaev  wrote:
> 
>>> I couldn't get the term 4D point.  Maybe it means that we are
>>> using box datatype as the prefix, but we are not treating them
>>> as boxes.
>> 
>> exactly, we treat boxes as 4-dimentional points.
> 
> I'm not entirely sure I understand the terminology either, since I
> tend to think of a *point* as having *zero* dimensions.  Would it
> perhaps be more accurate to say we are treating a 2-dimensional box
> as a point in 4-dimensional space?

Or just say 4-d vector instead of 4-d point. That looks rigorous enough.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] we have added support for box type in SP-GiST index

2016-03-21 Thread Stas Kelvich

> On 22 Mar 2016, at 01:47, Jim Nasby  wrote:
> 
> On 3/21/16 11:57 AM, Teodor Sigaev wrote:
>> A and B are points of intersection of lines. So, box PBCAis a bounding
>> box for points contained in 3-rd (see labeling above). For example X
>> labeled point is not a descendace of child node with centroid  C because
>> it must be in branch of 1-st quad of parent node. So, each node (except
>> root) will have a limitation in its quadrant. To transfer that
>> limitation the traversalValue is used.
> 
> Isn't this basically the same thing that the cube contrib module does? (Which 
> has the added benefit of kNN-capable operators).

Cube and postgres' own geometric types are indexed by building an R-tree. While 
that is one of the most popular solutions, it degrades almost to a linear search when 
the bounding boxes of the index nodes overlap a lot. That problem arises when one 
tries to index the streets of a city with a grid system — many streets' bounding 
boxes will simply coincide with the bounding box of the whole city.

But this patch uses a totally different strategy for building the index and does not 
suffer from the above problem.

> 
> If that's true then ISTM it'd be better to work on pulling cube's features 
> into box?
> 
> If it's not true, I'm still wondering if there's enough commonality here that 
> we should pull cube into core…

There is also the intarray module with a lot of functionality in common, but it too 
addresses a different data pattern.

Optimal indexing and kNN strategies are very sensitive to the data itself, and if 
we want some general multidimensional search routines, then I think none of the 
mentioned extensions is general enough. Cube is barely applicable when the number of 
dimensions goes above 10-20, intarray performs badly on data with big sets of possible 
coordinates, and this patch is likewise intended to help with a specific, niche problem.

As people tend to use machine learning and regression models more and more, it would 
be interesting to have some general n-dimensional indexing with kNN,
but I think that is a different problem and should be solved in a different way.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2PC support for pglogical

2016-03-24 Thread Stas Kelvich

> On 24 Mar 2016, at 17:03, Robert Haas  wrote:
> 
> On Wed, Mar 23, 2016 at 1:44 AM, Craig Ringer  wrote:
>> On 10 March 2016 at 22:50, Stas Kelvich  wrote:
>>> Hi.
>>> 
>>> Here is proof-of-concept version of two phase commit support for logical
>>> replication.
>> 
>> I missed this when you posted it, so sorry for the late response.
>> 
>> I've read through this but not tested it yet. I really appreciate you doing
>> it, it's been on my wishlist/backburner for ages.
>> 
>> On reading through the patch I noticed that there doesn't seem to be any
>> consideration of locking. The prepared xact can still hold strong locks on
>> catalogs. How's that handled? I think Robert's group locking stuff is what
>> we'll want here - for a prepared xact we can join the prepared xact as a
>> group lock member so we inherit its locks. Right now if you try DDL in a
>> prepared xact I suspect it'll break.
> 
> I have grave doubts about that approach.  It's not impossible it could
> be right, but I wouldn't bet the farm on it.
> 

I think all the locking is already handled properly by creating a dummy backend entry 
in PGPROC, as is done in the regular postgres 2PC implementation.

I just checked DDL with the following scenario:

* prepare tx that inserts a row in table on master
* execute DROP TABLE on pglogical slave
* commit prepared on master

Now it behaves as expected — the DROP TABLE on the slave blocks until the commit 
prepared happens on the master (see the SQL sketch below).
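
In SQL terms the scenario looks roughly like this (the table name and GID are just for 
illustration):

-- on the master (t is a replicated table):
begin;
insert into t values (1);
prepare transaction 'ddl_check';

-- on the pglogical slave, in a separate session:
drop table t;   -- blocks, since the replayed prepared xact still holds its lock on t

-- back on the master:
commit prepared 'ddl_check';   -- once this is replayed, the DROP TABLE on the slave proceeds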

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-03-30 Thread Stas Kelvich
> On Mar 29, 2016, at 6:04 PM, David Steele  wrote:
> 
> It looks like you should post a new patch or respond to Michael's comments.  Marked as "waiting on author".

Yep, here it is.

> On Mar 22, 2016, at 4:20 PM, Michael Paquier  wrote:
> 
> Looking at this patch….

Thanks.

> +++ b/src/test/recovery/t/006_twophase.pl
> @@ -0,0 +1,226 @@
> +# Checks for recovery_min_apply_delay
> +use strict;
> This description is wrong, this file has been copied from 005.

Yep, done.

> +my $node_master = get_new_node("Candie");
> +my $node_slave = get_new_node('Django');
> Please let's use node names that are more descriptive.

Hm, it’s hard to come up with descriptive names because the test changes the master/slave 
roles of those nodes several times during the test. It’s possible to call them “node1” and 
“node2”, but I’m not sure that improves anything. Anyway, I’m not insisting on those 
particular names and will agree to any reasonable suggestion.

> +# Switch to synchronous replication
> +$node_master->append_conf('postgresql.conf', qq(
> +synchronous_standby_names = '*'
> +));
> +$node_master->restart;
> Reloading would be fine.

Good catch, done.

> +   /* During replay that lock isn't really necessary, but let's take it anyway */
> +   LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
> +   for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
> +   {
> +       gxact = TwoPhaseState->prepXacts[i];
> +       proc = &ProcGlobal->allProcs[gxact->pgprocno];
> +       pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
> +
> +       if (TransactionIdEquals(xid, pgxact->xid))
> +       {
> +           gxact->locking_backend = MyBackendId;
> +           MyLockedGxact = gxact;
> +           break;
> +       }
> +   }
> +   LWLockRelease(TwoPhaseStateLock);
> Not taking ProcArrayLock here?

All accesses to the 2PC dummies in ProcArray are covered by TwoPhaseStateLock, so I think 
that’s safe. I’ve also deleted the comment above that block; it was probably more confusing 
than descriptive.

> The comment at the top of XlogReadTwoPhaseData is incorrect.

Yep, fixed.

> RecoverPreparedFromXLOG and RecoverPreparedFromFiles have a lot of code in common, 
> having this duplication is not good, and you could simplify your patch.

I reworked the patch to avoid duplicated code between 
RecoverPreparedFromXLOG/RecoverPreparedFromFiles and also between 
FinishPreparedTransaction/XlogRedoFinishPrepared.

twophase_replay.v3.diff
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Stas Kelvich
My +1 for moving the function to xlogutils.c too.

Right now the call to this function goes through a series of callbacks, so it is hard to 
find. Personally I found it only after I had implemented the same function myself 
(based on code in pg_xlogdump).

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 12 Jan 2016, at 18:56, Alvaro Herrera  wrote:
> 
> Michael Paquier wrote:
>> On Tue, Jan 12, 2016 at 5:21 PM, Simon Riggs  wrote:
>>> Should we just move the code somewhere just to imply it is generic? Seems
>>> pointless refactoring to me.
>> 
>> Er, why not xlogutils.c? Having the 2PC code depending directly on
>> something that is within logicalfuncs.c is weird.
> 
> Yes, I agree with Michael -- it's better to place code in its logical
> location than keep it somewhere else just because historically it was
> there.  That way, future coders can find the function more easily.
> 
> -- 
> Álvaro Herrera    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-01-26 Thread Stas Kelvich
Hi,

Thanks for reviews and commit!

  As Simon and Andres already mentioned in this thread, replay of a twophase 
transaction is significantly slower than the same operations in normal mode. 
The major reason is that each state file is fsynced during replay, and while that is 
not a problem for recovery, it is a problem for replication. Under a high 2PC 
update load the lag between master and async replica is constantly increasing (see 
graph below).

  One way to improve things is to move the fsyncs to restartpoints, but as we saw 
previously that is a half-measure, and frequent calls to fopen alone can become a 
bottleneck.

  The other option is to use the same scenario for replay that is already used in 
non-recovery mode: read state files into memory during replay of the prepare, and if a 
checkpoint/restartpoint occurs between prepare and commit, move the data to files. 
On commit we can read the xlog or the files. So here is a patch that implements this 
scenario for replay.

  The patch is quite straightforward. During replay of prepare records 
RecoverPreparedFromXLOG() is called to create the in-memory state in GXACT, PROC and 
PGPROC; on commit XlogRedoFinishPrepared() is called to clean up that state. 
There are also several functions (PrescanPreparedTransactions, 
StandbyTransactionIdIsPrepared) that assumed that during replay all 
prepared xacts have files in pg_twophase, so I have extended them to check 
GXACT too.
  A side effect of that behaviour is that prepared xacts are visible in the 
pg_prepared_xacts view on the slave.

While this patch touches quite a sensitive part of postgres replay and there are 
some rarely used code paths, I wrote a shell script to set up master/slave 
replication and test different failure scenarios that can happen with the 
instances. I'm attaching this file to show the test scenarios that I have tested and, 
more importantly, what I didn’t test. In particular I failed to 
reproduce a situation where StandbyTransactionIdIsPrepared() is called; maybe 
somebody can suggest a way to force its usage. Also I’m not too sure about the 
necessity of calling cache invalidation callbacks during 
XlogRedoFinishPrepared(); I’ve marked this place in the patch with a 2REVIEWER 
comment.

Tests show that this patch increases the speed of 2PC replay to the level where the 
replica can keep pace with the master.

Graph: replica lag under a pgbench run for 200 seconds with 2PC update 
transactions (80 connections, one update per 2PC tx, two servers with 12 cores 
each, 10GbE interconnect) on current master and with the suggested patch. Replica 
lag was measured with "select sent_location-replay_location as delay from 
pg_stat_replication;" every second.
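
For reference, each 2PC update transaction in that benchmark has roughly the following 
shape (table, key and GID here are only illustrative; the exact pgbench script isn't shown):

begin;
update t set v = v + 1 where id = 42;   -- one update per transaction
prepare transaction 'pgbench_0_42';     -- GID must be unique per in-flight transaction
commit prepared 'pgbench_0_42';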



twophase_replay.diff
Description: Binary data


check.sh
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

> On 12 Jan 2016, at 22:57, Simon Riggs  wrote:
> 
> On 12 January 2016 at 18:14, Andres Freund  wrote:
> Hi,
> 
> Thank you for the additional review.
>  
> On 2016-01-11 19:39:14 +, Simon Riggs wrote:
> > Currently, the patch reuses all of the code related to reading/write state
> > files, so it is the minimal patch that can implement the important things
> > for performance. The current patch succeeds in its goal to improve
> > performance, so I personally see no further need for code churn.
> 
> Sorry, I don't buy that argument. This approach leaves us with a bunch
> of code related to statefiles that's barely ever going to be exercised,
> and leaves the performance bottleneck on WAL replay in place.
> 
> I raised the issue of WAL replay performance before you were involved, as has 
> been mentioned already. I don't see it as a blocker for this patch. I have 
> already requested it from Stas and he has already agreed to write that.
> 
> Anyway, we know the statefile code works, so I'd prefer to keep it, rather 
> than write a whole load of new code that would almost certainly fail. 
> Whatever the code looks like, the frequency of usage is the same. As I 
> already said, you can submit a patch for the new way if you wish; the reality 
> is that this code works and there's no additional performance gain from doing 
> it a different way.
>  
> > As you suggest, we could also completely redesign the state file mechanism
> > and/or put it in WAL at checkpoint. That's all very nice but is much more
> > code and doesn't anything more for performance, since the current mainline
> > path writes ZERO files at checkpoint.
> 
> Well, on the primary, yes.
> 
> Your changes proposed earlier wouldn't change performance on the standby.
>  
> > If you want that for some other reason or refactoring, I won't stop
> > you, but its a separate patch for a separate purpose.
> 
> Maintainability/complexity very much has to be considered during review
> and

Re: [HACKERS] Speedup twophase transactions

2016-01-26 Thread Stas Kelvich
Agreed, I had the same idea in mind when writing that script.

I will migrate it to the TAP suite and write a review for Michael Paquier's patch.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 26 Jan 2016, at 20:20, Alvaro Herrera  wrote:
> 
> Stas Kelvich wrote:
> 
>> While this patch touches quite sensible part of postgres replay and there is 
>> some rarely used code paths, I wrote shell script to setup master/slave 
>> replication and test different failure scenarios that can happened with 
>> instances. Attaching this file to show test scenarios that I have tested and 
>> more importantly to show what I didn’t tested. Particularly I failed to 
>> reproduce situation where StandbyTransactionIdIsPrepared() is called, may be 
>> somebody can suggest way how to force it’s usage. Also I’m not too sure 
>> about necessity of calling cache invalidation callbacks during 
>> XlogRedoFinishPrepared(), I’ve marked this place in patch with 2REVIEWER 
>> comment.
> 
> I think this is the third thread in which I say this: We need to push
> Michael Paquier's recovery test framework, then convert your test script
> to that.  That way we can put your tests in core.
> 
> -- 
> Álvaro Herrera    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-27 Thread Stas Kelvich
Hi.

I tried that and can confirm the strange behaviour. It seems the problem is with the 
small cyrillic letter ‘х’. (simplest obscene language filter? =)

That can be reproduced with a simpler test.

Stas



test.c
Description: Binary data

 
> On 27 Jan 2016, at 13:59, Artur Zakirov  wrote:
> 
> On 27.01.2016 13:46, Shulgin, Oleksandr wrote:
>> 
>> Not sure why the file uses "SET KOI8-R" directive then?
>> 
> 
> This directive is used only by Hunspell program. PostgreSQL ignores this 
> directive and assumes that input affix and dictionary files in the UTF-8 
> encoding.
> 
>> 
>> 
>> What error message do you get with this test program?  (I don't get any,
>> but I'm not on Mac OS.)
>> --
>> Alex
>> 
>> 
> 
> With this program you will get wrong output. A error message is not called. 
> You can execute the following commands:
> 
> > cc test.c -o test
> > ./test
> 
> You will get the output:
> 
> SFX/Y/?/аться/шутся
> 
> Although the output should be:
> 
> SFX/Y/хаться/шутся/хаться
> 
> -- 
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-01-27 Thread Stas Kelvich
Hi

> On 22 Jan 2016, at 19:03, Tomas Vondra  wrote:
> OK, although I do recommend using more sensible variable names, i.e. why how 
> to use 'lexemes' instead of 'lexarr' for example? Similarly for the other 
> functions.


Changed. With the old names I tried to follow the conventions in the surrounding code, 
but it's probably a good idea to switch to more meaningful names in new code.

>> 
>> 
>> delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions 
>> of tsv_filter from tsin. When lexeme in tsv_filter has no positions function 
>> will delete any occurrence of same lexeme in tsin. When tsv_filter lexeme 
>> have positions function will delete them from positions of matching lexeme 
>> in tsin. If after such removal resulting positions set is empty then 
>> function will delete that lexeme from resulting tsvector.
>> 
> 
> I can't really imagine situation in which I'd need this, but if you do have a 
> use case for it ... although in the initial paragraph you say "... but if 
> somebody wants to delete for example ..." which suggests you may not have 
> such use case.
> 
> Based on bad experience with extending API based on vague ideas, I recommend 
> only really adding functions with existing need. It's easy to add a function 
> later, much more difficult to remove it or change the signature.

I tried to create a more or less self-contained API, e.g. to have the ability to negate 
the effect of concatenation. But I’ve also asked people around what they think 
about extending the API, and everybody was convinced that it is better to stick to 
the smaller API. So let’s drop it. At least those functions exist in the mailing list 
archives in case somebody googles for that kind of behaviour.

>> 
>> Also if we want some level of completeness of API and taking into account 
>> that concat() function shift positions on second argument I thought that it 
>> can be useful to also add function that can shift all positions of specific 
>> value. This helps to undo concatenation: delete one of concatenating 
>> tsvectors and then shift positions in resulting tsvector. So I also wrote 
>> one another small function:
>> 
>> shift(tsin tsvector,offset int16) — Shift all positions in tsin by given 
>> offset
> 
> That seems rather too low-level. Shouldn't it be really built into delete() 
> directly somehow?


I think that would be an ambiguous task on delete. But if we are dropping support for 
delete(tsvector, tsvector), I don’t see any point in keeping that function.

>>> 
>>> 7) Some of the functions use intexterm that does not match the function
>>>   name. I see two such cases - to_tsvector and setweight. Is there a
>>>   reason for that?
>>> 
>> 
>> Because sgml compiler wants unique indexterm. Both functions that
>> you mentioned use overloading of arguments and have non-unique name.
> 
> As Michael pointed out, that should probably be handled by using  
> and  tags.


Done.


> On 19 Jan 2016, at 00:21, Alvaro Herrera  wrote:
> 
> 
> It's a bit funny that you reintroduce the "unrecognized weight: %d"
> (instead of %c) in tsvector_setweight_by_filter.
> 


Ah, I was thinking about moving it to a separate diff and messed that up. Fixed, and 
I'm attaching a diff with the same fix for the old tsvector_setweight.




tsvector_ops-v2.1.diff
Description: Binary data


tsvector_ops-v2.2.diff
Description: Binary data



---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Stas Kelvich
Hi.

I’ve looked over the proposed patches and migrated the shell test scripts that I’ve 
used for testing twophase commits on master/slave to this test framework. 
Everything looks mature, and I didn’t encounter any problems writing new tests with 
this infrastructure.

From my point of view I don’t see any problem with committing these patches in their 
current state.

Also, some things that came to mind about the test suite:

0) There are several routines that do the actual checking, like 
is/command_ok/command_fails. I think it would be very handy to have wrappers 
psql_ok/psql_fails that call psql through command_ok/command_fails.

1) It would be better to raise a more meaningful error when IPC::Run is absent.

2) --enable-tap-tests deserves a mention in test/recovery/README, plus a more obvious 
error message when someone tries to run make check in test/recovery without 
--enable-tap-tests.

3) Would it be hard to make it possible to run TAP tests in extensions?

4) It would be handy if make check printed the path to the log files when a test fails.

5) psql() accepts a database name as its first argument, but everywhere in the tests it 
is ‘postgres’. Isn’t it simpler to store the dbname in the connstr and have a separate 
function to change the database?

6) Clean logs on prove restart? Clean up tmp installations?

7) make check sets PGPORT and PG_REGRESS for prove. Is that necessary?

> On 22 Jan 2016, at 09:17, Michael Paquier  wrote:
> 
> On Mon, Dec 21, 2015 at 4:45 PM, Michael Paquier
>  wrote:
>> As this thread is stalling a bit, please find attached a series of
>> patch gathering all the pending issues for this thread:
>> - 0001, fix config_default.pl for MSVC builds to take into account TAP tests
>> - 0002, append a node name in get_new_node (per Noah's request)
>> - 0003, the actual recovery test suite
>> Hopefully this facilitates future reviews.
> 
> Patch 2 has been pushed as c8642d9 (thanks Alvaro). The remaining two
> patches still apply and pass cleanly.
> -- 
> Michael
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch_extras extension

2016-02-17 Thread Stas Kelvich

> On 17 Feb 2016, at 08:14, Oleg Bartunov  wrote:
> 
> 
> 
> On Wed, Feb 17, 2016 at 6:57 AM, Tim Abbott  wrote:
> Just following up here since I haven't gotten a reply -- I'd love to work 
> with someone from the Postgres community on a plan to make the tsearch_extras 
> functionality available as part of mainline postgres.
> 
> 
> -Tim Abbott
> 
> On Wed, Feb 3, 2016 at 9:41 PM, Tim Abbott  wrote:
> Hello,
> 
> I'm a maintainer of the Zulip open source group chat application.  Zulip 
> depends on a small (~200 lines) postgres extension called tsearch_extras 
> (https://github.com/zbenjamin/tsearch_extras) that returns the actual 
> (offset, length) pairs of all the matches for a postgres full text search 
> query.  This extension allows Zulip to do its own highlighting of the full 
> text search matches, using a more complicated method than what Postgres 
> supports natively.  
> 
> I think tsearch_extras is probably of value to others using postgres 
> full-text search (and I'd certainly love to get out of the business of 
> maintaining an out-of-tree postgres extension), so I'm wondering if this 
> feature (or a variant of it) would be of interest to upstream?  
> 
> Thanks!
> 
> -Tim Abbott
> 
> (See 
> http://www.postgresql.org/message-id/flat/52c7186d.8010...@strangersgate.com#52c7186d.8010...@strangersgate.com
>  for the discussion on postgres mailing lists that caused us to write this 
> module in the first place.)
> 
> Tim,
> 
> take a look on this patch (https://commitfest.postgresql.org/7/385/) and 
> contact author.  It contains everything you need to your purposes.
> 
> btw, Stas, check on status "Returned with feedback" !
> 
> 
> Regards,
> Oleg 
> 

Hi,

Yes, this extension has some functionality in common with the patch. In particular, 
your tsvector_lexemes and my to_array do the same thing. I wasn’t aware of 
your patch when I started to work on this, so I think we can ask the committer to 
mention you in the commit message (if it gets committed).

And how do you use ts_match_locs_array? To highlight search results? There is a 
function called ts_headline that can mark matches with custom start/stop strings.
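
For example, something along these lines (the sample text and delimiters are mine):

select ts_headline('english',
                   'The quick brown fox jumps over the lazy dog',
                   to_tsquery('english', 'fox & dog'),
                   'StartSel=<<, StopSel=>>');
-- The quick brown <<fox>> jumps over the lazy <<dog>>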

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-02-17 Thread Stas Kelvich
> 
> On 02 Feb 2016, at 20:10, Teodor Sigaev  wrote:
> 
> Some notices:
> 
> 1 tsin in documentation doesn't look like a good name. Changed to vector 
> similar to other places.
> 
> 2 I did some editorization about freeing memory/forgotten names etc

Thanks.

> 
> 3 It seems to me that tsvector_unnest() could be seriously optimized for
> large tsvectors: with current coding it detoasts/decompresses tsvector value 
> on each call. Much better to do it once in
> multi_call_memory_ctx context at first call init

Done, I moved the detoasting to the first SRF call.

> 4 It seems debatable returning empty array for position/weight if they are 
> absent:
> =# select * from unnest('a:1 b'::tsvector);
> lexeme | positions | weights
> +---+-
> a  | {1}   | {D}
> b  | {}| {}
> I think, it's better to return NULL in this case
> 

Okay, done.

> 5 
> array_to_tsvector/tsvector_setweight_by_filter/tsvector_delete_arr/tsvector_filter
>  doesn't check or pay attention to NULL elements in input arrays
> 

Thanks! Fixed and added tests.

> -- 
> Teodor Sigaev   E-mail: teo...@sigaev.ru
>   WWW: http://www.sigaev.ru/
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



tsvector_ops-v4.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding restart problems

2016-08-25 Thread Stas Kelvich
> On 20 Aug 2016, at 15:59, Craig Ringer  wrote:
> 
> I'll wait for a test case or some more detail.

Thanks for the clarification about how restart_lsn works.

Digging slightly deeper into this topic revealed that the problem was in two-phase 
decoding, not in logical decoding itself.
While writing DecodePrepare() I copied the call to SnapBuildCommitTxn() from 
DecodeCommit(), which removes the current transaction from the running list, and 
that’s obviously the wrong thing to do for a prepared tx.

So I ended up with the following workaround:

--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -950,7 +950,7 @@ SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
  */
 void
 SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
-  int nsubxacts, TransactionId *subxacts)
+  int nsubxacts, TransactionId *subxacts, bool 
isCommit)
 {
int nxact;
 
@@ -1026,7 +1026,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, 
TransactionId xid,
 * Make sure toplevel txn is not tracked in running txn's anymore, 
switch
 * state to consistent if possible.
 */
-   SnapBuildEndTxn(builder, lsn, xid);
+   if (isCommit)
+   SnapBuildEndTxn(builder, lsn, xid);


I call SnapBuildCommitTxn with isCommit=true from commit and false from prepare. 
However, while I no longer observe partially decoded transactions, I’m not sure that 
this is the right way to build a proper snapshot and that something else isn’t broken.

Also, while playing with the psycopg2 logical decoding client I do see empty 
transactions containing only BEGIN/COMMIT (with the test_decoding output plugin, on 
current postgres master).
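
For anyone who wants to reproduce this without psycopg2, the SQL-level interface to the 
same decoding stream should show it too (slot name is arbitrary):

select pg_create_logical_replication_slot('test_slot', 'test_decoding');
-- ... generate some activity on the server ...
select data from pg_logical_slot_get_changes('test_slot', NULL, NULL);
-- an empty transaction shows up as a bare "BEGIN <xid>" line immediately followed by its "COMMIT <xid>"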


-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding restart problems

2016-08-31 Thread Stas Kelvich
> On 31 Aug 2016, at 03:28, Craig Ringer  wrote:
> 
> On 25 Aug. 2016 20:03, "Stas Kelvich"  wrote:
> >
> > Thanks for clarification about how restart_lsn is working.
> >
> > Digging slightly deeper into this topic revealed that problem was in two 
> > phase decoding, not it logical decoding itself.
> 
> Good to know. The behavior didn't make much sense for logical decoding but 
> bugs usually don't.
> 
> Do you plan to submit a patch for logical decoding of prepared transactions 
> to 10.0?
> 

I do, probably in CF2. There is an issue with locks that Andres pointed me to; 
also, twophase.c was written before logical decoding existed and it deserves some 
refactoring to avoid duplicating functionality between 2PC decoding and the restoring of 
old prepared xacts — I want to include that in the patch too.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-06 Thread Stas Kelvich
> On 06 Sep 2016, at 04:41, Michael Paquier  wrote:
> 
> On Sat, Sep 3, 2016 at 10:26 PM, Michael Paquier
>  wrote:
>> On Fri, Sep 2, 2016 at 5:06 AM, Simon Riggs  wrote:
>>> On 13 April 2016 at 15:31, Stas Kelvich  wrote:
>>> 
>>>> Fixed patch attached. There already was infrastructure to skip currently
>>>> held locks during replay of standby_redo() and I’ve extended that with 
>>>> check for
>>>> prepared xids.
>>> 
>>> Please confirm that everything still works on current HEAD for the new
>>> CF, so review can start.
>> 
>> The patch does not apply cleanly. Stas, could you rebase? I am
>> switching the patch to "waiting on author" for now.
> 
> So, I have just done the rebase myself and did an extra round of
> reviews of the patch. Here are couple of comments after this extra
> lookup.
> 
> LockGXactByXid() is aimed to be used only in recovery, so it seems
> adapted to be to add an assertion with RecoveryInProgress(). Using
> this routine in non-recovery code paths is risky because it assumes
> that a PGXACT could be missing, which is fine in recovery as prepared
> transactions could be moved to twophase files because of a checkpoint,
> but not in normal cases. We could also add an assertion of the kind
> gxact->locking_backend == InvalidBackendId before locking the PGXACT
> but as this routine is just normally used by the startup process (in
> non-standalone mode!) that's fine without.
> 
> The handling of invalidation messages and removal of relfilenodes done
> in FinishGXact can be grouped together, checking only once for
> !RecoveryInProgress().
> 
> + *
> + * The same procedure happens during replication and crash recovery.
>  *
> "during WAL replay" is more generic and applies here.
> 
> +
> +next_file:
> +   continue;
> +
> That can be removed and replaced by a "continue;".
> 
> +   /*
> +* At first check prepared tx that wasn't yet moved to disk.
> +*/
> +   LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
> +   for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
> +   {
> +   GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
> +   PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
> +
> +   if (TransactionIdEquals(pgxact->xid, xid))
> +   {
> +   LWLockRelease(TwoPhaseStateLock);
> +   return true;
> +   }
> +   }
> +   LWLockRelease(TwoPhaseStateLock);
> This overlaps with TwoPhaseGetGXact but I'd rather keep both things
> separated: it does not seem worth complicating the existing interface,
> and putting that in cache during recovery has no meaning.

Oh, I was preparing a new version of the patch after taking a fresh look at it. I 
probably should have said that in this thread. I’ve found a bug in subtransaction 
handling and am now working on a fix.

> 
> I have also reworked the test format, and fixed many typos and grammar
> problems in the patch as well as in the tests.

Thanks!

> 
> After review the result is attached. Perhaps a committer could get a look at 
> it?

I'll check it against my failure scenario with subtransactions and post the results 
or an updated patch here.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-06 Thread Stas Kelvich

> On 06 Sep 2016, at 12:09, Simon Riggs  wrote:
> 
> On 6 September 2016 at 09:58, Stas Kelvich  wrote:
>> 
>> I'll check it against my failure scenario with subtransactions and post 
>> results or updated patch here.
> 
> Make sure tests are added for that. It would have been better to say
> you knew there were bugs in it earlier.

I spotted that yesterday during the rebase. Sorry. Next time in the same situation 
I’ll write right away to save everyone’s time.

> On 06 Sep 2016, at 12:03, Michael Paquier  wrote:
> 
> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  wrote:
>> Oh, I was preparing new version of patch, after fresh look on it. Probably, 
>> I should
>> said that in this topic. I’ve found a bug in sub transaction handling and 
>> now working
>> on fix.
> 
> What's the problem actually?

The handling of the xids_p array in PrescanPreparedTransactions() is wrong for prepared 
txs in memory.

Also I want to double-check and add comments to the RecoveryInProgress() checks in 
FinishGXact.

I’ll post a reworked patch in a few days.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-07 Thread Stas Kelvich
> On 07 Sep 2016, at 03:09, Michael Paquier  wrote:
> 
>>> On 06 Sep 2016, at 12:03, Michael Paquier  wrote:
>>> 
>>> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  
>>> wrote:
>>>> Oh, I was preparing new version of patch, after fresh look on it. 
>>>> Probably, I should
>>>> said that in this topic. I’ve found a bug in sub transaction handling and 
>>>> now working
>>>> on fix.
>>> 
>>> What's the problem actually?
>> 
>> Handling of xids_p array in PrescanPreparedTransactions() is wrong for 
>> prepared tx's in memory.
>> Also I want to double-check and add comments to RecoveryInProgress() checks 
>> in FinishGXact.
>> 
>> I’ll post reworked patch in several days.
> 
> Could you use as a base the version I just sent yesterday then? I
> noticed many mistakes in the comments, for example many s/it's/its/
> and did a couple of adjustments around the code, the goto next_file
> was particularly ugly. That will be that much work not do to again
> later.

Yes. I've already merged my branch with your version.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Bug in two-phase transaction recovery

2016-09-07 Thread Stas Kelvich
Hello.

Some time ago the two-phase state file format was changed to have a variable-size GID,
but several places that read those files were not updated to use the new offsets. The 
problem exists in master and 9.6 and can be reproduced with prepared transactions 
containing savepoints. For example:

create table t(id int);
begin;
insert into t values (42);
savepoint s1;
insert into t values (43);
prepare transaction 'x';
begin;
insert into t values (142);
savepoint s1;
insert into t values (143);
prepare transaction 'y';

and restart the server. The startup process will hang. A fix is attached.

Also, while looking at StandbyRecoverPreparedTransactions() I’ve noticed that the 
buffer for the 2PC file is allocated in TopMemoryContext but never freed. That has 
probably been there for a long time.



gidlen_fixes.diff
Description: Binary data





standby_recover_pfree.diff
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Suggestions for first contribution?

2016-09-07 Thread Stas Kelvich
> On 05 Sep 2016, at 20:25, Christian Convey  wrote:
> 
> Hi guys,
> 
> Can anyone suggest a project for my first PG contribution?
> 
> My first two ideas didn't pan out:  Yury doesn't seem to need help
> with CMake, and the TODO list's "-Wcast-align" project (now deleted)
> appeared impractical.

There is also a list of projects for Google Summer of Code: 
https://wiki.postgresql.org/wiki/GSoC_2016

Those topics are expected to be doable by a newcomer within a few months. The list is 
slightly outdated, but you can always check the current state of things by searching 
this mailing list for the topic you're interested in.

Postgres also has several internal APIs, like fdw[1] or gist[2], that let you implement 
something useful without digging too deeply into postgres internals.

[1] https://www.postgresql.org/docs/current/static/fdwhandler.html
[2] https://www.postgresql.org/docs/current/static/gist-extensibility.html

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-16 Thread Stas Kelvich
> On 07 Sep 2016, at 11:07, Stas Kelvich  wrote:
> 
>> On 07 Sep 2016, at 03:09, Michael Paquier  wrote:
>> 
>>>> On 06 Sep 2016, at 12:03, Michael Paquier  
>>>> wrote:
>>>> 
>>>> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  
>>>> wrote:
>>>>> Oh, I was preparing new version of patch, after fresh look on it. 
>>>>> Probably, I should
>>>>> said that in this topic. I’ve found a bug in sub transaction handling and 
>>>>> now working
>>>>> on fix.
>>>> 
>>>> What's the problem actually?
>>> 
>>> Handling of xids_p array in PrescanPreparedTransactions() is wrong for 
>>> prepared tx's in memory.
>>> Also I want to double-check and add comments to RecoveryInProgress() checks 
>>> in FinishGXact.
>>> 
>>> I’ll post reworked patch in several days.
>> 
>> Could you use as a base the version I just sent yesterday then? I
>> noticed many mistakes in the comments, for example many s/it's/its/
>> and did a couple of adjustments around the code, the goto next_file
>> was particularly ugly. That will be that much work not do to again
>> later.
> 
> Yes. Already merged branch with your version.

Here is an updated version of the patch.

Looking through the old version I’ve noted a few things that were bothering me:

* In the case of crash replay, PREPARE redo accesses SUBTRANS, but StartupSUBTRANS() 
isn’t called yet in StartupXLOG().
* Several functions in twophase.c have to walk through both PGPROC and the 
pg_twophase directory.

Now I have slightly changed the order of things in StartupXLOG() so that the twophase 
state is loaded from file and StartupSUBTRANS is called before the actual recovery 
starts. So in all other functions we can be sure that the files have already been 
loaded into memory.

Also, since 2PC transactions are now dumped to files only on checkpoint, we can get rid 
of PrescanPreparedTransactions() — all the necessary info can reside in the checkpoint 
itself. I’ve changed the behaviour of the oldestActiveXid write in the checkpoint, and 
that’s probably a debatable topic, but ISTM it simplifies the recovery logic in both 
twophase.c and xlog.c a lot. More comments on that are in the patch itself.



twophase_replay.v7.patch
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-20 Thread Stas Kelvich
> RecoveryRestartPoint(). We would need as well to tweak
> PrescanPreparedTransactions accordingly than everything we are trying
> to do here.
> That would be way more simple than what's proposed here, and we'd
> still keep all the performance benefits.

So the file arrives at the replica through WAL, and instead of reading it
directly you suggest copying it to DSM if it is small enough, and to the
filesystem if not. Probably that will allow us to stick to just
reading/writing files, but:

* That will be measurably slower than the proposed approach because of
unnecessary copying between WAL and DSM. Not to mention prepare records of
arbitrary length.
* Different behaviour on replica and master — less tested code for the replica.
* Almost the same behaviour can be achieved by delaying the fsync of 2pc files
till checkpoint.

But if I reword your comment as saying that it is better to avoid replaying
the prepare record at all, like it happens now on master, then I see the
point. And that is possible even without DSM: we can allocate a statically
sized array storing some info about each tx, namely whether it is in the WAL
or in a file, its xid and gid. Some sort of PGXACT doppelganger used only for
replay purposes instead of the normal one.

So, taking my comments into account, what do you think? Should we keep the
current approach or try to avoid replaying prepare records at all?

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-21 Thread Stas Kelvich
> On 21 Sep 2016, at 10:32, Michael Paquier  wrote:
> 
> On Tue, Sep 20, 2016 at 11:13 PM, Stas Kelvich  
> wrote:
>> 
>> Putting that before actual WAL replay is just following historical order of 
>> events.
>> Prepared files are pieces of WAL that happened before checkpoint, so ISTM
>> there is no conceptual problem in restoring their state before replay.
> 
> For wal_level = minimal there is no need to call that to begin with..
> And I'd think that it is better to continue with things the same way.
> 

Hm, why?

>> 
>> With this patch we are reusing one infrastructure for normal work, recovery 
>> and replay.
> 
> The mix between normal work and recovery is the scary part. Normal
> work and recovery are entirely two different things.
> 

Okay, agreed. A commit procedure that checks whether recovery is active or not
is quite a hacky solution.

>> So taking into account my comments what do you think? Should we keep current 
>> approach
>> or try to avoid replaying prepare records at all?
> 
> I'd really like to give a try to the idea you just mentioned, aka
> delay the fsync of the 2PC files from RecreateTwoPhaseFile to
> CreateRestartPoint, or get things into memory.. I could write one, or
> both of those patches. I don't mind.

I'm not giving up yet, I'll write them :) I still have in mind several other
patches for 2pc handling in Postgres during this release cycle — logical
decoding and a partitioned hash table instead of the TwoPhaseState list.

My bet is that the relative speed of those patches will depend on the
filesystem used. As with the first patch in this mail thread, it is entirely
possible to sometimes hit filesystem limits on file creation speed. Otherwise
both approaches should be more or less equal, I suppose.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] assert violation in logical messages serialization

2016-09-27 Thread Stas Kelvich
Hello.

During processing of logical messages in ReorderBufferSerializeChange(), the
pointer to the ondisk structure isn't updated after a possible reorder buffer
realloc by ReorderBufferSerializeReserve().

The actual cause was spotted by Konstantin Knizhnik.
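
To show the bug class in miniature, here is a tiny standalone C example
(deliberately not the actual reorderbuffer code, just the same hazard): a
pointer taken into a buffer must be recomputed once the buffer may have been
reallocated.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Buffer
{
    char   *data;
    size_t  size;
} Buffer;

/*
 * Grow the buffer if needed.  Like ReorderBufferSerializeReserve(), this may
 * move the allocation, invalidating any pointer previously taken into it.
 */
static void
buffer_reserve(Buffer *buf, size_t needed)
{
    if (needed > buf->size)
    {
        buf->data = realloc(buf->data, needed);
        buf->size = needed;
    }
}

int
main(void)
{
    Buffer  buf = {malloc(64), 64};
    char   *ondisk = buf.data;          /* pointer into the buffer */

    buffer_reserve(&buf, 1024 * 1024);  /* may relocate buf.data */

    /* the fix boils down to recomputing the pointer before using it again */
    ondisk = buf.data;
    strcpy(ondisk, "header");

    printf("%s\n", ondisk);
    free(buf.data);
    return 0;
}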



reorderbuffer.patch
Description: Binary data


-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with logical replication

2017-10-09 Thread Stas Kelvich

> On 2 Oct 2017, at 19:59, Petr Jelinek  wrote:
> 
>> 
>> Program terminated with signal SIGABRT, Aborted.
>> #0  0x7f3608f8ec37 in __GI_raise (sig=sig@entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> 56../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>> (gdb) bt
>> #0  0x7f3608f8ec37 in __GI_raise (sig=sig@entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> #1  0x7f3608f92028 in __GI_abort () at abort.c:89
>> #2  0x009f5740 in ExceptionalCondition (conditionName=0xbf6b30
>> "!(((xid) != ((TransactionId) 0)))",
>> errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c",
>> lineNumber=582) at assert.c:54
>> #3  0x0086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0,
>> oper=XLTW_None) at lmgr.c:582
>> #4  0x0081c9f7 in SnapBuildWaitSnapshot (running=0x2749198,
>> cutoff=898498) at snapbuild.c:1400
>> #5  0x0081c7c7 in SnapBuildFindSnapshot (builder=0x2807910,
>> lsn=798477760, running=0x2749198) at snapbuild.c:1311
>> #6  0x0081c339 in SnapBuildProcessRunningXacts
>> (builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106
> 
> 
> Hmm this would suggest that XLOG_RUNNING_XACTS record contains invalid
> transaction ids but we specifically skip those in
> GetRunningTransactionData(). Can you check pg_waldump output for that
> record (lsn is shown in the backtrace)?

I investigated this case and it seems that XactLockTableWait() in
SnapBuildWaitSnapshot() doesn't always work as expected. XactLockTableWait()
waits on LockAcquire() for the xid to complete, and if we finally get this
lock while the transaction is still in progress, then such an xid is assumed
to be a subxid. However, the LockAcquire/LockRelease cycle can happen after
the transaction has set its xid, but before XactLockTableInsert().

Namely, the following history happened for xid 5225 and led to the crash:

[backend] LOG:  AssignTransactionId: XactTopTransactionId = 5225
   [walsender] LOG:  LogCurrentRunningXacts: Wrote RUNNING_XACTS xctn=1, 
xid[0]=5225
   [walsender] LOG:  XactLockTableWait: LockAcquire 5225
   [walsender] LOG:  XactLockTableWait: LockRelease 5225
[backend] LOG:  AssignTransactionId: LockAcquire ExclusiveLock 5225
   [walsender] LOG:  TransactionIdIsInProgress: SVC->latestCompletedXid=5224 < 
xid=5225 => true
[backend] LOG:  CommitTransaction: ProcArrayEndTransaction xid=5225, ipw=0
[backend] LOG:  CommitTransaction: ResourceOwnerRelease locks xid=5225
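
Purely to illustrate that interleaving, here is a tiny standalone analogue in
plain pthreads (nothing PostgreSQL-specific, and the synchronisation is
deliberately simplistic; the sleeps just make the window easy to hit). The
waiter acquires and releases the "xact lock" in the gap between xid
assignment and lock insertion, and wrongly concludes the transaction has
finished.

#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t xact_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile bool xid_published = false;   /* xid visible in RUNNING_XACTS */
static volatile bool xact_finished = false;

static void *
backend(void *arg)
{
    xid_published = true;       /* AssignTransactionId: xid becomes visible */
    usleep(100 * 1000);         /* window before XactLockTableInsert()      */
    pthread_mutex_lock(&xact_lock);
    usleep(100 * 1000);         /* the transaction does its work            */
    xact_finished = true;
    pthread_mutex_unlock(&xact_lock);
    return NULL;
}

static void *
waiter(void *arg)
{
    while (!xid_published)
        ;                       /* walsender sees the xid as running        */

    /* XactLockTableWait(): grab and release the xact lock */
    pthread_mutex_lock(&xact_lock);
    pthread_mutex_unlock(&xact_lock);

    if (!xact_finished)
        printf("got the lock, but the xact is still running: "
               "it would wrongly be treated as a subxact\n");
    else
        printf("xact already finished, as the code expects\n");
    return NULL;
}

int
main(void)
{
    pthread_t   b, w;

    pthread_create(&b, NULL, backend, NULL);
    pthread_create(&w, NULL, waiter, NULL);
    pthread_join(b, NULL);
    pthread_join(w, NULL);
    return 0;
}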

I've quickly checked other usages of XactLockTableWait() and it seems that it
is mostly used with xids taken from the heap, so the tx has definitely set its
lock somewhere in the past.

I'm not sure what the best approach to handle that is. Maybe write out
running xacts only if they have already set their lock?

I'm also attaching a pgbench script that can reproduce the failure. I run it
over the usual pgbench database with 10 threads. It takes several minutes to
crash.



reload.pgb
Description: Binary data

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: About CMake v2

2016-10-03 Thread Stas Kelvich
> On 17 Sep 2016, at 20:21, Yury Zhuravlev  wrote:
> 
> Michael Paquier wrote:
>> On Sat, Sep 17, 2016 at 1:40 AM, Yury Zhuravlev
>>  wrote:
>>> Michael Paquier wrote:
>>> I merged master to my branch and I spent time to porting all changes. I hope
>>> send patch in the weekend without terrible flaws.
>> 
>> By the way, I noticed that you did not register this patch in the current 
>> CF..
> 
> Now, I published the first version of the patch. This patch not ready for 
> commit yet and all current task you can read here:
> https://github.com/stalkerg/postgres_cmake/issues
> 
> I hope we realy close to end. 
> In this patch I forbade in-source build for avoid overwriting current 
> Makefiles. We will can remove all Makefiles only after shall see in CMake. 
> You don't need support two system. During commitfests CMake build system will 
> be supported by me.  
> I need help with buildfarm because my knowledge of Perl is very bad (thought 
> about rewrite buildfarm to Python). 
> 
> I hope for your support.


I tried to generate an Xcode project out of CMake; the build fails on
genbki.pl: can't locate Catalog.pm (which itself lives in
src/backend/catalog/Catalog.pm).

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Stats sender and 2pc minor problem

2016-10-13 Thread Stas Kelvich
Hello.

The statistics sender logic for ordinary commit and for two-phase commit does
not strictly match, and that leads to delta_live_tuples being added to
n_live_tup in the case of a TRUNCATE inside a two-phase commit.

That can be seen in the following example:

CREATE TABLE trunc_stats_test5(id serial);
INSERT INTO trunc_stats_test5 DEFAULT VALUES;
INSERT INTO trunc_stats_test5 DEFAULT VALUES;
INSERT INTO trunc_stats_test5 DEFAULT VALUES;
BEGIN;
TRUNCATE trunc_stats_test5;
PREPARE TRANSACTION 'twophase_stats';
COMMIT PREPARED 'twophase_stats';

After that, pg_stat_user_tables will have n_live_tup = 3 instead of 0.

A fix along with a test is attached.



2pc-stats.patch
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-01 Thread Stas Kelvich

> On 2 Mar 2017, at 01:20, Petr Jelinek  wrote:
> 
> The info gets removed on COMMIT PREPARED but at that point
> there is no real difference between replicating it as 2pc or 1pc since
> the 2pc behavior is for all intents and purposes lost at that point.
> 

If we are doing 2pc and COMMIT PREPARED happens, then we should replicate it
without the transaction body to the receiving servers, since the tx is already
prepared on them with some GID. So we need a way to construct that GID.

It seems that for the last ~10 messages I've been failing to explain some
points about this topic. Or, maybe, I'm failing to understand some points. Can
we maybe set up a Skype call to discuss this and post a summary here? Craig?
Peter?

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-02 Thread Stas Kelvich

> On 2 Mar 2017, at 11:00, Craig Ringer  wrote:
> 
> We already have it, because we just decoded the PREPARE TRANSACTION.
> I'm preparing a patch revision to demonstrate this.

Yes, we already have it, but if the server reboots between the commit
prepared (when all prepared state is gone) and the decoding of that commit
prepared, then we lose that mapping, don't we?

> BTW, I've been reviewing the patch in more detail. Other than a bunch
> of copy-and-paste that I'm cleaning up, the main issue I've found is
> that in DecodePrepare, you call:
> 
>SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
>   parsed->nsubxacts, parsed->subxacts);
> 
> but I am not convinced it is correct to call it at PREPARE TRANSACTION
> time, only at COMMIT PREPARED time. We want to see the 2pc prepared
> xact's state when decoding it, but there might be later commits that
> cannot yet see that state and shouldn't have it visible in their
> snapshots. 

Agreed, that is a problem. That allows us to decode this PREPARE, but after
that it is better to mark this transaction as running in the snapshot, or to
perform the prepare decoding with some kind of copied-and-edited snapshot.
I'll have a look at this.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-16 Thread Stas Kelvich

> On 16 Mar 2017, at 14:44, Craig Ringer  wrote:
> 
> I'm going to try to pick this patch up and amend its interface per our
> discussion earlier, see if I can get it committable.

I'm working right now on the issue with building snapshots for decoding
prepared transactions. I hope I'll send an updated patch later today.

> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-16 Thread Stas Kelvich

>> On 2 Mar 2017, at 11:00, Craig Ringer  wrote:
>> 
>> BTW, I've been reviewing the patch in more detail. Other than a bunch
>> of copy-and-paste that I'm cleaning up, the main issue I've found is
>> that in DecodePrepare, you call:
>> 
>>   SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
>>  parsed->nsubxacts, parsed->subxacts);
>> 
>> but I am not convinced it is correct to call it at PREPARE TRANSACTION
>> time, only at COMMIT PREPARED time. We want to see the 2pc prepared
>> xact's state when decoding it, but there might be later commits that
>> cannot yet see that state and shouldn't have it visible in their
>> snapshots. 
> 
> Agree, that is problem. That allows to decode this PREPARE, but after that
> it is better to mark this transaction as running in snapshot or perform 
> prepare
> decoding with some kind of copied-end-edited snapshot. I’ll have a look at 
> this.
> 

While working on this I've spotted quite a nasty corner case with aborted
prepared transactions. I have some not-that-great ideas on how to fix it, but
maybe I've blurred my view and missed something. So I want to ask here first.

Suppose we created a table, then in a 2pc tx we alter it and after that abort
the tx. So pg_class will have something like this:

xmin | xmax | relname
 100 |  200 | mytable
 200 |    0 | mytable

After the abort, the tuple (100, 200, mytable) becomes visible again, and if
we alter the table once more then the xmax of the first tuple will be set to
the current xid, resulting in the following table:

xmin | xmax | relname
 100 |  300 | mytable
 200 |    0 | mytable
 300 |    0 | mytable

At that moment we've lost the information that the first tuple was deleted by
our prepared tx. From the POV of the historic snapshot that will be
constructed to decode the prepare, the first tuple is visible, while actually
the second tuple should be used. Moreover, such a snapshot can see both
tuples, violating oid uniqueness, but the heap scan stops after finding the
first one.
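
To make that concrete, here is a toy standalone model of those three tuple
versions (the visibility rule is deliberately simplified and is not the real
HeapTupleSatisfies* logic); it also applies the xmax > snapshot->xmax filter
from the first workaround listed below.

#include <stdio.h>
#include <stdbool.h>

typedef unsigned int Xid;

typedef struct Tuple
{
    Xid     xmin;
    Xid     xmax;       /* 0 means "not deleted" */
} Tuple;

static const Xid snap_xmax = 200;   /* historic snapshot built for tx 200 */

static bool
committed_for_snapshot(Xid xid)
{
    return xid == 100 || xid == 200;    /* 300 is still in the future */
}

/* simplified rule: inserted by a committed tx and not deleted by one */
static bool
visible(const Tuple *t)
{
    return committed_for_snapshot(t->xmin) &&
           (t->xmax == 0 || !committed_for_snapshot(t->xmax));
}

int
main(void)
{
    Tuple   pg_class[] = {
        {100, 300},     /* was deleted by our tx 200, but xmax got overwritten */
        {200, 0},       /* the version our prepared tx actually created        */
        {300, 0},       /* the version created after the abort                 */
    };
    int     nvisible = 0;
    int     i;

    for (i = 0; i < 3; i++)
        if (visible(&pg_class[i]))
        {
            nvisible++;
            printf("tuple %d (xmin=%u, xmax=%u) is visible\n",
                   i, pg_class[i].xmin, pg_class[i].xmax);
        }
    printf("%d visible versions of the same relation\n", nvisible);

    /* first-pass filter: skip tuples whose xmax is beyond the snapshot,
     * since they may have been deleted by the tx we are decoding */
    for (i = 0; i < 3; i++)
        if (visible(&pg_class[i]) &&
            !(pg_class[i].xmax != 0 && pg_class[i].xmax > snap_xmax))
            printf("first pass picks tuple %d (xmin=%u)\n",
                   i, pg_class[i].xmin);
    return 0;
}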

I see here two possible workarounds:

* First try to scan the catalog filtering out tuples whose xmax is bigger
than snapshot->xmax, as they were possibly deleted by our tx. Then, if nothing
is found, scan in the usual way.

* Do not decode such a transaction at all. If by the time we decode the
prepare record we already know that it was aborted, then such decoding doesn't
make a lot of sense. IMO the intended usage of logical 2pc decoding is to
decide about commit/abort based on the answers from logical
subscribers/replicas. So there will be a barrier between prepare and
commit/abort, and such situations shouldn't happen.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Stas Kelvich

> On 20 Mar 2017, at 11:32, Craig Ringer  wrote:
> 
> On 19 March 2017 at 21:26, Petr Jelinek  wrote:
> 
>> I think only genam would need changes to do two-phase scan for this as
>> the catalog scans should ultimately go there. It's going to slow down
>> things but we could limit the impact by doing the two-phase scan only
>> when historical snapshot is in use and the tx being decoded changed
>> catalogs (we already have global knowledge of the first one, and it
>> would be trivial to add the second one as we have local knowledge of
>> that as well).
> 
> 
> TBH, I have no idea how to approach the genam changes for the proposed
> double-scan method. It sounds like Stas has some idea how to proceed
> though (right?)
> 

I thought about having a special field (or reusing one of the existing
fields) in the snapshot struct to force filtering on xmax > snap->xmax or
xmin = snap->xmin, as Petr suggested. Then this logic can reside in
ReorderBufferCommit(). However, this does not solve the problem with the
catcache, so I'm looking into that right now.


> On 17 Mar 2017, at 05:38, Craig Ringer  wrote:
> 
> On 16 March 2017 at 19:52, Stas Kelvich  wrote:
> 
>> 
>> I’m working right now on issue with building snapshots for decoding prepared 
>> tx.
>> I hope I'll send updated patch later today.
> 
> 
> Great.
> 
> What approach are you taking?


Just as before, I mark this transaction as committed in the snapshot builder,
but after decoding I delete this transaction from xip (which holds committed
transactions in the case of a historic snapshot).


> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Stas Kelvich

> On 20 Mar 2017, at 15:17, Craig Ringer  wrote:
> 
>> I thought about having special field (or reusing one of the existing fields)
>> in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin
>> as Petr suggested. Then this logic can reside in ReorderBufferCommit().
>> However this is not solving problem with catcache, so I'm looking into it 
>> right now.
> 
> OK, so this is only an issue if we have xacts that change the schema
> of tables and also insert/update/delete to their heaps. Right?
> 
> So, given that this is CF3 for Pg10, should we take a step back and
> impose the limitation that we can decode 2PC with schema changes or
> data row changes, but not both?

Yep, time is tight. I'll try today/tomorrow to proceed with this two-scan
approach. If I fail to do that in that time, then I'll just update this patch
to decode only non-DDL 2pc transactions, as you suggested.

>> Just as before I marking this transaction committed in snapbuilder, but after
>> decoding I delete this transaction from xip (which holds committed 
>> transactions
>> in case of historic snapshot).
> 
> That seems kind of hacky TBH. I didn't much like marking it as
> committed then un-committing it.
> 
> I think it's mostly an interface issue though. I'd rather say
> SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or
> something, to make it clear what we're doing.

Yes, that will be less confusing. However, there is no queue of any kind, so
SnapBuildStartPrepare / SnapBuildFinishPrepare should work too.

> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Stas Kelvich

> On 20 Mar 2017, at 16:39, Craig Ringer  wrote:
> 
> On 20 March 2017 at 20:57, Stas Kelvich  wrote:
>> 
>>> On 20 Mar 2017, at 15:17, Craig Ringer  wrote:
>>> 
>>>> I thought about having special field (or reusing one of the existing 
>>>> fields)
>>>> in snapshot struct to force filtering xmax > snap->xmax or xmin = 
>>>> snap->xmin
>>>> as Petr suggested. Then this logic can reside in ReorderBufferCommit().
>>>> However this is not solving problem with catcache, so I'm looking into it 
>>>> right now.
>>> 
>>> OK, so this is only an issue if we have xacts that change the schema
>>> of tables and also insert/update/delete to their heaps. Right?
>>> 
>>> So, given that this is CF3 for Pg10, should we take a step back and
>>> impose the limitation that we can decode 2PC with schema changes or
>>> data row changes, but not both?
>> 
>> Yep, time is tight. I’ll try today/tomorrow to proceed with this two scan 
>> approach.
>> If I’ll fail to do that during this time then I’ll just update this patch to 
>> decode
>> only non-ddl 2pc transactions as you suggested.
> 
> I wasn't suggesting not decoding them, but giving the plugin the
> option of whether to proceed with decoding or not.
> 
> As Simon said, have a pre-decode-prepared callback that lets the
> plugin get a lock on the 2pc xact if it wants, or say it doesn't want
> to decode it until it commits.
> 
> That'd be useful anyway, so we can filter and only do decoding at
> prepare transaction time of xacts the downstream wants to know about
> before they commit.

Ah, got that. Okay.

> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-03-23 Thread Stas Kelvich

> On 23 Mar 2017, at 15:53, Craig Ringer  wrote:
> 
> On 23 March 2017 at 19:33, Alexey Kondratov  
> wrote:
> 
>> (1) Add errors handling to COPY as a minimum program
> 
> Huge +1 if you can do it in an efficient way.
> 
> I think the main barrier to doing so is that the naïve approach
> creates a subtransaction for every row, which is pretty dire in
> performance terms and burns transaction IDs very rapidly.
> 
> Most of our datatype I/O functions, etc, have no facility for being
> invoked in a mode where they fail nicely and clean up after
> themselves. We rely on unwinding the subtransaction's memory context
> for error handling, for releasing any LWLocks that were taken, etc.
> There's no try_timestamptz_in function or anything, just
> timestamptz_in, and it ERROR's if it doesn't like its input. You
> cannot safely PG_TRY / PG_CATCH such an exception and continue
> processing to, say, write another row.
> 
> Currently we also don't have a way to differentiate between
> 
> * "this row is structurally invalid" (wrong number of columns, etc)
> * "this row is structually valid but has fields we could not parse
> into their data types"
> * "this row looks structurally valid and has data types we could
> parse, but does not satisfy a constraint on the destination table"
> 
> Nor do we have a way to write to any kind of failure-log table in the
> database, since a simple approach relies on aborting subtransactions
> to clean up failed inserts so it can't write anything for failed rows.
> Not without starting a 2nd subxact to record the failure, anyway.

If we are optimising COPY for the case with a small number of bad rows, then
a 2nd subtransaction doesn't seem like a bad idea. We can try to apply a batch
in a subtx; if it fails on some row N, then insert rows [1, N) in the next
subtx, report an error, and commit that subtx. Row N+1 can be treated as the
beginning of the next batch. A sketch of that control flow is below.
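
Something like this, as a standalone toy (begin/commit/rollback and the
per-row insert are just stand-in stubs here, not backend code):

#include <stdio.h>
#include <stdbool.h>

#define BATCH_SIZE 4

/* hypothetical stand-ins: in the backend these would be subtransaction
 * begin/commit/rollback and the per-row insertion path */
static void begin_subtx(void)    { printf("BEGIN subtx\n"); }
static void commit_subtx(void)   { printf("COMMIT subtx\n"); }
static void rollback_subtx(void) { printf("ROLLBACK subtx\n"); }

/* pretend that every negative value is a bad row */
static bool apply_row(int row)   { return row >= 0; }

static void
copy_with_batched_subtx(const int *rows, int nrows)
{
    int     start = 0;

    while (start < nrows)
    {
        int     end = (start + BATCH_SIZE < nrows) ? start + BATCH_SIZE : nrows;
        int     failed = -1;
        int     i;

        begin_subtx();
        for (i = start; i < end; i++)
            if (!apply_row(rows[i]))
            {
                failed = i;
                break;
            }

        if (failed < 0)
        {
            commit_subtx();             /* whole batch went in */
            start = end;
        }
        else
        {
            /* redo the known-good prefix [start, failed) in a fresh subtx,
             * report the bad row, then start the next batch after it */
            rollback_subtx();
            begin_subtx();
            for (i = start; i < failed; i++)
                apply_row(rows[i]);
            commit_subtx();
            printf("reported bad row at position %d\n", failed);
            start = failed + 1;
        }
    }
}

int
main(void)
{
    int     rows[] = {1, 2, -1, 3, 4, 5, 6, -2, 7};

    copy_with_batched_subtx(rows, 9);
    return 0;
}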


But if there turn out to be problems with handling everything via
subtransactions, and since parallelism is mentioned anyway, what about
starting a bgworker that will perform the data insertion and will be
controlled by the backend?

For example, the backend can do the following:

* Start a bgworker (or even a parallel worker).
* Get a chunk of rows out of the file and try to apply them as a batch in a
subtransaction in the bgworker.
* If that fails, then we can open a transaction in the backend itself and
raise a notice / move the failed rows to a special errors table.

> So, having said why it's hard, I don't really have much for you in
> terms of suggestions for ways forward. User-defined data types,
> user-defined constraints and triggers, etc mean anything involving
> significant interface changes will be a hard sell, especially in
> something pretty performance-sensitive like COPY.
> 
> I guess it'd be worth setting out your goals first. Do you want to
> handle all the kinds of problems above? Malformed  rows, rows with
> malformed field values, and rows that fail to satisfy a constraint? or
> just some subset?
> 
> 
> 
> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Stas Kelvich

> On 27 Mar 2017, at 12:26, Craig Ringer  wrote:
> 
> On 27 March 2017 at 09:31, Craig Ringer  wrote:
> 
>> We're in the last week of the CF. If you have a patch that's nearly
>> ready or getting there, now would be a good time to post it for help
>> and input from others.
>> 
>> I would really like to get this in, but we're running out of time.
>> 
>> Even if you just post your snapshot management work, with the cosmetic
>> changes discussed above, that would be a valuable start.
> 
> I'm going to pick up the last patch and:

I heavily underestimated the amount of changes there, but I've almost
finished and will send an updated patch in several hours.

> * Ensure we only add the GID to xact records for 2pc commits and aborts

And only when wal_level >= logical. Done.
The patch also adds origin info to prepares and aborts.

> * Add separate callbacks for prepare, abort prepared, and commit
> prepared (of xacts already processed during prepare), so we aren't
> overloading the "commit" callback and don't have to create fake empty
> transactions to pass to the commit callback;

Done.

> * Add another callback to determine whether an xact should be
> processed at PREPARE TRANSACTION or COMMIT PREPARED time.

Also done.

> * Rename the snapshot builder faux-commit stuff in the current patch
> so it's clearer what's going on.

Hm. Okay, I'll leave that part to you.

> * Write tests covering DDL, abort-during-decode, etc

I've extended the tests, but it would be good to have some more.

> Some special care is needed for the callback that decides whether to
> process a given xact as 2PC or not. It's called before PREPARE
> TRANSACTION to decide whether to decode any given xact at prepare time
> or wait until it commits. It's called again at COMMIT PREPARED time if
> we crashed after we processed PREPARE TRANSACTION and advanced our
> confirmed_flush_lsn such that we won't re-process the PREPARE
> TRANSACTION again. Our restart_lsn might've advanced past it so we
> never even decode it, so we can't rely on seeing it at all. It has
> access to the xid, gid and invalidations, all of which we have at both
> prepare and commit time, to make its decision from. It must have the
> same result at prepare and commit time for any given xact. We can
> probably use a cache in the reorder buffer to avoid the 2nd call on
> commit prepared if we haven't crashed/reconnected between the two.

Good point. I didn't think about restart_lsn in the case when we are skipping
this particular prepare (filter_prepared() -> true, in my terms). I think that
should work properly, as it uses the same code path as before, but I'll look
at it.

> This proposal does not provide a way to safely decode a 2pc xact that
> made catalog changes which may be aborted while being decoded. The
> plugin must lock such an xact so that it can't be aborted while being
> processed, or defer decoding until commit prepared. It can use the
> invalidations for the commit to decide.

I had played with that two-pass catalog scan and it seemed to be working, but
after some time I realised that it is not useful for the main case, when the
commit/abort is generated after the receiver side has answered the prepare.
Also, that two-pass scan is a massive change to relcache.c and genam.c (FWIW
there were no problems with the cache, but there were some problems with index
scans and handling one-to-many queries to the catalog, e.g. a table with its
fields).

Finally I decided to throw it away and switched to the filter_prepare
callback, passing the txn structure there to allow access to the
has_catalog_changes field.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Stas Kelvich

> On 27 Mar 2017, at 16:29, Craig Ringer  wrote:
> 
> On 27 March 2017 at 17:53, Stas Kelvich  wrote:
> 
>> I’m heavily underestimated amount of changes there, but almost finished
>> and will send updated patch in several hours.
> 
> Oh, brilliant! Please post whatever you have before you knock off for
> the day anyway, even if it's just a WIP, so I can pick it up tomorrow
> my time and poke at its tests etc.
> 

Ok, here it is.

Major differences compared to the previous version:

* GID is stored in commit/abort records only when wal_level >= logical.

* More consistency in storing and parsing origin info. Now it is stored in
prepare and abort records when a replication origin session is active.

* Some cleanup and function renames to get rid of the xact_even/gid fields in
ReorderBuffer, which I used only to copy them into ReorderBufferTXN.

* Changed the output plugin interface to the one that was suggested upthread.
Now prepare/CP/AP are separate callbacks, and if none of them is set then a
2pc tx will be decoded as 1pc to provide backward compatibility.

* New callback filter_prepare() that can be used to switch between the
1pc/2pc styles of decoding a 2pc tx.

* test_decoding uses the new API and filters out aborted and still-running
prepared txs. It is actually easy to move the unlock of the 2PC state to the
prepare callback to allow decoding of running txs, but since that extension is
an example, ISTM it is better not to hold that lock there during the whole
prepare decoding. However, I left enough information there about this and
about the case when those locks are not needed at all (when we are the ones
coordinating this tx).
  Talking about locking a running prepared tx during decoding, I think a
better solution would be to use our own custom lock here and register an
XACT_EVENT_PRE_ABORT callback in the extension to conflict with this lock.
Decoding should hold it in shared mode, while commit holds it in exclusive
mode. That would allow locking things granularly and blocking only the tx that
is being decoded (a tiny standalone sketch of that shared/exclusive idea
follows after this list).
  However, we don't have XACT_EVENT_PRE_ABORT yet, but it is only a few LOCs
to add it. Should I?

* It actually doesn't pass one of my regression tests. I've added the
expected output as it should be. I'll try to send a follow-up message with a
fix, but right now I'm sending it as is, as you asked.
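
For what it's worth, here is a tiny standalone sketch of that
shared/exclusive idea in plain pthreads (not backend code): the decoder takes
the lock in shared mode, so a concurrent COMMIT/ROLLBACK PREPARED, which
needs it exclusively, waits only while that particular tx is being decoded.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t decode_lock = PTHREAD_RWLOCK_INITIALIZER;

static void *
decoder(void *arg)
{
    pthread_rwlock_rdlock(&decode_lock);    /* shared: decoding the PREPARE */
    printf("decoder: streaming the prepared tx...\n");
    sleep(1);
    printf("decoder: done\n");
    pthread_rwlock_unlock(&decode_lock);
    return NULL;
}

static void *
committer(void *arg)
{
    usleep(100 * 1000);                     /* COMMIT PREPARED arrives mid-decode */
    printf("committer: waiting for the decoder\n");
    pthread_rwlock_wrlock(&decode_lock);    /* exclusive: blocks until decode ends */
    printf("committer: COMMIT PREPARED proceeds\n");
    pthread_rwlock_unlock(&decode_lock);
    return NULL;
}

int
main(void)
{
    pthread_t   d, c;

    pthread_create(&d, NULL, decoder, NULL);
    pthread_create(&c, NULL, committer, NULL);
    pthread_join(d, NULL);
    pthread_join(c, NULL);
    return 0;
}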




logical_twophase.diff
Description: Binary data




Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Stas Kelvich

> On 28 Mar 2017, at 00:19, Stas Kelvich  wrote:
> 
> * It is actually doesn’t pass one of mine regression tests. I’ve added 
> expected output
> as it should be. I’ll try to send follow up message with fix, but right now 
> sending it
> as is, as you asked.
> 
> 

Fixed. I forgot to postpone the ReorderBufferTXN cleanup in the case of
prepare.

So it passes the provided regression tests right now.

I'll give it more testing tomorrow and am going to write a TAP test to check
the behaviour when we lose the info about whether the prepare was sent to the
subscriber or not.




logical_twophase.diff
Description: Binary data



Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Stas Kelvich

> On 28 Mar 2017, at 00:25, Andres Freund  wrote:
> 
> Hi,
> 
> On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
>> Ok, here it is.
> 
> On a very quick skim, this doesn't seem to solve the issues around
> deadlocks of prepared transactions vs. catalog tables.  What if the
> prepared transaction contains something like LOCK pg_class; (there's a
> lot more realistic examples)? Then decoding won't be able to continue,
> until that transaction is committed / aborted?

But why is that a deadlock? It seems like just a lock.

In the case of a prepared lock on pg_class, decoding will wait until it is
committed and then continue to decode. The same goes for anything in Postgres
that accesses pg_class, up to the inability to connect to the database and
bricking the database if you accidentally disconnect before committing that tx
(as you showed me a while ago :-).

IMO it is an issue of being able to prepare such a lock, rather than an issue
of decoding.

Are there any other scenarios where catalog readers are blocked, apart from an
explicit lock on a catalog table? Alters on catalogs seem to be prohibited.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-29 Thread Stas Kelvich

> On 28 Mar 2017, at 18:08, Andres Freund  wrote:
> 
> On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
>> 
>> 
>> That assertion is obviously false... the plugin can resolve this in
>> various ways, if we allow it.
> 
> Handling it by breaking replication isn't handling it (e.g. timeouts in
> decoding etc).  Handling it by rolling back *prepared* transactions
> (which are supposed to be guaranteed to succeed!), isn't either.
> 
> 
>> You can say that in your opinion you prefer to see this handled in
>> some higher level way, though it would be good to hear why and how.
> 
> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
> issues mentioned above.
> 
> 
>> Bottom line here is we shouldn't reject this patch on this point,
> 
> I think it definitely has to be rejected because of that.  And I didn't
> bring this up at the last minute, I repeatedly brought it up before.
> Both to Craig and Stas.

Okay. In order to find more realistic cases that block replication, I've
created the following setup:

* In the backend: the test_decoding plugin hooks on xact events and utility
statement hooks and transforms each commit into a prepare, then sleeps on a
latch. If the transaction contains DDL, the whole statement is pushed into the
WAL as a transactional message. If the DDL cannot be prepared or disallows
execution in a transaction block, then it goes as a nontransactional logical
message without the prepare/decode injection. If the transaction didn't issue
any DDL and didn't write anything to WAL, then it skips 2pc too.

* After the prepare is decoded, the output plugin in the walsender unlocks the
backend, allowing it to proceed with the commit prepared. So in case the
decoding tries to access a blocked catalog, everything should stop.

* A small python script consumes the decoded WAL from the walsender (thanks
Craig and Petr).

After some acrobatics with those hooks I've managed to run the whole
regression suite in parallel mode through such a setup of test_decoding
without any deadlocks. I've added two xact_events to postgres and allowed
prepare of transactions that touched temp tables, since they are heavily used
in the tests and create a lot of noise in the diffs.

So it boils down to 3 failed regression tests out of 177, namely:

* transactions.sql — here the commit of a tx gets stuck obtaining
SafeSnapshot(). I didn't look at what is happening there specifically, but
just checked that the walsender isn't blocked. I'm going to look more closely
at this.

* prepared_xacts.sql — here select prepared_xacts() sees our prepared tx. It
is possible to filter these out, but it obviously works as expected.

* guc.sql — here pendingActions arrives on 'DISCARD ALL', preventing the tx
from being prepared. I didn't find a way to check for the presence of
pendingActions outside of async.c, so I decided to leave it as is.

It seems that, at least in the regression tests, nothing can block two-phase
logical decoding. Is that a strong enough argument for the hypothesis that the
current approach doesn't create deadlocks, except for locks on the catalog,
which should be disallowed anyway?

Patches attached. logical_twophase_v5 is a slightly modified version of the
previous patch merged with Craig's changes. The second file is a set of
patches on top of the previous one that implements the logic I've just
described. There is a runtest.sh script that sets up postgres, runs the python
logical consumer in the background and starts the regression test.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



logical_twophase_v5.diff
Description: Binary data


logical_twophase_regresstest.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-04-04 Thread Stas Kelvich

> On 4 Apr 2017, at 04:23, Masahiko Sawada  wrote:
> 
> 
> I reviewed this patch but when I tried to build contrib/test_decoding
> I got the following error.
> 

Thanks!

Yes, it seems that 18ce3a4a changed the ProcessUtility_hook signature.
Updated.

> There are still some unnecessary code in v5 patch.

Actually, the second diff isn't intended to be part of the patch; I've just
shared the way I ran the regression test suite through 2pc decoding, changing
all commits into prepare/commit where the commit happens only after the
decoding of the prepare is finished (more details in my previous message in
this thread).

That is just an argument against Andres' concern that a prepared transaction
is able to deadlock with the decoding process — at least there are no such
cases in the regression tests.

And that concern is the main thing blocking this patch. Apart from explicit
catalog locks in a prepared tx, nobody has yet found such cases, and it is
hard to address or argue about.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



logical_twophase_v6.diff
Description: Binary data


logical_twophase_regresstest.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication worker and statistics

2017-04-05 Thread Stas Kelvich

> On 27 Mar 2017, at 18:59, Robert Haas  wrote:
> 
> On Mon, Mar 27, 2017 at 11:14 AM, Fujii Masao  wrote:
>> Logical replication worker should call pgstat_report_stat()?
>> Currently it doesn't seem to do that and no statistics about
>> table accesses by logical replication workers are collected.
>> For example, this can prevent autovacuum from working on
>> those tables properly.
> 
> Yeah, that doesn't sound good.


It seems that nobody is working on this, so I'm going to create the patch.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication worker and statistics

2017-04-10 Thread Stas Kelvich

> On 10 Apr 2017, at 05:20, Noah Misch  wrote:
> 
> On Wed, Apr 05, 2017 at 05:02:18PM +0300, Stas Kelvich wrote:
>>> On 27 Mar 2017, at 18:59, Robert Haas  wrote:
>>> On Mon, Mar 27, 2017 at 11:14 AM, Fujii Masao  wrote:
>>>> Logical replication worker should call pgstat_report_stat()?
>>>> Currently it doesn't seem to do that and no statistics about
>>>> table accesses by logical replication workers are collected.
>>>> For example, this can prevent autovacuum from working on
>>>> those tables properly.
>>> 
>>> Yeah, that doesn't sound good.
>> 
>> Seems that nobody is working on this, so i’m going to create the patch.
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
> 
> [1] 
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

Here is a small patch to report statistics in the logical worker. Originally
I thought that stats collection during logical replication would have to
manually account for the number of changed tuples, but it seems that this is
already smoothly handled at the relation level. So a call to
pgstat_report_stat() is enough.

Also, I've added statistics checks to the logrep TAP tests, but that is
probably quite fragile without something like wait_for_stats() from the
stats.sql regression test.




call_pgstat_report_stat.diff
Description: Binary data


logical_worker_stats_test.diff
Description: Binary data



Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication worker and statistics

2017-04-10 Thread Stas Kelvich

> On 10 Apr 2017, at 19:50, Peter Eisentraut  
> wrote:
> 
> On 4/10/17 05:49, Stas Kelvich wrote:
>> Here is small patch to call statistics in logical worker. Originally i 
>> thought that stat
>> collection during logical replication should manually account amounts of 
>> changed tuples,
>> but seems that it is already smoothly handled on relation level. So call to 
>> pgstat_report_stat() is enough.
> 
> I wonder whether we need a similar call somewhere in tablesync.c.  It
> seems to work without it, though.

I thought it spins up the same worker from worker.c.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-12 Thread Stas Kelvich

> On 12 Apr 2017, at 20:23, Robert Haas  wrote:
> 
> On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier
>  wrote:
>> 2017-04-11 Robert Haas :
>>> If the data quality is poor (say, 50% of lines have errors) it's
>>> almost impossible to avoid runaway XID consumption.
>> 
>> Yup, that seems difficult to work around with anything similar to the
>> proposed. So the docs might need to suggest not to insert a 300 GB
>> file with 50% erroneous lines :-).
> 
> Yep.  But it does seem reasonably likely that someone might shoot
> themselves in the foot anyway.  Maybe we just live with that.
> 

Moreover, if that file consists of one-byte lines (plus one byte for the
newline char), then during its import xid wraparound will happen about 18
times =)

I think it's reasonable to at least have something like a max_errors
parameter for COPY, set by default to 1000, for example. If the user hits that
limit, that is a good moment to warn about the possible xid consumption with a
bigger limit.

However, I think it is worth a quick look into whether it is possible to
create a special code path for COPY in which errors don't cancel the
transaction, at least when COPY is called outside of a transaction block.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failed recovery with new faster 2PC code

2017-04-17 Thread Stas Kelvich

> On 17 Apr 2017, at 12:14, Simon Riggs  wrote:
> 
> On 15 April 2017 at 23:37, Jeff Janes  wrote:
>> After this commit, I get crash recovery failures when using prepared
>> transactions.
>> 
>> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>> Author: Simon Riggs 
>> Date:   Tue Apr 4 15:56:56 2017 -0400
>> 
>>Speedup 2PC recovery by skipping two phase state files in normal path
> 
> Thanks Jeff for your tests.
> 
> So that's now two crash bugs in as many days and lack of clarity about
> how to fix it.
> 
> Stas, I thought this patch was very important to you, yet two releases
> in a row we are too-late-and-buggy.

I'm looking at the pgstat issue in a nearby thread right now and will switch
to this shortly.

If possible, I'm asking to delay the revert for several days.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c

2017-04-17 Thread Stas Kelvich

> On 17 Apr 2017, at 10:30, Erik Rijkers  wrote:
> 
> On 2017-04-16 20:41, Andres Freund wrote:
>> On 2017-04-16 10:46:21 +0200, Erik Rijkers wrote:
>>> On 2017-04-15 04:47, Erik Rijkers wrote:
>>> >
>>> > 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
>>> > 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
>>> > 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
>>> > 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  +
>>> > 0005-Skip-unnecessary-snapshot-builds.patch
>>> I am now using these newer patches:
>>> https://www.postgresql.org/message-id/30242bc6-eca4-b7bb-670e-8d0458753a8c%402ndquadrant.com
>>> > It builds fine, but when I run the old pbench-over-logical-replication
>>> > test I get:
>>> >
>>> > TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File:
>>> > "pgstat.c", Line: 828)
>>> To get that error:
>> I presume this is the fault of
>> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=139eb9673cb84c76f493af7e68301ae204199746
>> if you git revert that individual commit, do things work again?
> 
> Yes, compiled from 67c2def11d4 with the above 4 patches, it runs flawlessly 
> again. (flawlessly= a few hours without any error)
> 

I’ve reproduced failure, this happens under tablesync worker and putting
pgstat_report_stat() under the previous condition block should help.

However for me it took about an hour of running this script to catch original 
assert.

Can you check with that patch applied?




logical_worker.patch
Description: Binary data

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Logical replication ApplyContext bloat

2017-04-18 Thread Stas Kelvich
Hi,

Currently the logical replication worker uses ApplyContext to decode received
data, and that context is never freed during transaction processing. Hence, if
the publication side performs something like 10M row inserts in a single
transaction, the subscription worker will occupy more than 10G of RAM (and can
be killed by the OOM killer).

The attached patch resets ApplyContext after each insert/update/delete/commit.
I've tried resetting the context only on every 100/1000/1 value of the
CommandCounter, but didn't spot any measurable difference in speed.

The problem was spotted by Mikhail Shurutov.
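
Just to illustrate the effect in isolation, here is a toy standalone analogue
(this is not the patch and not PostgreSQL's memory context code): a context
that is reset only at "commit" accumulates one chunk per applied message,
while resetting it after every message keeps the footprint flat.

#include <stdio.h>
#include <stdlib.h>

typedef struct Chunk
{
    struct Chunk *next;
} Chunk;

typedef struct Context
{
    Chunk  *chunks;
    long    nlive;
} Context;

static void *
ctx_alloc(Context *ctx, size_t size)
{
    Chunk  *c = malloc(sizeof(Chunk) + size);

    c->next = ctx->chunks;
    ctx->chunks = c;
    ctx->nlive++;
    return c + 1;
}

static void
ctx_reset(Context *ctx)
{
    while (ctx->chunks)
    {
        Chunk  *next = ctx->chunks->next;

        free(ctx->chunks);
        ctx->chunks = next;
    }
    ctx->nlive = 0;
}

int
main(void)
{
    Context ctx = {NULL, 0};
    long    peak_reset_at_commit = 0;
    long    peak_reset_per_message = 0;
    long    i;

    /* "apply" 100000 messages, resetting only at commit */
    for (i = 0; i < 100000; i++)
    {
        ctx_alloc(&ctx, 128);               /* decoded message data */
        if (ctx.nlive > peak_reset_at_commit)
            peak_reset_at_commit = ctx.nlive;
    }
    ctx_reset(&ctx);                        /* commit */

    /* same workload, resetting after every message */
    for (i = 0; i < 100000; i++)
    {
        ctx_alloc(&ctx, 128);
        if (ctx.nlive > peak_reset_per_message)
            peak_reset_per_message = ctx.nlive;
        ctx_reset(&ctx);
    }

    printf("peak live chunks, reset at commit only : %ld\n", peak_reset_at_commit);
    printf("peak live chunks, reset per message    : %ld\n", peak_reset_per_message);
    return 0;
}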



applycontext_bloat.patch
Description: Binary data


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication ApplyContext bloat

2017-04-19 Thread Stas Kelvich

> On 19 Apr 2017, at 12:37, Petr Jelinek  wrote:
> 
> On 18/04/17 13:45, Stas Kelvich wrote:
>> Hi,
>> 
>> currently logical replication worker uses ApplyContext to decode received 
>> data
>> and that context is never freed during transaction processing. Hence if 
>> publication
>> side is performing something like 10M row inserts in single transaction, then
>> subscription worker will occupy more than 10G of ram (and can be killed by 
>> OOM).
>> 
>> Attached patch resets ApplyContext after each insert/update/delete/commit.
>> I’ve tried to reset context only on each 100/1000/1 value of 
>> CommandCounter,
>> but didn’t spotted any measurable difference in speed.
>> 
>> Problem spotted by Mikhail Shurutov.
>> 
> 
> Hmm we also do replication protocol handling inside the ApplyContext
> (LogicalRepApplyLoop), are you sure this change is safe in that regard?

Memory is bloated by logicalrep_read_* when palloc happens inside them. I've
inserted the context reset in ensure_transaction(), which is only called at
the beginning of handle_insert/update/delete/commit, when the data from the
previous call of those functions isn't needed anymore. So it is probably safe
to clear the memory there. At least I don't see any state that can be accessed
later in these functions. Do you have any specific concerns?


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication ApplyContext bloat

2017-04-19 Thread Stas Kelvich

> On 19 Apr 2017, at 13:34, Simon Riggs  wrote:
> 
> On 19 April 2017 at 11:24, Petr Jelinek  wrote:
> 
>> I'd still like you to add comment to the ApplyContext saying that it's
>> cleaned after handling each message so nothing that needs to survive for
>> example whole transaction can be allocated in it as that's not very well
>> visible IMHO (and I know I will forget about it when writing next patch
>> in that area).
> 
> It would be better to invent the contexts we want now, so we get the
> architecture right for future work. Otherwise we have problems with
> "can't backpatch this fix because that infrastructure doesn't exist in
> PG10" in a couple of years.
> 
> So I suggest we have
> 
> ApplyMessageContext - cleaned after every message
> ApplyTransactionContext - cleaned at EOXact
> ApplyContext - persistent

Right now ApplyContext is cleaned after each transaction, and with this patch
I basically suggest cleaning it after each command counter increment.

So in your definitions that is ApplyMessageContext, the most short-lived one.
We can rename ApplyContext -> ApplyMessageContext now to make that clear and
to avoid a possible name conflict when somebody decides to keep something for
the whole transaction or the worker lifespan.

P.S. It also seems to me now that resetting the context in
ensure_transaction() isn't that good an idea; it's probably better to
explicitly call the reset at the end of each function involved.

> 
> -- 
> Simon Riggs    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication ApplyContext bloat

2017-04-19 Thread Stas Kelvich

> On 19 Apr 2017, at 14:30, Petr Jelinek  wrote:
> 
> On 19/04/17 12:46, Stas Kelvich wrote:
>> 
>> Right now ApplyContext cleaned after each transaction and by this patch I 
>> basically 
>> suggest to clean it after each command counter increment. 
>> 
>> So in your definitions that is ApplyMessageContext, most short-lived one. We 
>> can
>> rename now ApplyContext -> ApplyMessageContext to make that clear and avoid
>> possible name conflict when somebody will decide to keep something during 
>> whole
>> transaction or worker lifespan.
> 
> Yeah we can rename, and then rename the ApplyCacheContext to
> ApplyContext. That would leave the ApplyTransactionContext from Simon's
> proposal which is not really need it anywhere right now and I am not
> sure it will be needed given there is still TopTransactionContext
> present. If we really want ApplyTransactionContext we can probably just
> assign it to TopTransactionContext in ensure_transaction.
> 
>> 
>> P.S. It also seems to me now that resetting context in ensure_transaction() 
>> isn’t that
>> good idea, probably better just explicitly call reset at the end of each 
>> function involved.
>> 
> 
> Well it would definitely improve clarity.
> 

Okay, done.

With the patch, MemoryContextStats() shows the following hierarchy during
slot operations in the apply worker:

TopMemoryContext: 83824 total in 5 blocks; 9224 free (8 chunks); 74600 used
  ApplyContext: 8192 total in 1 blocks; 6520 free (4 chunks); 1672 used
    ApplyMessageContext: 8192 total in 1 blocks; 6632 free (11 chunks); 1560 used
  ExecutorState: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
    ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used




apply_contexts.patch
Description: Binary data

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication ApplyContext bloat

2017-04-20 Thread Stas Kelvich

> On 19 Apr 2017, at 16:07, Alvaro Herrera  wrote:
> 
> Stas Kelvich wrote:
> 
>> With patch MemoryContextStats() shows following hierarchy during slot 
>> operations in 
>> apply worker:
>> 
>> TopMemoryContext: 83824 total in 5 blocks; 9224 free (8 chunks); 74600 used
>>  ApplyContext: 8192 total in 1 blocks; 6520 free (4 chunks); 1672 used
>>ApplyMessageContext: 8192 total in 1 blocks; 6632 free (11 chunks); 1560 
>> used
>>  ExecutorState: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
>>ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
> 
> Please update src/backend/utils/mmgr/README to list the new contexts.


Thanks for noting that.

I've added a short description of ApplyContext and ApplyMessageContext to the
README.



apply_contexts_2.patch
Description: Binary data




Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dynamic instrumentation of lwlock wait times (lwlock flamegraphs)

2017-06-23 Thread Stas Kelvich

> On 23 Jun 2017, at 00:08, Andres Freund  wrote:
> 
> Hi,
> 
> At pgcon some people were talking about the difficulty of instrumenting
> the time actually spent waiting for lwlocks and related measurements.
> I'd mentioned that linux these days provides infrastructure to measure
> such things in unmodified binaries.
> 
> Attached is a prototype of a script that measures the time spent inside
> PGSemaphoreLock(), aggregates that inside the kernel, grouped by pid and
> stacktrace.  That allows one to generate nice flame graphs showing which
> part of the code waits how long for lwlocks.
> 
> The attached script clearly needs improvements, but I thought I'd show
> some of the results it can get.  To run it you need the the python
> library of the 'bcc' project [1], and a sufficiently new kernel.  Some
> distributions, e.g. newer debian versions, package this as python-bpfcc
> and similar.
> 
> The output of the script can be turned into a flamegraph with
> https://github.com/brendangregg/FlameGraph 's flamegraph.pl.
> 
> Here's a few examples of a pgbench run. The number is the number of
> clients, sync/async indicates synchronous_commit on/off.  All the
> numbers here were generated with the libpq & pgbench batch patches
> applied and in use, but that's just because that was the state of my
> working tree.
> 
> http://anarazel.de/t/2017-06-22/pgsemwait_8_sync.svg
> http://anarazel.de/t/2017-06-22/pgsemwait_8_async.svg
> http://anarazel.de/t/2017-06-22/pgsemwait_64_sync.svg
> http://anarazel.de/t/2017-06-22/pgsemwait_64_async.svg
> 
> A bonus, not that representative one is the wait for a pgbench readonly
> run after the above, with autovacuum previously disabled:
> http://anarazel.de/t/2017-06-22/pgsemwait_64_select.svg
> interesting to see how the backends themselves never end up having to
> flush WAL themselves, or at least not in a manner triggering contention.
> 
> I plan to write a few more of these, because they're hugely useful to
> understand actual locking behaviour. Among them:
> - time beteen Acquire/Release of lwlocks, grouped by lwlock
> - time beteen Acquire/Release of lwlocks, grouped by stack
> - callstack of acquirer and waker of lwlocks, grouped by caller stack, waiter 
> stack
> - ...
> 
> I think it might be interesting to collect a few of these somewhere
> centrally once halfway mature.  Maybe in src/tools or such.

Wow, that’s extremely helpful, thanks a lot.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-07-27 Thread Stas Kelvich

> On 27 Jul 2017, at 04:28, Robert Haas  wrote:
> 
>  However, you would not
> be guaranteed that all of those commits or rollbacks happen at
> anything like the same time.  So, you would have a sort of eventual
> consistency.

As far as I understand, any solution that provides proper isolation for distributed
transactions in postgres will require distributed 2PC somewhere under the hood.
That is just a consequence of the parallelism that the database allows — transactions can
abort due to concurrent operations. So the dichotomy is simple: either we need 2PC or we
restrict write transactions to be physically serial.

In particular, both Postgres-XL/XC and postgrespro multimaster use 2PC to
commit distributed transactions.
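
For reference, a minimal sketch of the per-node command sequence such an atomic
commit boils down to (the table and GID names here are made up):

-- On each participating server, driven by the coordinator:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- node-local part of the work
PREPARE TRANSACTION 'dist_tx_42';

-- Once every node has prepared successfully, the coordinator finishes with:
COMMIT PREPARED 'dist_tx_42';
-- or, if any node failed to prepare:
ROLLBACK PREPARED 'dist_tx_42';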

Some years ago we created patches to implement a transaction manager API, which is just a
way to inject consistent snapshots on different nodes; atomic commit itself is out of
scope of the TM API (hmm, maybe it is better to call it a snapshot manager API?). That
allows us to use it in quite different environments like fdw and logical replication, and
both are using 2PC.

I want to submit the TM API again during this release cycle along with an implementation
for fdw, and I planned to base it on top of this patch. So I have already rebased
Masahiko’s patch onto current -master and started writing a long list of nitpicks, but
haven't finished yet.

Also, I see quite a big value in this patch even without tm/snapshots/whatever.
Right now fdw guarantees neither isolation nor atomicity. If one isn’t doing cross-node
analytical transactions it may be safe to live without isolation, but living without
atomicity means that some parts of the data can be lost without a simple way to detect
and fix that.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-08-01 Thread Stas Kelvich

> On 31 Jul 2017, at 20:03, Robert Haas  wrote:
> 
> Regardless of whether we share XIDs or DXIDs, we need a more complex
> concept of transaction state than we have now.

It seems that the discussion has shifted from 2PC itself to general issues with
distributed transactions. So it is probably appropriate to share here a summary of the
things we have done in the area of distributed visibility. During the last two years we
tried three quite different approaches and finally settled on Clock-SI.

At first, to test the different approaches we wrote a small patch that wraps calls to
visibility-related functions (SetTransactionStatus, GetSnapshot, etc., described in
detail at the wiki [1]) in order to allow overloading them from an extension. Such an
approach allows implementing almost anything related to distributed visibility, since you
have full control over how local visibility is done. That API isn’t a hard prerequisite,
and if one wants to create some concrete implementation it can be done just in place.
However, I think it is good to have such an API in some form.

So, the three approaches that we tried:

1) Postgres-XL-like:

That is the most straightforward way. Basically we need a separate network service
(GTM/DTM) that is responsible for xid generation and for managing the running list of
transactions, so acquiring an xid and a snapshot is done by network calls. Because of the
shared xid space it is possible to compare xids in the ordinary way and get the right
order. The gap between non-simultaneous 2PC commits is covered by the fact that we get
our snapshots from the GTM, and it removes an xid from the running list only when the
transaction has committed on both nodes.

Such an approach is okay for OLAP-style transactions where tps isn’t high. But with an
OLTP workload at a high transaction rate the GTM immediately becomes a bottleneck, since
even write transactions need to get their snapshot from the GTM, even if they access only
one node.


2) Incremental SI [2]

An approach with a central coordinator that allows local reads without network
communication by slightly altering the visibility rules.

Apart from the fact that it is kind of patented, we also failed to achieve proper
visibility by implementing the algorithms from that paper; it always showed some
inconsistencies. Maybe because of bugs in our implementation, maybe because of some
typos/mistakes in the algorithm description itself. The reasoning in the paper wasn’t
very clear to us, nor were the patent issues, so we just left that.


3) Clock-SI [3]

It is an MS Research paper that describes an algorithm similar to the ones used in
Spanner and CockroachDB, without a central GTM and with reads that do not require a
network round trip.

There are two ideas behind it:

* Assuming snapshot isolation where visibility on a node is based on CSNs, use local time
as the CSN; then, when doing 2PC, collect the prepare time from all participating nodes
and commit the transaction everywhere with the maximum of those times. If during a read a
node faces tuples committed by a tx with a CSN greater than its snapshot CSN (which can
happen due to clock desynchronisation between nodes), it just waits until that time
comes. So clock desynchronisation can affect performance, but can’t affect correctness.

* During a distributed commit the transaction is neither running (if it commits, then the
tuple should already be visible) nor committed/aborted (it still can be aborted, so it is
illegal to read). So here an IN-DOUBT transaction state appears, in which readers should
wait for writers.

We managed to implement that using the mentioned XTM API. The XID<->CSN mapping is
tracked by the extension itself. Speed/scalability are also good.

I want to resubmit an implementation of that algorithm for FDW later in August, along
with some isolation tests based on the set of queries in [4].


[1] https://wiki.postgresql.org/wiki/DTM#eXtensible_Transaction_Manager_API
[2] http://pi3.informatik.uni-mannheim.de/~norman/dsi_jour_2014.pdf
[3] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/samehe-clocksi.srds2013.pdf
[4] https://github.com/ept/hermitage


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication ApplyContext bloat

2017-05-03 Thread Stas Kelvich

> On 20 Apr 2017, at 17:01, Dilip Kumar  wrote:
> 
> On Thu, Apr 20, 2017 at 7:04 PM, Stas Kelvich  
> wrote:
>> Thanks for noting.
>> 
>> Added short description of ApplyContext and ApplyMessageContext to README.
> 
> Typo
> 
> /analysys/analysis
> 

Fixed, thanks.

Added this to the open items list.



apply_contexts_3.patch
Description: Binary data


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication ApplyContext bloat

2017-05-09 Thread Stas Kelvich

> On 9 May 2017, at 21:53, Peter Eisentraut  
> wrote:
> 
> On 5/8/17 11:26, Peter Eisentraut wrote:
>> I think it would make more sense to reset ApplyMessageContext in
>> apply_dispatch(), so that it is done consistently after every message
>> (as the name implies), not only some messages.
> 
> Committed that.
> 
>> Also, perhaps ApplyMessageContext should be a child of
>> TopTransactionContext.  (You have it as a child of ApplyContext, which
>> is under TopMemoryContext.)
> 
> Left that as is.

Thanks!

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-05-11 Thread Stas Kelvich

> On 10 May 2017, at 11:43, Petr Jelinek  wrote:
> 
> On 09/05/17 22:11, Erik Rijkers wrote:
>> On 2017-05-09 21:00, Petr Jelinek wrote:
>>> On 09/05/17 19:54, Erik Rijkers wrote:
>>>> On 2017-05-09 11:50, Petr Jelinek wrote:
>>>> 
>>> 
>>> Ah okay, so this is same issue that's reported by both Masahiko Sawada
>>> [1] and Jeff Janes [2].
>>> 
>>> [1]
>>> https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
>>> 
>>> [2]
>>> https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com
>>> 
>> 
>> I don't understand why you come to that conclusion: both Masahiko Sawada
>> and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't. 
>> Isn't that a real difference?
>> 
> 

Just noted another issue/feature with the snapshot builder — when logical decoding is in
progress and a VACUUM FULL command is issued, quite a big number of files appears in
pg_logical/mappings and resides there until the next checkpoint. Also, having consecutive
VACUUM FULLs before a checkpoint yields even more files.

Having two pgbench-filled instances with logical replication between them:

for i in {1..100}; do psql -c 'vacuum full' && ls -la pg_logical/mappings | wc -l; done;
VACUUM
  73
VACUUM
 454
VACUUM
1146
VACUUM
2149
VACUUM
3463
VACUUM
5088
<...>
VACUUM
   44708
<…> // checkpoint happens somewhere here
VACUUM
   20921
WARNING:  could not truncate file "base/16384/61773": Too many open files in system
ERROR:  could not fsync file "pg_logical/mappings/map-4000-4df-0_A4EA29F8-5aa5-5ae6": Too many open files in system


I’m not sure whether this boils down to some of the previously mentioned issues or not,
so I'm posting it here as an observation.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-09-27 Thread Stas Kelvich

> On 7 Sep 2017, at 18:58, Nikhil Sontakke  wrote:
> 
> Hi,
> 
> FYI all, wanted to mention that I am working on an updated version of
> the latest patch that I plan to submit to a later CF.
> 

Cool!

So what kind of architecture do you have in mind? The same way as it was implemented
before? As far as I remember there were two main issues:

* Decoding of aborted prepared transactions.

If such a transaction modified the catalog then we can’t read reliable info with our
historic snapshot: since clog already has the aborted bit for our tx, it will break the
visibility logic. There are some ways to deal with that — by doing the catalog seq scan
two times and counting the number of tuples (details upthread), or by hijacking clog
values in the historic visibility function. But ISTM it is better not to solve this issue
at all =) In most cases the intended usage of decoding a 2PC transaction is to do some
form of distributed commit, so naturally decoding will happen only for in-progress
transactions, and commit/abort will happen only after the transaction is decoded, sent,
and a response is received. So we can just have an atomic flag that prevents commit/abort
of the tx currently being decoded. And we can filter interesting prepared transactions
based on GID, to avoid holding this lock for ordinary 2PC.

* Possible deadlocks that Andres was talking about.

I spent some time trying to find those, but didn’t find any. If locking pg_class in a
prepared tx is the only example, then (imho) it is better to just forbid preparing such
transactions. Otherwise, if realistic examples that can block decoding actually exist,
then we probably need to reconsider the way the tx is decoded. Anyway, this part probably
needs Andres's blessing.



Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Cube extension kNN support

2013-09-22 Thread Stas Kelvich
Hello, hackers.

Here is the patch that introduces kNN search for cubes with euclidean, taxicab 
and chebyshev distances.

The following distance operators are introduced:

<#> taxicab distance
<->  euclidean distance
<=> chebyshev distance

For example:
SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 10;

Also there is an operator "->" for selecting ordered rows directly from the index.
This query selects rows ordered ascending by the 3rd coordinate:

SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10; 

For descending ordering the suggested syntax puts a minus before the coordinate.
This query selects rows ordered descending by the 4th coordinate:

SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10; 
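
For these ORDER BY queries to be answered from the index rather than by a sort, a GiST
index on the cube column is assumed (the index name is arbitrary):

CREATE INDEX objects_coord_idx ON objects USING gist (coord);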

Stas Kelvich.




distances.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension point support // GSoC'13

2013-09-24 Thread Stas Kelvich
Hello

There is a new version of the patch. I have separated the ordering operators into a
different patch (https://commitfest.postgresql.org/action/patch_view?id=1243), fixed
formatting issues and implemented backward compatibility with old-style points in
cube_is_point() and cube_out().

Also, comparing the output files I've discovered that these four files are a combination
of two kinds of differing behavior:

1) SELECT '-1e-700'::cube AS cube;
can be (0) or (-0)

2) The number of zeros in the exponent of a floating-point value, i.e. SELECT '1e27'::cube
AS cube;
can be (1e+027) or (1e+27)

On my system (OSX) it is the second option in both situations. I've also tested it on
FreeBSD 9.0 and Ubuntu 12.04 with the same results. So are there any ideas how I can
reproduce such results?

Stas.



points.diff
Description: Binary data


On Sep 16, 2013, at 10:48 AM, Heikki Linnakangas wrote:

> On 12.07.2013 14:57, Stas Kelvich wrote:
>> Hello.
>> 
>> here is a patch adding to cube extension support for compressed 
>> representation of point cubes. If cube is a point, i.e. has coincident lower 
>> left and upper right corners, than only one corner is stored. First bit of 
>> the cube header indicates whether the cube is point or not. Few moments:
>> 
>> * Patch preserves binary compatibility with old indices
>> * All functions that create cubes from user input, check whether it is a 
>> point or not
>> * All internal functions that can return cubes takes care of all cases where 
>> a cube might become a point
> 
> Great!
> 
> cube_is_point() needs to still handle old-style points. An NDBOX without the 
> point-flag set, where the ll and ur coordinates for each dimension are the 
> same, still needs to be considered a point. Even if you are careful to never 
> construct such structs in the code, they can be present on-disk if you have 
> upgraded from an earlier version with pg_upgrade. Same in cube_out().
> 
>> * Added tests for checking correct point behavior
> 
> You'll need to adjust all the expected output files, not only cube_1.out.
> 
>> Also this patch includes adapted Alexander Korotkov's patch with kNN-based 
>> ordering operator, which he wrote for postgresql-9.0beta1 with knngist 
>> patch. More info there 
>> http://www.postgresql.org/message-id/aanlktimhfaq6hcibrnk0tlcqmiyhywhwaq2zd87wb...@mail.gmail.com
> 
> To make review easier, it would be better to keep that as a separate patch, 
> actually. Could you split it up again, please?
> 
> - Heikki


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Cube extension types support

2013-09-24 Thread Stas Kelvich
Hello, hackers.

In this patch I've implemented support for different storage types for cubes. It now
supports float8, float4, int4, int2, int1. The type is stored in the header of each cube,
once for all coordinates, so for cubes with int1 coordinates it can save up to 8x disk
space. Typed cubes can be created in two ways:

1) Via cube_suffix() functions, where the suffix can be "f4", "i4", "i2", "i1", and the
arguments are one or two numbers or arrays, i.e.

# select cube_i1(1,2) as c;
     c
------------
 (1),(2):i1

# select cube_f4(array[1,2,3], array[5,6,7]) as c;
           c
------------------------
 (1, 2, 3),(5, 6, 7):f4

2) Via a modifier at the end of the string that will be cast to cube, i.e.

# select '(1,2,3):i2'::cube as c;
      c
--------------
 (1, 2, 3):i2

When no modifier is given, float8 will be used by default. Old-style cubes without a type
in the header are also treated as float8 for backward compatibility.
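
A quick way to see the space saving, assuming this patch is applied (the ':i1' modifier
only exists with it):

# select pg_column_size('(1,2,3)'::cube)    as float8_size,
         pg_column_size('(1,2,3):i1'::cube) as int1_size;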

This patch changes a lot of things in the code, so it interferes with other patches to
the cube extension. I will update this patch if the other patches are accepted for
commit.



types_cumulative.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Cube extension split algorithm fix

2013-09-25 Thread Stas Kelvich
Hello, hackers.

I've fixed the split algorithm that was implemented in the cube extension. I've changed
it according to the original Guttman paper (the old version was a simpler algorithm) and
also ported Alexander Korotkov's algorithm from box datatype indexing, which works faster
and better on low dimensions.

On my test dataset (1M records, 7 dimensions, a real-world database of goods) the numbers
were the following:

Building index over table (on expression):
old: 67.296058 seconds
new: 48.842391 seconds

Cube point search, mean, 100 queries
old: 0.001025 seconds
new: 0.000427 seconds

Index on field size:
old: 562 MB
new: 283 MB

Stas.



split.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Tsvector editing functions

2015-10-05 Thread Stas Kelvich
Hello.

Here is a patch that adds some editing routines for the tsvector type; a short usage
sketch follows the list below.

tsvector delete(tsvector, text)
    removes an entry from a tsvector by lexeme name
set unnest(tsvector)
    expands a tsvector into a set of rows. Each row has the following columns:
    lexeme, postings, weights.
text[] to_array(tsvector)
    converts a tsvector to an array of lexemes
tsvector to_tsvector(text[])
    converts an array of lexemes to a tsvector



tsvector_funcs.diff
Description: Binary data


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension kNN support

2015-10-29 Thread Stas Kelvich
Hello.

That is an updated version of the patch with proper update scripts.

Also I’ve noted that the documentation states the wrong thing:

“It does not matter which order the opposite corners of a cube are entered in. 
The cube functions automatically swap values if needed to create a uniform 
"lower left — upper right" internal representation."

But in practice cubes are stored "as is", and that leads to problems with getting cubes
sorted along a specific dimension directly from the index.
As the simplest workaround I’ve deleted that sentence from the docs and implemented two
coordinate getters, -> and ~>. The first one returns the coordinate of the cube as it is
stored, and the second returns the coordinate of the cube normalised to (LL,UR) form.
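
A sketch of how the two getters can be compared, assuming this version of the patch
(output omitted, since the exact numbering is what is being discussed):

SELECT c -> 2 AS as_stored, c ~> 2 AS normalized
FROM (SELECT '(1,3),(3,1)'::cube AS c) AS t;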

Another way to fix this is to force ’normalization’ while creating the cube, but that can
produce wrong sorts with already existing data.

> On 09 Jul 2015, at 16:40, Alexander Korotkov  wrote:
> 
> Hi!
> 
> On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich  wrote:
> Patch is pretty ready, last issue was about changed extension interface, so 
> there should be migration script and version bump.
> Attaching a version with all migration stuff.
> 
> I can't see cube--1.0--1.1.sql in the patch. Did forget to include it?
> 
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company 

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



distances.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension kNN support

2015-03-12 Thread Stas Kelvich
Documentation added, along with a style fix.



distances2r3.patch
Description: Binary data


> On 08 Feb 2015, at 00:32, Alexander Korotkov  wrote:
> 
> Hi!
> 
> On Sat, Feb 7, 2015 at 12:45 PM, Stas Kelvich  wrote:
> I had updated old patch with kNN operators for cube data structures. Copying 
> description from old message:
> 
> Following distance operators introduced:
> 
> <#> taxicab distance
> <->  euclidean distance
> <=> chebyshev distance
> 
> For example:
> SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 
> 10;
> 
> Also there is operator "->" for selecting ordered rows directly from index.
> This request selects rows ordered ascending by 3rd coordinate:
> 
> SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;
> 
> For descendent ordering suggested syntax with minus before coordinate.
> This request selects rows ordered descending by 4th coordinate:
> 
> SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;
> 
> I've checked the patch. The first notes are so:
> 1) Check coding style, in particular braces. Postgres coding style require 
> using it for multiline statements.
> 2) Update documentation according to new features.
> 
> --
> With best regards,
> Alexander Korotkov.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension kNN support

2015-05-08 Thread Stas Kelvich
Hi!

The patch is pretty much ready; the last issue was about the changed extension interface,
so there should be a migration script and a version bump.
Attaching a version with all the migration stuff.



distances2r4.patch
Description: Binary data


Stas.


> On 09 May 2015, at 05:20, Alvaro Herrera  wrote:
> 
> Sergey Konoplev wrote:
>> On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev  wrote:
>>> On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich  
>>> wrote:
>>>> Here is the patch that introduces kNN search for cubes with euclidean, 
>>>> taxicab and chebyshev distances.
>>> 
>>> What is the status of this patch?
>> 
>> Referring to our private conversation with Alexander Korotkov, the
>> patch is in WIP state currently, and, hopefully, will be ready by 9.5.
>> I'm ready to actively participate in its testing on a real world
>> production set of data.
> 
> This patch doesn't seem to have received an updated version.  Should we
> just punt on it?  The assumption would be that Stas or Alexander will be
> re-submitting this for 9.6.
> 
> -- 
> Álvaro Herrerahttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-12-07 Thread Stas Kelvich
Hello.

Done with the list of suggestions. I also heavily edited the delete function.



tsvector_ops.diff
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

> On 27 Nov 2015, at 15:13, Teodor Sigaev  wrote:
> 
> Hmm, seems, it will be useful to add two fuctions:
> 
> tsvector filter(tsvector, array_of_weigths) - returns tsvector contains 
> lexemes with given weights
> 
> tsvector setweight(tsvector, weigth, array_of_lexemes) - sets given weight 
> for given lexemes
> 
> Stas Kelvich wrote:
>> Hello.
>> 
>> There is patch that adds some editing routines for tsvector type.
>> 
>> tsvector delete(tsvector, text)
>>  removes entry from tsvector by lexeme name
>> set unnest(tsvector)
>>  expands a tsvector to a set of rows. Each row has following columns: 
>> lexeme, postings, weights.
>> text[] to_array(tsvector)
>>  converts tsvector to array of lexemes
>> tsvector to_tsvector(text[])
>>  converts array of lexemes to tsvector
>> 
>> 
>> 
>> 
>> 
>> 
>> Stas Kelvich
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> Teodor Sigaev   E-mail: teo...@sigaev.ru
>   WWW: http://www.sigaev.ru/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension kNN support

2015-12-07 Thread Stas Kelvich
Hello, fixed.



cube_distances.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

> On 01 Dec 2015, at 17:52, Teodor Sigaev  wrote:
> 
> Patch looks good, but there ara some review notices:
> 1 gmake installcheck fails:
> *** /.../pgsql/contrib/cube/expected/cube_1.out   2015-12-01 
> 17:49:01.768764000 +0300
> --- /.../pgsql/contrib/cube/results/cube.out  2015-12-01 
> 17:49:12.190818000 +0300
> ***
> *** 1382,1388 
>  (1 row)
> 
>  -- Test of distances
> ! --
>  SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube);
>   cube_distance
>  ---
> --- 1382,1388 
>  (1 row)
> 
>  -- Test of distances
> ! --
>  SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube);
>   cube_distance
> 
> Seems, there a extra space at the end of string
> 
> 2 Pls, don use in C-code magick constants like 'case 16:'. Use macros to 
> define some human-readable name (g_cube_distance())
> 
> 3 Switch in g_cube_distance(): default switch path should generate a error. 
> It just simplifies a degbug process, may be in future.
> 
> 4 Docs: pls, don't use a strings with unlimited length.
> 
> 
> Stas Kelvich wrote:
>> Hello.
>> 
>> That is updated version of the patch with proper update scripts.
>> 
>> Also i’ve noted that documentation states the wrong thing:
>> 
>> “It does not matter which order the opposite corners of a cube are entered 
>> in. The cube functions automatically swap values if needed to create a 
>> uniform "lower left — upper right" internal representation."
>> 
>> But in practice cubes stored "as is" and that leads to problems with getting 
>> cubes sorted along specific dimension directly from index.
>> As a simplest workaround i’ve deleted that sentence from docs and 
>> implemented two coordinate getters -> and ~>. First one returns
>> coordinate of cube as it stored, and second returns coordinate of cube 
>> normalised to (LL,UR)-form.
>> 
>> Other way to fix thing is to force ’normalization’ while creating cube. But 
>> that can produce wrong sorts with already existing data.
>> 
>>> On 09 Jul 2015, at 16:40, Alexander Korotkov  wrote:
>>> 
>>> Hi!
>>> 
>>> On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich  wrote:
>>> Patch is pretty ready, last issue was about changed extension interface, so 
>>> there should be migration script and version bump.
>>> Attaching a version with all migration stuff.
>>> 
>>> I can't see cube--1.0--1.1.sql in the patch. Did forget to include it?
>>> 
>>> --
>>> Alexander Korotkov
>>> Postgres Professional: http://www.postgrespro.com
>>> The Russian Postgres Company
>> 
>> Stas Kelvich
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>> 
>> 
>> 
>> 
> 
> -- 
> Teodor Sigaev   E-mail: teo...@sigaev.ru
>   WWW: http://www.sigaev.ru/
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Speedup twophase transactions

2015-12-09 Thread Stas Kelvich
Hello.

While working on cluster stuff (DTM, tsDTM) we noted that postgres 2PC transactions are
approximately two times slower than an ordinary commit on a workload with fast
transactions — a few single-row updates and COMMIT or PREPARE/COMMIT. Perf top showed
that a lot of time is spent in the kernel on fopen/fclose, so it is worth trying to
reduce the file operations for 2PC transactions.

Now 2PC in postgres does the following:
* on prepare, the 2PC data (subxacts, commitrels, abortrels, invalmsgs) is saved to xlog
and to a file, but the file is not fsynced
* on commit, the backend reads the data from the file
* if a checkpoint occurs before the commit, then the files are fsynced during the checkpoint
* in case of crash, replay will move the data from xlog to files

In this patch I’ve changed these procedures to the following:
* on prepare, the backend writes the data only to xlog and stores a pointer to the start
of the xlog record
* if the commit occurs before a checkpoint, then the backend reads the data from xlog by
this pointer
* on checkpoint, the 2PC data is copied to files and fsynced
* if the commit happens after a checkpoint, then the backend reads the files
* in case of crash, replay will move the data from xlog to files (as it was before the
patch)

Most of these ideas were already mentioned in a 2009 thread by Michael Paquier
http://www.postgresql.org/message-id/c64c5f8b0908062031k3ff48428j824a9a46f2818...@mail.gmail.com
 where he suggested storing the 2PC data in shared memory.
At that time the patch was declined because no significant speedup was observed.
Now I see a performance improvement from my patch of about 60%. Probably the old
benchmark's overall tps was lower and it was harder to hit the filesystem fopen/fclose
limits.

Now the results of the benchmark are the following (dual 6-core xeon server):

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps

The benchmark was done with the following script:

\set naccounts 10 * :scale
\setrandom from_aid 1 :naccounts
\setrandom to_aid 1 :naccounts
\setrandom delta 1 100
\set scale :scale+1
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';



2pc_xlog.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2015-12-09 Thread Stas Kelvich
Thanks, Kevin.

> I assume that last one should have been *Patched master with 2PC”?

Yes, this list should look like this:

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Patched master with 2PC: ~36 ktps

And I have created a CommitFest entry for this patch.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 10 Dec 2015, at 00:37, Kevin Grittner  wrote:
> 
> On Wed, Dec 9, 2015 at 12:44 PM, Stas Kelvich  
> wrote:
> 
>> Now 2PC in postgres does following:
>> * on prepare 2pc data (subxacts, commitrels, abortrels, invalmsgs) saved to 
>> xlog and to file, but file not is not fsynced
>> * on commit backend reads data from file
>> * if checkpoint occurs before commit, then files are fsynced during 
>> checkpoint
>> * if case of crash replay will move data from xlog to files
>> 
>> In this patch I’ve changed this procedures to following:
>> * on prepare backend writes data only to xlog and store pointer to the start 
>> of the xlog record
>> * if commit occurs before checkpoint then backend reads data from xlog by 
>> this pointer
>> * on checkpoint 2pc data copied to files and fsynced
>> * if commit happens after checkpoint then backend reads files
>> * in case of crash replay will move data from xlog to files (as it was 
>> before patch)
> 
> That sounds like a very good plan to me.
> 
>> Now results of benchmark are following (dual 6-core xeon server):
>> 
>> Current master without 2PC: ~42 ktps
>> Current master with 2PC: ~22 ktps
>> Current master with 2PC: ~36 ktps
> 
> I assume that last one should have been *Patched master with 2PC"?
> 
> Please add this to the January CommitFest.
> 
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2015-12-10 Thread Stas Kelvich
Michael, Jeff, thanks for reviewing and testing.

> On 10 Dec 2015, at 02:16, Michael Paquier  wrote:
> 
> This has better be InvalidXLogRecPtr if unused.


Yes, that’s better. Changed.

> On 10 Dec 2015, at 02:16, Michael Paquier  wrote:

> +if (gxact->prepare_lsn)
> +{
> +XlogReadTwoPhaseData(gxact->prepare_xlogptr, &buf, NULL);
> +}
> Perhaps you mean prepare_xlogptr here?


Yes, my bad. But funnily enough, I made this error an even number of times: the code in
CheckPointTwoPhase also uses prepare_lsn instead of xlogptr, so overall this was working
well; that’s why it survived my own tests and probably Jeff’s tests.
I think that’s bad variable naming, for example because the lsn in pg_xlogdump points to
the start of the record, but here the start was used as xlogptr and the end as lsn.
So I changed both variables to prepare_start_lsn and prepare_end_lsn.

> On 10 Dec 2015, at 09:48, Jeff Janes  wrote:
> I've tested this through my testing harness which forces the database
> to go through endless runs of crash recovery and checks for
> consistency, and so far it has survived perfectly.


Cool! I think the patch is most vulnerable to the following type of workload: prepare a
transaction, do a lot of stuff with the database to force checkpoints (or even recovery
cycles), and then commit it.
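
Spelled out, something along these lines (the GID is arbitrary):

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;
PREPARE TRANSACTION 'crash_test_1';
-- generate enough WAL to force a checkpoint here, or simply issue one by hand
CHECKPOINT;
COMMIT PREPARED 'crash_test_1';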

> On 10 Dec 2015, at 09:48, Jeff Janes  wrote:

> Can you give the full command line?  -j, -c, etc.


pgbench -h testhost -i && pgbench -h testhost -f 2pc.pgb -T 300 -P 1 -c 64 -j 
16 -r

where 2pc.pgb is as in the previous message.

Also, all this applies to hosts with uniform memory. I tried to run patched postgres on a
NUMA machine with 60 physical cores and the patch didn’t change anything. Perf top shows
that the main bottleneck is access to the gxact, but on an ordinary host with one or two
CPUs that access is not even among the top ten heaviest routines.

> On 10 Dec 2015, at 09:48, Jeff Janes  wrote:

> Why are you incrementing :scale ?


That’s a funny part: overall 2PC speed depends on how you name your prepared
transactions. Concretely, I tried using random numbers for GIDs and it was slower than a
constantly incrementing GID. Probably that happens due to the linear search by GID in the
gxact array on commit. So I used :scale just as a counter, because it is initialised on
pgbench start and a line like “\set scale :scale+1” works well. (Maybe there is a
different way to do it in pgbench.)

> I very rapidly reach a point where most of the updates are against
> tuples that don't exist, and then get integer overflow problems.

Hmm, that’s strange. Probably you set scale to a big value, so that 10*:scale is bigger
than int4? But I thought that pgbench would change the aid columns to bigint if scale is
more than 2.



2pc_xlog.v2.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension kNN support

2015-12-16 Thread Stas Kelvich
Hi, thanks for the review.

> 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
>   trailing spaces, empty lines being added/removed, etc.


Fixed, I think

> 2) one of the regression tests started to fail
> 
>   SELECT '-1e-700'::cube AS cube;
> 
>   This used to return (0) but now I get (-0).

Actually that problem emerged because of the first problem. I had extra whitespace in the
sql file and removed that whitespace from one of the answer files (cube_1.sql), so the
diff against both cube.sql and cube_1.sql was one line long and you saw the diff against
cube.sql.
On all systems available to me (osx/linux/freebsd) I saw that the right answers file is
cube_1.sql. But on other OSes you can get +/- 0 or e27/e027. I edited those answer files
manually, so there can probably be some other typos.

> 3) I wonder why the docs were changed like this:
> 
>   
> -   It does not matter which order the opposite corners of a cube are
> -   entered in.  The cube functions
> -   automatically swap values if needed to create a uniform
> -   lower left — upper right internal representation.
> +   When corners coincide cube stores only one corner along with a
>special flag in order to reduce size wasted.
>   
> 
>   Was the old behavior removed? I don't think so - it seems to behave
>   as before, so why to remove this information? Maybe it's not useful?
>   But then why add the bit about optimizing storage of points?

I’ve edited it because the statement was misleading (or at least ambiguous) — the cube_in
function doesn’t swap coordinates.
A simple way to see it:
> select '(1,3),(3,1)'::cube;
 cube  
---
 (1, 3),(3, 1)

But the LowerLeft-UpperRight representation should be (1,1),(3,3)


Updated patch attached.




cube_distances.patch
Description: Binary data



> On 15 Dec 2015, at 21:46, Tomas Vondra  wrote:
> 
> Hi,
> 
> On 12/07/2015 03:47 PM, Stas Kelvich wrote:
>> Hello, fixed.
> 
> I've looked at the patch today, seems mostly fine to me.
> 
> Three comments though:
> 
> 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
>   trailing spaces, empty lines being added/removed, etc.
> 
> 2) one of the regression tests started to fail
> 
>   SELECT '-1e-700'::cube AS cube;
> 
>   This used to return (0) but now I get (-0). As this query existed in
>   1.0, it's probably due to change in the C code. Now sure where.
> 
> 3) I wonder why the docs were changed like this:
> 
>   
> -   It does not matter which order the opposite corners of a cube are
> -   entered in.  The cube functions
> -   automatically swap values if needed to create a uniform
> -   lower left — upper right internal representation.
> +   When corners coincide cube stores only one corner along with a
>special flag in order to reduce size wasted.
>   
> 
>   Was the old behavior removed? I don't think so - it seems to behave
>   as before, so why to remove this information? Maybe it's not useful?
>   But then why add the bit about optimizing storage of points?
> 
> regards
> 
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension kNN support

2015-12-16 Thread Stas Kelvich
> I don't think that's what the comment says, actually. It rather refers to 
> code like this:
> 
>result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1));
> 
> i.e. if you specifically ask for a particular corner (ll, in this case), 
> you'll get the proper value.

Hmm, I was confused by the phrase “create a uniform _internal_ representation”, since
internally the cube is actually stored “as is”. But probably I just misinterpreted that.
So here is the updated version with the old documentation restored.



cube_distances.patch
Description: Binary data



> On 16 Dec 2015, at 16:46, Tomas Vondra  wrote:
> 
> Hi,
> 
> On 12/16/2015 01:26 PM, Stas Kelvich wrote:
>> Hi, thanks for the review.
>> 
>>> 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
>>>   trailing spaces, empty lines being added/removed, etc.
>> 
>> 
>> Fixed, I think
>> 
>>> 2) one of the regression tests started to fail
>>> 
>>>   SELECT '-1e-700'::cube AS cube;
>>> 
>>>   This used to return (0) but now I get (-0).
>> 
>> Actually that problem emerged because of the first problem. I had
> extra whitespace in sql file and removed that whitespace from one of the
> answers file (cube_1.sql), so diff with both cube.sql and cube_1.sql was
> one line length and you saw diff with cube.sql.
>> In all systems that available to me (osx/linux/freebsd) I saw that
> right answers file is cube_1.sql. But in other OS’es you can get +/- 0
> or e27/e027. I edited that answers files manually, so there probably can
> be some other typos.
> 
> Ah! So that's why I couldn't quickly find the issue in the C code ...
> 
>> 
>>> 3) I wonder why the docs were changed like this:
>>> 
>>>   
>>> -   It does not matter which order the opposite corners of a cube are
>>> -   entered in.  The cube functions
>>> -   automatically swap values if needed to create a uniform
>>> -   lower left — upper right internal representation.
>>> +   When corners coincide cube stores only one corner along with a
>>>special flag in order to reduce size wasted.
>>>   
>>> 
>>>   Was the old behavior removed? I don't think so - it seems to behave
>>>   as before, so why to remove this information? Maybe it's not useful?
>>>   But then why add the bit about optimizing storage of points?
>> 
>> I’ve edited it because the statement was mislead (or at least ambiguous) — 
>> cube_in function doesn’t swap coordinates.
>> Simple way to see it:
>>> select '(1,3),(3,1)'::cube;
>>  cube
>> ---
>>  (1, 3),(3, 1)
>> 
>> But LowerLeft-UpperRight representation should be (1,1),(3,3)
> 
> I don't think that's what the comment says, actually. It rather refers to 
> code like this:
> 
>result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1));
> 
> i.e. if you specifically ask for a particular corner (ll, in this case), 
> you'll get the proper value.
> 
> regards
> 
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-12-30 Thread Stas Kelvich
Hi, Tomáš! Thanks for the comprehensive review.

> On 15 Dec 2015, at 06:07, Tomas Vondra  wrote:
> 
> 1) It's a bit difficult to judge the usefulness of the API, as I've
>   always been a mere user of full-text search, and I never had a need
>   (or courage) to mess with the tsvectors. OTOH I don't see a good
>   reason no to have such API, when there's a need for it.
> 
>   The API seems to be reasonably complete, with one exception - when
>   looking at editing function we have for 'hstore', we do have these
>   variants for delete()
> 
>  delete(hstore,text)
>  delete(hstore,text[])
>  delete(hstore,hstore)
> 
>   while this patch only adds delete(tsvector,text). Would it make
>   sense to add variants similar to hstore? It probably does not make
>   much sense to add delete(tsvector,tsvector), right? But being able
>   to delete a bunch of lexemes in one go seems like a good thing.
> 
>   What do you think?

That’s a good idea, and actually deleting a tsvector from a tsvector makes perfect sense.
In the delete function I used an exact string match between the string and the lexemes in
the tsvector, but if somebody wants to delete, for example, “Cats” from a tsvector, then
they should downcase and singularize this word. The easiest way to do that is to just use
the to_tsvector() function. We can also use this function to delete specific positions,
like delete('cat:3 fat:2,4'::tsvector, 'fat:2'::tsvector) -> 'cat:3 fat:4'::tsvector.

So in the attached patch I’ve implemented the following:

delete(tsin tsvector, lexarr text[]) — remove any occurrence of the lexemes in lexarr
from tsin

delete(tsin tsvector, tsv_filter tsvector) — delete lexemes and/or positions of
tsv_filter from tsin. When a lexeme in tsv_filter has no positions, the function will
delete any occurrence of the same lexeme in tsin. When a tsv_filter lexeme has positions,
the function will delete them from the positions of the matching lexeme in tsin. If after
such removal the resulting position set is empty, then the function will delete that
lexeme from the resulting tsvector.

Also, if we want some level of completeness of the API, and taking into account that the
concat() function shifts the positions of the second argument, I thought it could be
useful to also add a function that shifts all positions by a specific value. This helps
to undo concatenation: delete one of the concatenated tsvectors and then shift the
positions in the resulting tsvector. So I also wrote another small function:

shift(tsin tsvector, offset int16) — shift all positions in tsin by the given offset
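
A short usage sketch of these two additions, assuming this version of the patch (shift()
in particular is only a proposal here):

SELECT delete('cat:3 fat:2,4 rat:5'::tsvector, ARRAY['fat', 'rat']);
SELECT shift('cat:3 fat:2,4'::tsvector, 10);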

> 
> 
> 2) tsvector_op.c needs a bit of love, to eliminate the two warnings it
>   currently triggers:
> 
>tsvector_op.c:211:2: warning: ISO C90 forbids mixed ...
>tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but …
> 

fixed

> 3) the patch also touches tsvector_setweight(), only to do change:
> 
>  elog(ERROR, "unrecognized weight: %d", cw);
> 
>   to
> 
>  elog(ERROR, "unrecognized weight: %c", cw);
> 
>   That should probably go get committed separately, as a bugfix.
> 

Okay, I’ll submit that as a separate patch.

> 
> 4) I find it rather annoying that there are pretty much no comments in
>   the code. Granted, there are pretty much no comments in the
>   surrounding code, but I doubt that's a good reason for not having
>   any comments in new code. It makes reviews unnecessarily difficult.
> 

Fixed, I think.

> 
> 5) tsvector_concat() is not mentioned in docs at all
> 

Concat is mentioned in the docs as the || operator.

> 
> 6) Docs don't mention names of the new parameters in function
>   signatures, just data types. The functions with a single parameter
>   probably don't need to do that, but multi-parameter ones should.
> 

Fixed.

> 7) Some of the functions use intexterm that does not match the function
>   name. I see two such cases - to_tsvector and setweight. Is there a
>   reason for that?
> 

Because the sgml compiler wants a unique indexterm. Both functions that you mentioned use
argument overloading and so have non-unique names.




tsvector_ops.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-01-09 Thread Stas Kelvich
Hi.

I’ve updated the patch and written a description of the things that happen with 2PC state
data at the beginning of the file. I think this patch is now well documented, but if
somebody points me to places that require a more detailed description I’m ready to extend
it.



2pc_xlog.diff
Description: Binary data


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 08 Jan 2016, at 19:29, Alvaro Herrera  wrote:
> 
> Simon Riggs wrote:
> 
>> I think we could do better still, but this looks like the easiest 80% and
>> actually removes code.
>> 
>> The lack of substantial comments on the patch is a problem though - the
>> details above should go in the patch. I'll have a go at reworking this for
>> you, this time.
> 
> Is someone submitting an updated patch here?
> 
> -- 
> Álvaro Herrerahttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-01-09 Thread Stas Kelvich
Thanks a lot for your edits, now the patch is much cleaner.

> Your comments say
> 
>   "In case of crash replay will move data from xlog to files, if that hasn't 
> happened before."
> 
> but I don't see that in code. Can you show me where that happens? 

xact.c calls RecreateTwoPhaseFile in xact_redo() function (xact.c:5596)

> On 09 Jan 2016, at 18:29, Simon Riggs  wrote:
> 
> Hmm, I was just preparing this for commit.
> 
> Please have a look at my mild edits and extended comments. 


One concern that came to my mind while reading the updated patch is about adding an extra
bool field to the GlobalTransactionData structure. While this improves readability, it
also increases the size of that structure, and that size has an impact on performance on
systems with many cores (say 60-80). Probably one byte will not make a measurable
difference, but I think it is a good idea to keep GXact as small as possible. As far as I
understand, the same logic was behind the split of PGPROC into PGPROC+PGXACT in 9.2
(comment in proc.h:166).

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Stas Kelvich

> On 10 Jan 2016, at 12:15, Simon Riggs  wrote:
> 
> So we've only optimized half the usage? We're still going to cause 
> replication delays.

Yes, the replica will go through the old procedure of moving data to and from files.

> We can either
> 
> 1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during restartpoints

From what I’ve seen with the old 2PC code, the main performance bottleneck was caused by
the frequent creation of files. So it is better to avoid files if possible.

> 
> 2) Copy the contents to shmem and then write them at restartpoint as we do 
> for checkpoint
> (preferred)

The problem with shared memory is that we can’t really predict the size of the state
data, and anyway it isn’t faster than reading the data from WAL (I tested that while
preparing the original patch).

We can just apply the same logic on the replica as on the master: do not do anything
special on prepare, and just read that data from WAL.
If a checkpoint occurs during recovery/replay, the existing code will probably handle
moving the data to files.

I will update patch to address this issue.

> I think padding will negate the effects of the additional bool.
> 
> If we want to reduce the size of the array GIDSIZE is currently 200, but XA 
> says maximum 128 bytes.
> 
> Anybody know why that is set to 200?

Good catch about GID size.

If we talk about further optimisations I see two ways:

1) Optimising access to GXACT. Here we can try to shrink it; introduce more granular
locks, e.g. move the GIDs out of GXACT and lock the GID array only once while checking a
new GID's uniqueness; try to lock only part of GXACT by hash; etc.

2) Be optimistic about the subsequent COMMIT PREPARED. In a normal workload the next
command after PREPARE will be COMMIT/ROLLBACK, so we can save the transaction context and
release it only if the next command isn’t our designated COMMIT/ROLLBACK. But that is a
big amount of work and requires changes to the whole transaction pipeline in postgres.

Anyway, I suggest that we consider that as a separate task.

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Stas Kelvich
> 
> On 11 Jan 2016, at 21:40, Jesper Pedersen  wrote:
> 
> I have done a run with the patch and it looks really great.
> 
> Attached is the TPS graph - with a 1pc run too - and the perf profile as a 
> flame graph (28C/56T w/ 256Gb mem, 2 x RAID10 SSD).
> 

Thanks for testing and especially for the flame graph. That is somewhere in between the
cases that I have tested. On a commodity server with dual Xeons (6C each) 2PC speed is
about 80% of 1PC speed, but on a 60C/120T system the patch didn’t make a significant
difference, because the main bottleneck changes from file access to locks on the array of
running global transactions.

How did you generate the names for your PREPAREs? One funny thing that I’ve spotted is
that the tx rate increased when I used an incrementing counter as the GID instead of a
random string.

And can you also share the flame graph for the 1PC workload?

> 
> On 11 Jan 2016, at 21:43, Simon Riggs  wrote:
> 
> Have you measured lwlocking as a problem?
> 


Yes. GXACT locks, which weren't even in the perf top 10 on a dual Xeon, move to the first
places when running on a 60-core system. But Jesper’s flame graph on a 24-core system
shows a different picture.

> On 12 Jan 2016, at 01:24, Andres Freund  wrote:
> 
> Currently recovery of 2pc often already is a bigger bottleneck than the 
> workload on the master, because replay has to execute the fsyncs implied by 
> statefile  re-creation serially, whereas on the master they'll usually be 
> executed in parallel.

That’s an interesting observation. Simon already pointed me to this problem in 2PC
replay, but I didn’t think it was so slow. I’m now working on that.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Cube extension improvement, GSoC

2013-03-22 Thread Stas Kelvich
Hello,

some time ago I started working on a data search system (about 100-200M records) with
queries consisting of several range and equality conditions, e.g.:

  WHERE dim1 BETWEEN 128 AND 137
    AND dim2 BETWEEN 4815 AND 162342
    AND dim3 = 42
  ORDER BY dim1 ASC

There are 6 or 7 search criteria in my task. In order to avoid a full table scan I
started using an R-Tree over the cube data type:

  CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))

For fast sorting I used Alexander Korotkov's patch for knngist
(http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html), which changes
the metric for nearest-neighbor search and allows obtaining cubes ordered by a specific
coordinate. Having changed some data types and function/operator definitions I ported
this to 9.2 (https://github.com/kelvich/postgres/commit/bb372). After this I could select
ordered records right from the index:

  SELECT * FROM my_table
  WHERE cube(array[dim1, dim2, dim3]) <@ cube '(128,4815,42),(137,162342,42)'
  ORDER BY cube(array[dim1, dim2, dim3]) <*> 5;

The main issue with such an approach is the index size. In my case it was about 100GB for
100M records. Therefore index building becomes a very expensive disk-bound operation. For
the same reason reading a large number of records from the index is slow too.

I came to Oleg Bartunov, Theodor Sigaev and, after a while, to Alexander Korotkov for
help. (I'm very thankful to them and glad that they agreed to meet, listen to me and give
useful advice.) Having discussed it, we decided that there were several possible
improvements for the cube extension:

  * Adding point data type support to the cube extension in order to avoid storing
coincident upper-left and lower-right vertices, which may almost halve the space
that leaf nodes occupy (see the sketch just after this list).
  * Checking the split algorithm with big datasets and, if possible, improving it.
  * Teaching the cube extension to store dimensions with different data types. Such
an index would be a good alternative to a compound-key B-Tree multi-column index
(suitable for range queries and data ordering).
  * Providing support for kNN with metrics induced by different norms: euclidean,
taxicab, p-norm. This can be useful in fields where we can extract a signature:
similar-image search, similar-audio search.
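
To illustrate the first point: a cube built from a single array is already a
zero-volume "point", but both corners are stored on disk anyway, so a dedicated
point representation could drop the duplicated half. A tiny sketch of the
redundancy (just an illustration, not the proposed implementation):

  CREATE EXTENSION IF NOT EXISTS cube;
  SELECT cube(array[1,2,3]);                          -- prints as a point: (1, 2, 3)
  SELECT cube_ll_coord(cube(array[1,2,3]), 1) AS ll,
         cube_ur_coord(cube(array[1,2,3]), 1) AS ur;  -- both corners hold the same value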

I'd like to participate in GSoC with these improvements, and I'm very interested in
any comments or suggestions about this feature list.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension improvement, GSoC

2013-05-04 Thread Stas Kelvich
Hello.

>* Learning cube extension to store dimensions with different data types. 
> Such index would be good alternative to compound key B-Tree multi-index 
> (suitable for diapason queries and data ordering).
> 
> You mean, a cube containing something else than floats? I don't think we want 
> to explode the number of datatypes by doing that, casting between them could 
> be problematic.
>  
> At least option for having float4 cube instead of foat8 cube seems reasonable 
> for me, because of space economy payed by less accuracy.

The main idea was to reduce the size of the index when it is possible. This can be
very important when we have many dimensions with a low number of distinct values.

> But I wonder if you could add cube-like operators for arrays. We already have 
> support for arrays of any datatype, and any number of dimensions. That seems 
> awfully lot like what the cube extension does.

All the cube stuff assumes a fixed number of dimensions, which is not very useful
for arrays, where we can easily change the number of elements. One can define an
index on an expression over an array, i.e. CREATE INDEX ON table USING
GIST(cube(array[a,b,c])), and cube-like operations will work. But that only works
when we have a fixed-size array. And again, if the array elements are smallints
then such a cube uses 8 times more space than is actually needed — four times from
converting to float8 and two times from storing coincident cube bounds.
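
A quick way to see that per-value blow-up on any build (just a sketch; I'm
deliberately not quoting exact byte counts here, since they depend on version and
alignment):

  SELECT pg_column_size(array[1,2,3]::smallint[]) AS smallint_array_bytes,
         pg_column_size(cube(array[1,2,3]))       AS cube_bytes;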

> I think we have at least 3 data types more or less similar to cube.
> 1) array of ranges
> 2) range of arrays
> 3) 2d arrays
> Semantically cube is most close to array or ranges. However array of ranges 
> have huge storage overhead. 
> Also we can declare cube as domain on 2d arrays and declare operations of 
> that domain.

But what should we do when arrays in different records have different numbers of
elements?

Stas Kelvich.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension improvement, GSoC

2013-05-14 Thread Stas Kelvich
Hi.

Thanks, Heikki, for the answer on google-melange. For some reason I didn't receive
an email notification, so I only saw this answer today.

> Do you have access to a server you can use to perform those tests? (...)

Yes, I have. I maintain an MPI cluster at my university, so that is not a problem.
Actually the tests for this proposal were made on servers from that cluster. But,
anyway, thanks for offering help.

There is an open question about supporting different data types in the cube
extension. As I understand it, we have the following ideas:

* Add cube-like operators for arrays. We already have support for arrays of any
datatype, and any number of dimensions.
        If we want to use tree-like data structures for these operators we will run
into the same problems with trees and types. And we can always cast an array to a
cube and use the existing operators. Or maybe I misunderstand this.

* Create support for storing cube coordinates with different data types
(2-, 4-, 8-byte integers; 4-, 8-byte floats).
        The main goal of this is reducing the index size, so in order not to break
a big amount of code we can store data on disk according to the data type size
(i.e. | smallint | real | real | double | double |), and when we load data from
disk or cache we can cast it to float8 so the existing code keeps working. To
achieve this behaviour two steps should be performed:
        1) Store information about the coordinate types when the index is created.
A good question is where to store this data structure, but I believe it can be done.
        2) Change the functions that read and write data to disk, so they can cast
to/from float8 using the data from the previous clause.

* Don't add type support to cube at all.
        Eventually, there are different ways of reducing R-Tree size. For example
we can store relative coordinates with a dynamic-size MBR (VRMBR) instead of
absolute coordinates with a fixed-size MBR. There is some evidence that this can
significantly reduce the size:
http://link.springer.com/chapter/10.1007/11427865_13

On May 8, 2013, at 2:35 PM, Alexander Korotkov wrote:

> On Sat, May 4, 2013 at 11:19 PM, Stas Kelvich  wrote:
> > I think we have at least 3 data types more or less similar to cube.
> > 1) array of ranges
> > 2) range of arrays
> > 3) 2d arrays
> > Semantically cube is most close to array or ranges. However array of ranges 
> > have huge storage overhead.
> > Also we can declare cube as domain on 2d arrays and declare operations of 
> > that domain.
> 
> But what we should do when arrays in different records have different numbers 
> of element?
> 
> We can be faced with absolutely same situation with cube.
> 
> test=# create table cube_test (v cube);
> CREATE TABLE
> 
> test=# insert into cube_test values (cube(array[1,2])), (cube(array[1,2,3]));
> INSERT 0 2
> 
> In order to force all cubes to have same number of dimensions excplicit CHECK 
> on table is required.
> As I remember cube treats absent dimensions as zeros.
> 
> --
> With best regards,
> Alexander Korotkov.
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension improvement, GSoC

2013-05-14 Thread Stas Kelvich
On May 14, 2013, at 4:53 PM, Alexander Korotkov wrote:

> Sounds promising. Did you examine how this technique can fit into GiST? In 
> current GiST interface methods don't have access to parent entries.

No, I haven't examined it yet. Anyway, this technique would require a lot of
changes to both the cube GiST methods and the storage methods. We can start by
hacking on datatype support, which only affects the storage methods, and after that
begin investigating VRMBR support.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube extension kNN support

2013-12-03 Thread Stas Kelvich
Hi.

> cube and earthdistance regression tests fail.


Code updated to work with current HEAD. Also added tests to cover new 
functionality.

> Do you have any benchmarks ?

  This patch just introduces the functionality of calculating distances between
cubes, so this code doesn't interfere much with kNN search speed. I think it's
better to publish such benchmarks in the neighbouring patch about the split
algorithm.
  Anyway, we can compare kNN with a b-tree and a full scan:

create table test(a1 float, a2 float, a3 float);
insert into test (select 100*random(), 100*random(), 100*random() from 
generate_series(1,100) as s(a));
create index on test using gist(cube(array[a1,a2,a3]));
select * from test order by a1 limit 15; -- 227.658 ms
select * from test order by cube(array[a1,a2,a3])->1 limit 15; -- 1.275 ms
create index on test(a1);
select * from test order by a1 limit 15; -- 0.103 ms

As we can see, kNN ordering is 10 times slower than a B-tree (on a silly request
for an R-Tree, just as an example), but still 100+ times faster than a full scan on
this table.

Stas.

On Sep 25, 2013, at 5:25 PM, Peter Eisentraut  wrote:

> On 9/22/13 7:38 PM, Stas Kelvich wrote:
>> Here is the patch that introduces kNN search for cubes with euclidean, 
>> taxicab and chebyshev distances.
> 
> cube and earthdistance regression tests fail.



distances.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2017-01-24 Thread Stas Kelvich

> On 24 Jan 2017, at 09:42, Michael Paquier  wrote:
> 
> On Mon, Jan 23, 2017 at 9:00 PM, Nikhil Sontakke
>  wrote:
>> Speeding up recovery or failover activity via a faster promote is a
>> desirable thing. So, maybe, we should look at teaching the relevant
>> code about using "KnownPreparedList"? I know that would increase the
>> size of this patch and would mean more testing, but this seems to be
>> last remaining optimization in this code path.
> 
> That's a good idea, worth having in this patch. Actually we may not
> want to call KnownPreparedRecreateFiles() here as promotion is not
> synonym of end-of-recovery checkpoint for a couple of releases now.

Thanks for the review, Nikhil and Michael.

I don’t follow here. We are moving data away from WAL to files on checkpoint
because after a checkpoint there is no guarantee that the WAL segment with our
prepared tx will still be available.

> The difference between those two is likely noise.
> 
> By the way, in those measurements, the OS cache is still filled with
> the past WAL segments, which is a rather best case, no? What happens
> if you do the same kind of tests on a box where memory is busy doing
> something else and replayed WAL segments get evicted from the OS cache
> more aggressively once the startup process switches to a new segment?
> This could be tested for example on a VM with few memory (say 386MB or
> less) so as the startup process needs to access again the past WAL
> segments to recover the 2PC information it needs to get them back
> directly from disk... One trick that you could use here would be to
> tweak the startup process so as it drops the OS cache once a segment
> is finished replaying, and see the effects of an aggressive OS cache
> eviction. This patch is showing really nice improvements with the OS
> cache backing up the data, still it would make sense to test things
> with a worse test case and see if things could be done better. The
> startup process now only reads records sequentially, not randomly
> which is a concept that this patch introduces.
> 
> Anyway, perhaps this does not matter much, the non-recovery code path
> does the same thing as this patch, and the improvement is too much to
> be ignored. So for consistency's sake we could go with the approach
> proposed which has the advantage to not put any restriction on the
> size of the 2PC file contrary to what an implementation saving the
> contents of the 2PC files into memory would need to do.

Maybe I’m missing something, but I don’t see how the OS cache can affect anything
here.

Total WAL size was 0x44 * 16 = 1088 MB, and recovery time is about 20s.
Sequentially reading 1GB of data is an order of magnitude faster even on an old
hdd, not to speak of an ssd. Also you can take a look at the flame graphs attached
to the previous message — the majority of the time during recovery is spent in
pg_qsort while replaying PageRepairFragmentation, while the whole
xact_redo_commit() takes about 1% of the time. That amount can grow in the case of
uncached disk reads, but taking the total recovery time into account this should
not matter much.

If you are talking about uncached access only during checkpoint, then here we are
restricted by max_prepared_transactions, so at most we will read about a hundred
small files (usually fitting into one filesystem page), which will also be barely
noticeable compared to the recovery time between checkpoints. Also WAL segment
cache eviction during replay doesn't seem to me like a standard scenario.

Anyway I took a machine with an hdd to slow down read speed and ran the tests
again. During one of the runs I launched in parallel a bash loop that was dropping
the OS cache each second (while replay of one WAL segment also takes about one
second).

1.5M transactions
  start segment: 0x06
  last segment: 0x47

patched, with constant cache_drop:
  total recovery time: 86s

patched, without constant cache_drop:
  total recovery time: 68s

(while the difference is significant, I bet that happens mostly because the
database file segments have to be re-read after each cache drop)

master, without constant cache_drop:
  time to recover 35 segments: 2h 25m (after that I got tired of waiting)
  expected total recovery time: 4.5 hours

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-01-25 Thread Stas Kelvich
>> 
>> Yes, that’s also possible but seems to be less flexible restricting us to 
>> some
>> specific GID format.
>> 
>> Anyway, I can measure WAL space overhead introduced by the GID’s inside 
>> commit records
>> to know exactly what will be the cost of such approach.
> 
> Stas,
> 
> Have you had a chance to look at this further?

Generally I’m okay with Simon’s approach and will send an updated patch. Anyway I
want to perform some tests to estimate how much disk space is actually wasted by
the extra WAL records.

> I think the approach of storing just the xid and fetching the GID
> during logical decoding of the PREPARE TRANSACTION is probably the
> best way forward, per my prior mail.

I don’t think that’s possible this way. If we don’t put the GID in the commit
record, then by the time logical decoding happens the transaction will already be
committed/aborted and there will be no easy way to get that GID. I thought about
several possibilities:

* Tracking an xid/gid map in memory also doesn’t help much — if the server reboots
between prepare and commit we’ll lose that mapping.
* We can provide some hooks on prepared tx recovery during startup, but that
approach also fails if the reboot happened between commit and decoding of that
commit.
* Logical messages are WAL-logged, but they don’t have any redo function, so they
don’t help much.

So to support user-accessible 2PC over replication based on 2PC decoding we would
have to invent something nastier, like writing the mappings into a table.
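
To make that last option concrete, this is roughly what I mean, nothing more than a
sketch with made-up table and column names (the xid is kept as a bigint because the
xid type has no btree operator class for a primary key):

  CREATE TABLE twophase_gid_map (
      local_xid bigint PRIMARY KEY,  -- xid of the prepared tx, inserted at PREPARE time
      gid       text   NOT NULL      -- user-supplied GID from PREPARE TRANSACTION
  );
  -- the decoding plugin would look the GID up here when it sees the commit/abort record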

> That should eliminate Simon's
> objection re the cost of tracking GIDs and still let us have access to
> them when we want them, which is the best of both worlds really.

Having 2PC decoding in core is a good thing anyway even without GID tracking =)

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2017-01-26 Thread Stas Kelvich

> On 26 Jan 2017, at 10:34, Michael Paquier  wrote:
> 
> On Thu, Jan 26, 2017 at 4:09 PM, Nikhil Sontakke
>  wrote:
>>> I look at this patch from you and that's present for me:
>>> https://www.postgresql.org/message-id/CAMGcDxf8Bn9ZPBBJZba9wiyQq->Qk5uqq=vjomnrnw5s+fks...@mail.gmail.com
>> 
>>> --- a/src/backend/access/transam/xlog.c
>>> +++ b/src/backend/access/transam/xlog.c
>>> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
>>> (errmsg("unexpected timeline ID %u (should be %u)
>>> in checkpoint record",
>>> checkPoint.ThisTimeLineID, ThisTimeLineID)));
>>> 
>>> +KnownPreparedRecreateFiles(checkPoint.redo);
>>> RecoveryRestartPoint(&checkPoint);
>>> }
>> 
>> Oh, sorry. I was asking about CheckpointTwoPhase(). I don't see a
>> function by this name. And now I see, the name is CheckPointTwoPhase()
>> :-)
> 
> My mistake then :D
> 
>>> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
>>> files are not flushed to disk with this patch. This is a problem as a
>>> new restart point is created... Having the flush in CheckpointTwoPhase
>>> really makes the most sense.
>> 
>> Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
>> promote" code path.
> 
> CreateRestartPoint() calls it via CheckPointGuts() while in recovery.
> 

Huh, glad that this thread received a lot of attention.

> On 24 Jan 2017, at 17:26, Nikhil Sontakke  wrote:
> 
> We are talking about the recovery/promote code path. Specifically this
> call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().
> 
> We write the files to disk and they get immediately read up in the
> following code. We could not write the files to disk and read
> KnownPreparedList in the code path that follows as well as elsewhere.

Thanks Nikhil, now I get it. Since we are talking about promotion we are on a
different timescale, and a 1-10 second lag matters a lot.

I think I have in mind a realistic scenario where the proposed recovery code path
will hit the worst case: Google Cloud. They have quite fast storage, but fsync time
is really big and can go up to 10-100ms (I suppose the storage is network-attached).
Having, say, 300 prepared tx, we can delay promotion by up to half a minute
(300 fsyncs × 100 ms ≈ 30 s).

So I think it is worth examining.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-01-26 Thread Stas Kelvich

> On 26 Jan 2017, at 12:51, Craig Ringer  wrote:
> 
> * Tracking xid/gid map in memory also doesn’t help much — if server reboots 
> between prepare
> and commit we’ll lose that mapping.
> 
> Er what? That's why I suggested using the prepared xacts shmem state. It's 
> persistent as you know from your work on prepared transaction files. It has 
> all the required info.

Imagine the following scenario:

1. PREPARE happens
2. The PREPARE is decoded and sent where it should be sent
3. We get all responses from the participating nodes and issue COMMIT/ABORT
4. The COMMIT/ABORT is decoded and sent

After step 3 there is no memory state left associated with that prepared tx, so if
we fail between 3 and 4 then we can’t know the GID unless we wrote it into the
commit record (or a table).

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-02-09 Thread Stas Kelvich

> On 31 Jan 2017, at 12:22, Craig Ringer  wrote:
> 
> Personally I don't think lack of access to the GID justifies blocking 2PC 
> logical decoding. It can be added separately. But it'd be nice to have 
> especially if it's cheap.

Agreed.

> On 2 Feb 2017, at 00:35, Craig Ringer  wrote:
> 
> Stas was concerned about what happens in logical decoding if we crash between 
> PREPSRE TRANSACTION and COMMIT PREPARED. But we'll always go back and decode 
> the whole txn again anyway so it doesn't matter.

Not exactly. It seems that in previous discussions we were not on the same page,
probably due to unclear arguments on my part.

From my point of view there are no problems (or at least no new problems compared
to ordinary 2PC) with preparing transactions on slave servers with something like
"#{xid}#{node_id}" instead of the GID, if the issuing node is the coordinator of
that transaction. In case of failure, restart or crash we have the same options for
deciding what to do with uncommitted transactions.

My concern is about the situation with an external coordinator. That scenario is
quite important for users of postgres native 2pc, notably J2EE users. Suppose a
user (or his framework) issues "prepare transaction 'mytxname';" to servers with
ordinary synchronous physical replication. If the master crashes and a replica is
promoted, then the user can reconnect to it and commit/abort that transaction using
his GID. And it is unclear to me how to achieve the same behaviour with logical
replication of 2pc without the GID in the commit record. If we prepare with
"#{xid}#{node_id}" on the acceptor nodes, then if the donor node crashes we'll lose
the mapping between the user's gid and our internal gid; conversely, we can prepare
with the user's GID on the acceptors, but then we will not know that GID on the
donor during commit decode (by the time decoding happens all the memory state is
already gone and we can't exchange our xid for the gid).

I performed some tests to understand the real impact on WAL size. I've compared
postgres master with wal_level = logical, after 3M 2PC transactions, against
patched postgres where GIDs are stored inside the commit record too. Testing with
194-byte and 6-byte GIDs (the GID max size is 200 bytes).

-master, 6-byte GIDs, after 3M transactions: pg_current_xlog_location = 0/9572CB28
-patched, 6-byte GIDs, after 3M transactions: pg_current_xlog_location = 0/96C442E0

so with 6-byte GIDs the difference in WAL size is less than 1%

-master, 194-byte GIDs, after 3M transactions: pg_current_xlog_location = 0/B7501578
-patched, 194-byte GIDs, after 3M transactions: pg_current_xlog_location = 0/D8B43E28

and with 194-byte GIDs the difference in WAL size is about 18%

So using big GIDs (as J2EE does) can cause notable WAL bloat, while small GIDs are
almost unnoticeable.
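
For anyone who wants to redo the arithmetic, the byte difference falls straight out
of those LSNs (assuming both runs started from comparable positions; the function
name below is the pre-10 one matching the server used here):

  SELECT pg_xlog_location_diff('0/96C442E0', '0/9572CB28') AS extra_bytes_6b_gid,
         pg_xlog_location_diff('0/D8B43E28', '0/B7501578') AS extra_bytes_194b_gid;
  -- roughly 22 MB extra over ~2.5 GB of WAL, and roughly 560 MB extra over ~3.1 GB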

Maybe we can introduce a configuration option track_commit_gid, by analogy with
track_commit_timestamp, and make that behaviour optional? Any objections to that?

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

