[HACKERS] transam README small fix
Hi,

The transaction function call sequence description in transam/README is slightly outdated. SELECT is now handled by PortalRunSelect instead of ProcessQuery. It is also hard to follow what the indentation there means — sometimes it means “function called by function”, sometimes it doesn't. So I’ve also changed it to reflect the actual call nesting.

transam.readme.patch
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] transam README small fix
Thanks.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

> On 04 Mar 2016, at 22:14, Robert Haas wrote:
>
> On Tue, Mar 1, 2016 at 4:31 AM, Stas Kelvich wrote:
>> Transaction function call sequence description in transam/README is slightly
>> outdated. Select now handled by PortalRunSelect instead of ProcessQuery. It
>> is also hard to follow what tabulation there means — sometimes that means
>> “function called by function”, sometimes it isn't. So I’ve also changed it
>> to actual call nesting.
>
> After some study, this looks good to me, so committed.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
[HACKERS] 2PC support for pglogical
Hi,

Here is a proof-of-concept version of two-phase commit support for logical replication. There are some changes in core postgres, pglogical_output and the pglogical extension. I’ve used the version from the 2ndQuadrant repo, pglogical branch, and rebased it onto current HEAD.

Some notes about this patch:

* LWLockAssign() was deleted, so I changed that to the new lock tranche API.

* It seems the only reliable way to get the GID during replay of COMMIT/ROLLBACK PREPARED is to force postgres to write the GID into the corresponding records; otherwise we can lose the correspondence between xid and gid if we are replaying data from the WAL sender after the transaction was committed some time ago. So I’ve changed postgres to write GIDs not only on PREPARE, but also on COMMIT/ROLLBACK PREPARED. That should probably be done only at the logical WAL level, but for now I just want to hear some other opinions on that.

* The ABORT PREPARED xlog record also lacks database information. Normally logical decoding just cleans the reorder buffer when facing an abort, but in the 2PC case we should send it to the callbacks anyway. So I’ve added that info to abort records.

* PREPARE emits an xlog record with a TwoPhaseFileHeader in it, and that structure is the same as xl_xact_parsed_commit but with some fields renamed, probably just for historical reasons. It would be possible to change PREPARE to write an ordinary commit record with a flag and then use the same infrastructure to parse it, so that DecodePrepare/DecodeCommit could be done by the same function.

pglogical_twophase.diff
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
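The xid/gid correspondence problem from the second bullet can be sketched in a few lines. This is an illustrative model only — the record layout and field names are invented, not the actual WAL format: a decoder that starts after the PREPARE record cannot recover the GID at COMMIT PREPARED time unless the commit record itself carries it.

```python
# Illustrative sketch (invented record layout, not the real WAL format):
# why COMMIT PREPARED records need to carry the GID themselves.
def decode(records, start_index):
    xid_to_gid = {}          # built only from records we actually see
    out = []
    for rec in records[start_index:]:
        if rec["type"] == "PREPARE":
            xid_to_gid[rec["xid"]] = rec["gid"]
        elif rec["type"] == "COMMIT_PREPARED":
            # Without a gid field in the commit record we depend on having
            # decoded the matching PREPARE, which may predate our start point.
            gid = rec.get("gid") or xid_to_gid.get(rec["xid"])
            out.append(("commit_prepared", rec["xid"], gid))
    return out

wal = [
    {"type": "PREPARE", "xid": 100, "gid": "tx-1"},
    {"type": "COMMIT_PREPARED", "xid": 100},   # no gid stored in the record
]
print(decode(wal, 0))   # [('commit_prepared', 100, 'tx-1')]
print(decode(wal, 1))   # [('commit_prepared', 100, None)] -- gid is lost
```

Starting decoding one record later loses the mapping, which is the situation the patch avoids by storing the GID in the commit record.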
Re: [HACKERS] Tsvector editing functions
> On 10 Mar 2016, at 20:29, Teodor Sigaev wrote:
>
> I would like to suggest rename both functions to array_to_tsvector and
> tsvector_to_array to have consistent names. Later we could add
> to_tsvector([regconfig, ], text[]) with morphological processing.
>
> Thoughts?

Seems reasonable, done.

tsvector_ops-v6.diff
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Tsvector editing functions
> On 11 Mar 2016, at 16:13, Stas Kelvich wrote:
>
>> On 10 Mar 2016, at 20:29, Teodor Sigaev wrote:
>>
>> I would like to suggest rename both functions to array_to_tsvector and
>> tsvector_to_array to have consistent names. Later we could add
>> to_tsvector([regconfig, ], text[]) with morphological processing.
>>
>> Thoughts?

Hi, thanks for the commit. I saw errors on Windows; here is the fix:

tsvector.fix.patch
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On 11 Mar 2016, at 19:41, Jesper Pedersen wrote:

Thanks for the review, Jesper.

> Some comments:
>
> * The patch needs a rebase against the latest TwoPhaseFileHeader change

Done.

> * Rework the check.sh script into a TAP test case (src/test/recovery), as
> suggested by Alvaro and Michael down thread

Done. Originally I thought about reducing the number of tests (11 right now), but now, after some debugging, I’m more convinced that it is better to include them all, as they really do test different code paths.

> * Add documentation for RecoverPreparedFromXLOG

Done.

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

twophase_replay.v2.diff
Description: Binary data
Re: [HACKERS] async replication code
> On 16 Mar 2016, at 15:52, otheus uibk wrote:
>
> Greetings,

Hi

> I am new here. Apologetic question: how can i search the archives of this
> mailing list? Is there some set of magic words to put in Google? Must I wade
> through 20 pages of hits? Should I download all the archives and grep? :)

http://www.postgresql.org/list/
https://wiki.postgresql.org/wiki/Mailing_Lists

> Main question: I want to understand very precisely the exact algorithm used
> in PG to do asynchronous streaming replication. Eg, I want to know how the
> record is written and flushed to the socket and how that happens in time
> w.r.t WAL-to-disk and response to client. Will someone please give me a head
> start as to where to begin my code-spelunking?

The WAL is synced to disk first; everything else happens after that.

http://www.postgresql.org/docs/9.5/static/warm-standby.html#STREAMING-REPLICATION

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
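That ordering can be sketched as a toy model. All names here are invented for illustration; the real code paths are XLogFlush, the walsender loop and the client protocol handling:

```python
# Simplified model (invented names) of an async commit's ordering, per the
# answer above: WAL reaches local disk first; streaming to the standby and
# the client reply happen afterwards, and the client never waits for the
# standby in asynchronous mode.
events = []

def async_commit():
    events.append("wal_fsync_local")   # WAL record flushed and fsynced locally
    events.append("walsender_stream")  # walsender may now ship records up to the flush point
    events.append("reply_to_client")   # async: reply without a standby ack

async_commit()
print(events)
```

The essential invariant is only the first step's position: nothing leaves the primary, and no client sees a commit, before the WAL is durable locally.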
Re: [HACKERS] eXtensible Transaction Manager API (v2)
> On 12 Mar 2016, at 13:19, Michael Paquier wrote:
>
> On Fri, Mar 11, 2016 at 9:35 PM, Tom Lane wrote:
>> IMO this is not committable as-is, and I don't think that it's something
>> that will become committable during this 'fest. I think we'd be well
>> advised to boot it to the 2016-09 CF and focus our efforts on other stuff
>> that has a better chance of getting finished this month.
>
> Yeah, I would believe that a good first step would be to discuss
> deeply about that directly at PGCon for folks that will be there and
> interested in the subject. It seems like a good timing to brainstorm
> things F2F at the developer unconference for example, a couple of
> months before the 1st CF of 9.7. We may perhaps (or not) get to
> cleaner picture of what kind of things are wanted in this area.

To give an overview of XTM coupled with postgres_fdw from the user's perspective, I’ve packaged the patched postgres in Docker and provided a test case in which it is easy to spot a violation of the READ COMMITTED isolation level without XTM.

The test fills a database with users spread across two shards that are connected by postgres_fdw and inherit the same table. Then it starts concurrently transferring money between users on different shards:

begin;
update t set v = v - 1 where u=%d; -- this is a user from t_fdw1, first shard
update t set v = v + 1 where u=%d; -- this is a user from t_fdw2, second shard
commit;

The test also simultaneously runs a reader thread that counts all the money in the system:

select sum(v) from t;

In a transactional system we expect that sum to be constant (zero in our case, as we initialize every user with a zero balance). But we can see that without tsdtm the total amount of money fluctuates around zero.

https://github.com/kelvich/postgres_xtm_docker

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
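The invariant the reader thread checks can be illustrated without any real sharding at all. This is a hypothetical in-memory model (no postgres_fdw involved) of why a non-atomic cross-shard transfer lets a concurrent reader observe a non-zero total, even though every completed transfer preserves the sum:

```python
# Two dicts stand in for the two shards; every transfer preserves the total,
# but a reader sampling between the two halves of a transfer sees the sum
# fluctuate -- the READ COMMITTED violation the Docker test looks for.
shard1 = {"user_a": 0}
shard2 = {"user_b": 0}

def total():                       # the reader's "select sum(v) from t"
    return sum(shard1.values()) + sum(shard2.values())

before = total()                   # 0
shard1["user_a"] -= 1              # first half of the transfer is visible...
mid = total()                      # -1: money has "disappeared"
shard2["user_b"] += 1              # ...before the second half lands
after = total()                    # 0 again

print(before, mid, after)          # 0 -1 0
```

A global transaction manager closes exactly this window by giving the reader a snapshot in which either both halves of the transfer are visible or neither is.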
Re: [HACKERS] fd.c: flush data problems on osx
> On 17 Mar 2016, at 20:23, Andres Freund wrote:
>
>> Also there are no default ifdef inside this function, is there any
>> check that will guarantee that pg_flush_data will not end up with
>> empty body on some platform?
>
> There doesn't need to be - it's purely "advisory", i.e. just an
> optimization.

Ah, okay, then I misinterpreted the purpose of that function; it shouldn’t be forced to sync.

>> One possible solution for that is just fallback to pg_fdatasync in case when
>> offset = nbytes = 0.
>
> Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
> the file size. Could you test that?

It looks like OS X mmap raises EINVAL when the length isn’t aligned to the page size, although the manual says it can be of arbitrary length, so I aligned it. Also there was a call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname the file descriptor is opened O_RDONLY, so I changed the mmap mode to PROT_READ — it seems PROT_WRITE wasn’t needed anyway.

All of that reduces the number of warnings by an order of magnitude, but there are still some, and I don’t yet understand why they are happening.

> Greetings,
>
> Andres Freund

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

flushdata.v2.patch
Description: Binary data
[HACKERS] fd.c: flush data problems on osx
Hi,

The current implementation of pg_flush_data, when called with zero offset and zero nbytes, is assumed to flush the whole file. On OS X it uses mmap with these arguments, but according to the man page:

    "[EINVAL] The len argument was negative or zero. Historically, the
    system call would not return an error if the argument was zero.
    See other potential additional restrictions in the COMPATIBILITY
    section below."

so it generates a lot of warnings:

    "WARNING: could not mmap while flushing dirty data: Invalid argument"

A call to pg_flush_data with zero offset and nbytes happens when a replica starts from a base backup with fsync=on. A TAP test to reproduce this is attached.

One possible solution is simply to fall back to pg_fdatasync in the case when offset = nbytes = 0.

Also, there is no default #ifdef branch inside this function — is there any check that guarantees pg_flush_data will not end up with an empty body on some platform?

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

# Minimal test testing streaming replication
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Initialize master node
my $node_master = get_new_node('master');
$node_master->init(allows_streaming => 1);
$node_master->append_conf(
	'postgresql.conf', qq(
fsync = 'on'
));
$node_master->start;
my $backup_name = 'my_backup';

# Take backup
$node_master->backup($backup_name);

# Create streaming standby linking to master
my $node_standby = get_new_node('standby');
$node_standby->init_from_backup($node_master, $backup_name);
$node_standby->start;

flushdata.patch
Description: Binary data
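The "whole file" handling that was eventually adopted in this thread can be sketched as follows. The helper name is invented, and this is Python rather than the C in fd.c: resolve the zero offset/nbytes convention with lseek(SEEK_END), and round the length up to a page boundary before handing it to mmap, since both a zero and an unaligned length were observed to raise EINVAL on OS X:

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def whole_file_flush_args(fd, offset, nbytes):
    # Hypothetical helper mirroring the convention described above:
    # offset == nbytes == 0 means "the whole file". Instead of passing a
    # zero length to mmap, resolve the real size via lseek(SEEK_END) and
    # round it up to a page boundary.
    if offset == 0 and nbytes == 0:
        nbytes = os.lseek(fd, 0, os.SEEK_END)
    length = (nbytes + PAGE - 1) // PAGE * PAGE
    return offset, length

with tempfile.TemporaryFile() as f:
    f.write(b"x" * 5000)
    f.flush()
    off, length = whole_file_flush_args(f.fileno(), 0, 0)

print(off, length % PAGE, length >= 5000)   # 0 0 True
```

With a 4 KB page size a 5000-byte file would map 8192 bytes; the exact page size varies by platform, but the length is always a non-zero multiple of it.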
Re: [HACKERS] fd.c: flush data problems on osx
> On 18 Mar 2016, at 14:45, Stas Kelvich wrote:
>
>>> One possible solution for that is just fallback to pg_fdatasync in case
>>> when offset = nbytes = 0.
>>
>> Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
>> the file size. Could you test that?
>
> It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize
> while manual says it can be of arbitrary length, so i aligned it.
> Also there were call to mmap with PROT_READ | PROT_WRITE, but when called
> from pre_sync_fname file descriptor is just O_RDONLY, so i changed mmap mode
> to PROT_READ — seems that PROT_WRITE wasn’t needed anyway.
>
> And all of that reduces number of warnings in order of magnitude but there
> are still some and I don’t yet understand why are they happening.

I’ve spent some more time on this issue and found that the remaining warnings were caused by mmap-ing directories — that raises EINVAL on OS X (probably not only OS X, but I didn’t try others). So I’ve skipped mmap for directories, and now restore happens without warnings.

Also I’ve fixed a wrong error check that was in the previous version of the patch.

flushdata.v3.patch
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] fd.c: flush data problems on osx
> On 21 Mar 2016, at 14:53, Andres Freund wrote:
>
> Hm. I think we should rather just skip calling pg_flush_data in the
> directory case, that very likely isn't beneficial on any OS.

Seems reasonable, changed.

> I think we still need to fix the mmap() implementation to support the
> offset = 0, nbytes = 0 case (via fseek(SEEK_END).

That is already in this diff; I added it a few messages ago.

flushdata.v4.patch
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
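The directory check discussed above amounts to something like the following. This is an illustrative sketch in Python, not the actual C change to fd.c: flushing is skipped for anything that is not a regular file, since mmap-ing a directory fails with EINVAL on OS X and is pointless anyway.

```python
import os
import stat
import tempfile

def flushable(fd):
    # Sketch of the guard described above: mmap-based flushing only makes
    # sense for regular files, so skip directories and other file types.
    return stat.S_ISREG(os.fstat(fd).st_mode)

tmpdir = tempfile.mkdtemp()
dir_fd = os.open(tmpdir, os.O_RDONLY)
regular = tempfile.NamedTemporaryFile()

dir_ok = flushable(dir_fd)              # False: directories are skipped
file_ok = flushable(regular.fileno())   # True: regular files are flushed

print(dir_ok, file_ok)   # False True

os.close(dir_fd)
regular.close()
os.rmdir(tmpdir)
```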
Re: [HACKERS] [PATCH] we have added support for box type in SP-GiST index
> On 21 Mar 2016, at 22:38, Kevin Grittner wrote:
>
> On Mon, Mar 21, 2016 at 11:57 AM, Teodor Sigaev wrote:
>
>>> I couldn't get the term 4D point. Maybe it means that we are
>>> using box datatype as the prefix, but we are not treating them
>>> as boxes.
>>
>> exactly, we treat boxes as 4-dimensional points.
>
> I'm not entirely sure I understand the terminology either, since I
> tend to think of a *point* as having *zero* dimensions. Would it
> perhaps be more accurate to say we are treating a 2-dimensional box
> as a point in 4-dimensional space?

Or just say "4-d vector" instead of "4-d point". That looks rigorous enough.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
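For illustration, the box-as-4-D-point idea from the exchange above can be sketched like this. The helper names are invented and this is only a model; the real implementation lives in the SP-GiST box opclass:

```python
# A 2-D box is given by its two corners, i.e. four coordinates, so it maps
# to a single point (vector) in 4-D space. An SP-GiST split at a centroid
# then has 2**4 = 16 "quadrants", one bit per coordinate.
def box_to_4d(x1, y1, x2, y2):
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

def quadrant(centroid, p):
    # One bit per coordinate: set if the point is >= the centroid there.
    return sum(1 << i for i in range(4) if p[i] >= centroid[i])

c = box_to_4d(0, 0, 10, 10)                  # centroid (0, 0, 10, 10)
print(quadrant(c, box_to_4d(1, 1, 2, 2)))    # 3: small box near the origin
print(quadrant(c, box_to_4d(5, 5, 20, 20)))  # 15: box extending past the centroid
```

Because each box is a single 4-D point, two boxes never "overlap" in this space the way R-tree bounding boxes do — which is the degradation the patch avoids.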
Re: [HACKERS] [PATCH] we have added support for box type in SP-GiST index
> On 22 Mar 2016, at 01:47, Jim Nasby wrote:
>
> On 3/21/16 11:57 AM, Teodor Sigaev wrote:
>> A and B are points of intersection of lines. So, box PBCA is a bounding
>> box for points contained in 3-rd (see labeling above). For example X
>> labeled point is not a descendant of child node with centroid C because
>> it must be in branch of 1-st quad of parent node. So, each node (except
>> root) will have a limitation in its quadrant. To transfer that
>> limitation the traversalValue is used.
>
> Isn't this basically the same thing that the cube contrib module does? (Which
> has the added benefit of kNN-capable operators).

Cube and postgres' own geometric types are indexed by building an R-tree. While that is one of the most popular solutions, it degrades almost to a linear search when the bounding boxes of the index nodes overlap a lot. That problem arises when one tries to index, say, the streets of a city with a grid system — a lot of the streets' bounding boxes will simply coincide with the bounding box of the whole city. This patch uses a totally different strategy for building the index and does not suffer from that problem.

> If that's true then ISTM it'd be better to work on pulling cube's features
> into box?
>
> If it's not true, I'm still wondering if there's enough commonality here that
> we should pull cube into core…

There is also the intarray module with very similar functionality, but it too addresses a different data pattern. The optimal indexing and kNN strategies are very sensitive to the data itself, and if we want general multidimensional search routines, I don't think any of the mentioned extensions is general enough: cube is barely applicable when the number of dimensions is higher than 10-20, intarray performs badly on data with big sets of possible coordinates, and this patch is likewise intended to help with a specific, niche problem.

While people tend to use machine learning and regression models more and more, it would be interesting to have some general n-dimensional indexing with kNN, but I think that is a different problem and should be solved in a different way.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] 2PC support for pglogical
> On 24 Mar 2016, at 17:03, Robert Haas wrote:
>
> On Wed, Mar 23, 2016 at 1:44 AM, Craig Ringer wrote:
>> On 10 March 2016 at 22:50, Stas Kelvich wrote:
>>> Hi.
>>>
>>> Here is proof-of-concept version of two phase commit support for logical
>>> replication.
>>
>> I missed this when you posted it, so sorry for the late response.
>>
>> I've read through this but not tested it yet. I really appreciate you doing
>> it, it's been on my wishlist/backburner for ages.
>>
>> On reading through the patch I noticed that there doesn't seem to be any
>> consideration of locking. The prepared xact can still hold strong locks on
>> catalogs. How's that handled? I think Robert's group locking stuff is what
>> we'll want here - for a prepared xact we can join the prepared xact as a
>> group lock member so we inherit its locks. Right now if you try DDL in a
>> prepared xact I suspect it'll break.
>
> I have grave doubts about that approach. It's not impossible it could
> be right, but I wouldn't bet the farm on it.

I think all the locking is already handled properly by creating a dummy backend entry in PGPROC, as is done in the usual postgres 2PC implementation. I just checked DDL with the following scenario:

* prepare a tx that inserts a row into a table on the master
* execute DROP TABLE on the pglogical slave
* COMMIT PREPARED on the master

It behaves as expected — the slave blocks the DROP TABLE until the COMMIT PREPARED on the master.

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On Mar 29, 2016, at 6:04 PM, David Steele wrote:
>
> It looks like you should post a new patch or respond to Michael's comments.
> Marked as "waiting on author".

Yep, here it is.

> On Mar 22, 2016, at 4:20 PM, Michael Paquier wrote:
>
> Looking at this patch…

Thanks.

> +++ b/src/test/recovery/t/006_twophase.pl
> @@ -0,0 +1,226 @@
> +# Checks for recovery_min_apply_delay
> +use strict;
>
> This description is wrong, this file has been copied from 005.

Yep, done.

> +my $node_master = get_new_node("Candie");
> +my $node_slave = get_new_node('Django');
>
> Please let's use node names that are more descriptive.

Hm, it’s hard to come up with descriptive names, because the test changes the master/slave roles of those nodes several times during the test. It would be possible to call them “node1” and “node2”, but I’m not sure that improves anything. Anyway, I’m not insisting on those particular names and will agree with any reasonable suggestion.

> +# Switch to synchronous replication
> +$node_master->append_conf('postgresql.conf', qq(
> +synchronous_standby_names = '*'
> +));
> +$node_master->restart;
>
> Reloading would be fine.

Good catch, done.

> + /* During replay that lock isn't really necessary, but let's take it anyway */
> + LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
> + for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
> + {
> + gxact = TwoPhaseState->prepXacts[i];
> + proc = &ProcGlobal->allProcs[gxact->pgprocno];
> + pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
> +
> + if (TransactionIdEquals(xid, pgxact->xid))
> + {
> + gxact->locking_backend = MyBackendId;
> + MyLockedGxact = gxact;
> + break;
> + }
> + }
> + LWLockRelease(TwoPhaseStateLock);
>
> Not taking ProcArrayLock here?

All accesses to the 2PC dummies in ProcArray are covered by TwoPhaseStateLock, so I think that’s safe. Also I’ve deleted the comment above that block; it was probably more confusing than descriptive.

> The comment at the top of XlogReadTwoPhaseData is incorrect.

Yep, fixed.

> RecoverPreparedFromXLOG and RecoverPreparedFromFiles have a lot of
> code in common, having this duplication is not good, and you could
> simplify your patch.

I reworked the patch to avoid duplicated code between RecoverPreparedFromXLOG/RecoverPreparedFromFiles and also between FinishPreparedTransaction/XlogRedoFinishPrepared.

twophase_replay.v3.diff
Description: Binary data

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
My +1 for moving the function to xlogutils.c too. Right now the call to this function goes through a series of callbacks, so it is hard to find. Personally, I found it only after I had implemented the same function myself (based on the code in pg_xlogdump).

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

> On 12 Jan 2016, at 18:56, Alvaro Herrera wrote:
>
> Michael Paquier wrote:
>> On Tue, Jan 12, 2016 at 5:21 PM, Simon Riggs wrote:
>>> Should we just move the code somewhere just to imply it is generic? Seems
>>> pointless refactoring to me.
>>
>> Er, why not xlogutils.c? Having the 2PC code depending directly on
>> something that is within logicalfuncs.c is weird.
>
> Yes, I agree with Michael -- it's better to place code in its logical
> location than keep it somewhere else just because historically it was
> there. That way, future coders can find the function more easily.
>
> --
> Álvaro Herrera    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Speedup twophase transactions
Hi,

Thanks for the reviews and the commit!

As Simon and Andres already mentioned in this thread, replay of two-phase transactions is significantly slower than the same operations in normal mode. The major reason is that each state file is fsynced during replay, and while that is not a problem for recovery, it is a problem for replication. Under a high 2PC update load, the lag between master and async replica increases constantly (see graph below).

One way to improve things is to move the fsyncs to restartpoints, but as we saw previously that is a half-measure, and just the frequent calls to fopen can become a bottleneck. The other option is to use the same scenario for replay that is already used in non-recovery mode: read state files into memory during replay of PREPARE, and if a checkpoint/restartpoint occurs between PREPARE and COMMIT, move the data to files. On commit we can read either the xlog or the files.

So here is a patch that implements this scenario for replay. The patch is quite straightforward. During replay of PREPARE records, RecoverPreparedFromXLOG() is called to create the in-memory state in GXACT, PROC and PGPROC; on commit, XlogRedoFinishPrepared() is called to clean up that state. Also, several functions (PrescanPreparedTransactions, StandbyTransactionIdIsPrepared) were assuming that during replay all prepared xacts have files in pg_twophase, so I have extended them to check GXACT too. A side effect of that behaviour is that we can now see prepared xacts in the pg_prepared_xacts view on the slave.

Since this patch touches quite a sensitive part of postgres replay, with some rarely used code paths, I wrote a shell script that sets up master/slave replication and tests different failure scenarios that can happen to the instances. I'm attaching this file to show the scenarios that I have tested and, more importantly, what I didn’t test. In particular, I failed to reproduce a situation where StandbyTransactionIdIsPrepared() is called; maybe somebody can suggest a way to force its usage.

Also, I’m not too sure about the necessity of calling cache invalidation callbacks during XlogRedoFinishPrepared(); I’ve marked this place in the patch with a 2REVIEWER comment.

Tests show that this patch increases the speed of 2PC replay to the level where the replica can keep pace with the master. Graph: replica lag under a pgbench run for 200 seconds with 2PC update transactions (80 connections, one update per 2PC tx, two servers with 12 cores each, 10GbE interconnect) on current master and with the suggested patch. Replica lag was measured each second with "select sent_location-replay_location as delay from pg_stat_replication;".

twophase_replay.diff
Description: Binary data

check.sh
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

> On 12 Jan 2016, at 22:57, Simon Riggs wrote:
>
> On 12 January 2016 at 18:14, Andres Freund wrote:
>
>> Hi,
>>
>> Thank you for the additional review.
>>
>> On 2016-01-11 19:39:14 +0000, Simon Riggs wrote:
>>> Currently, the patch reuses all of the code related to reading/write state
>>> files, so it is the minimal patch that can implement the important things
>>> for performance. The current patch succeeds in its goal to improve
>>> performance, so I personally see no further need for code churn.
>>
>> Sorry, I don't buy that argument. This approach leaves us with a bunch
>> of code related to statefiles that's barely ever going to be exercised,
>> and leaves the performance bottleneck on WAL replay in place.
>
> I raised the issue of WAL replay performance before you were involved, as has
> been mentioned already. I don't see it as a blocker for this patch. I have
> already requested it from Stas and he has already agreed to write that.
>
> Anyway, we know the statefile code works, so I'd prefer to keep it, rather
> than write a whole load of new code that would almost certainly fail.
> Whatever the code looks like, the frequency of usage is the same. As I
> already said, you can submit a patch for the new way if you wish; the reality
> is that this code works and there's no additional performance gain from doing
> it a different way.
>
>>> As you suggest, we could also completely redesign the state file mechanism
>>> and/or put it in WAL at checkpoint. That's all very nice but is much more
>>> code and doesn't anything more for performance, since the current mainline
>>> path writes ZERO files at checkpoint.
>>
>> Well, on the primary, yes.
>>
>> Your changes proposed earlier wouldn't change performance on the standby.
>
>>> If you want that for some other reason or refactoring, I won't stop
>>> you, but its a separate patch for a separate purpose.
>>
>> Maintainability/complexity very much has to be considered during review and
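The replay strategy described above — keep PREPARE state in memory and spill it to files only at a restartpoint — can be modeled in a few lines. All names here are invented; this is not the actual GXACT/pg_twophase code, just a sketch of where the fsync cost moves:

```python
# Toy model of the patch's replay path: PREPARE keeps the transaction's
# state in memory, a restartpoint spills the survivors to "files", and
# COMMIT PREPARED checks memory first, then disk. The fsync cost moves
# from every PREPARE to one batched write at the restartpoint.
in_memory = {}   # xid -> state, stands in for GXACT entries
on_disk = {}     # stands in for pg_twophase state files

def replay_prepare(xid, state):
    in_memory[xid] = state              # no file write, no fsync here

def restartpoint():
    on_disk.update(in_memory)           # one batched "fsync" for all survivors
    in_memory.clear()

def replay_commit_prepared(xid):
    state = in_memory.pop(xid, None)
    if state is None:
        state = on_disk.pop(xid, None)  # fall back to the state file
    return state

replay_prepare(100, "gxact-100")
print(replay_commit_prepared(100))      # 'gxact-100': never touched disk
replay_prepare(101, "gxact-101")
restartpoint()
print(replay_commit_prepared(101))      # 'gxact-101': read back from the "file"
```

Transactions that prepare and commit between two restartpoints never touch the filesystem at all, which is the common case under the pgbench load described above.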
Re: [HACKERS] Speedup twophase transactions
Agree — I had the same idea in mind when writing that script. I will migrate it to the TAP suite and write a review for Michael Paquier's patch.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

> On 26 Jan 2016, at 20:20, Alvaro Herrera wrote:
>
> Stas Kelvich wrote:
>
>> While this patch touches quite sensible part of postgres replay and there is
>> some rarely used code paths, I wrote shell script to setup master/slave
>> replication and test different failure scenarios that can happened with
>> instances. Attaching this file to show test scenarios that I have tested and
>> more importantly to show what I didn’t tested. Particularly I failed to
>> reproduce situation where StandbyTransactionIdIsPrepared() is called, may be
>> somebody can suggest way how to force it’s usage. Also I’m not too sure
>> about necessity of calling cache invalidation callbacks during
>> XlogRedoFinishPrepared(), I’ve marked this place in patch with 2REVIEWER
>> comment.
>
> I think this is the third thread in which I say this: We need to push
> Michael Paquier's recovery test framework, then convert your test script
> to that. That way we can put your tests in core.
>
> --
> Álvaro Herrera    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
Hi,

I tried that and can confirm the strange behaviour. It seems the problem is with the small cyrillic letter ‘х’. (The simplest obscene-language filter? =) It can be reproduced with the simpler test attached.

Stas

test.c
Description: Binary data

> On 27 Jan 2016, at 13:59, Artur Zakirov wrote:
>
> On 27.01.2016 13:46, Shulgin, Oleksandr wrote:
>>
>> Not sure why the file uses "SET KOI8-R" directive then?
>
> This directive is used only by the Hunspell program. PostgreSQL ignores this
> directive and assumes that input affix and dictionary files are in the UTF-8
> encoding.
>
>> What error message do you get with this test program? (I don't get any,
>> but I'm not on Mac OS.)
>> --
>> Alex
>
> With this program you will get wrong output. An error message is not raised.
> You can execute the following commands:
>
> > cc test.c -o test
> > ./test
>
> You will get the output:
>
> SFX/Y/?/аться/шутся
>
> Although the output should be:
>
> SFX/Y/хаться/шутся/хаться
>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
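One supporting observation (a sketch in Python, not the attached test.c): ‘х’ is a two-byte sequence in UTF-8 but a single byte in KOI8-R, the encoding named in the affix file's directive, so any byte-at-a-time character handling will mangle exactly this kind of letter while leaving ASCII untouched:

```python
# 'х' (U+0445, CYRILLIC SMALL LETTER HA) occupies two bytes in UTF-8 but one
# in KOI8-R; processing the UTF-8 bytes individually splits the character,
# which is consistent with the '?' seen in the corrupted SFX line above.
ha = "\u0445"                    # the letter 'х'
utf8 = ha.encode("utf-8")
koi8 = ha.encode("koi8-r")
print(utf8, koi8, len(utf8), len(koi8))   # b'\xd1\x85' b'\xc8' 2 1
```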
Re: [HACKERS] Tsvector editing functions
Hi > On 22 Jan 2016, at 19:03, Tomas Vondra wrote: > OK, although I do recommend using more sensible variable names, i.e. why how > to use 'lexemes' instead of 'lexarr' for example? Similarly for the other > functions. Changed. With old names I tried to follow conventions in surrounding code, but probably that is a good idea to switch to more meaningful names in new code. >> >> >> delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions >> of tsv_filter from tsin. When lexeme in tsv_filter has no positions function >> will delete any occurrence of same lexeme in tsin. When tsv_filter lexeme >> have positions function will delete them from positions of matching lexeme >> in tsin. If after such removal resulting positions set is empty then >> function will delete that lexeme from resulting tsvector. >> > > I can't really imagine situation in which I'd need this, but if you do have a > use case for it ... although in the initial paragraph you say "... but if > somebody wants to delete for example ..." which suggests you may not have > such use case. > > Based on bad experience with extending API based on vague ideas, I recommend > only really adding functions with existing need. It's easy to add a function > later, much more difficult to remove it or change the signature. I tried to create more or less self-contained api, e.g. have ability to negate effect of concatenation. But i’ve also asked people around what they think about extending API and everybody convinced that it is better to stick to smaller API. So let’s drop it. At least that functions exists in mail list in case if somebody will google for such kind of behaviour. >> >> Also if we want some level of completeness of API and taking into account >> that concat() function shift positions on second argument I thought that it >> can be useful to also add function that can shift all positions of specific >> value. 
This helps to undo concatenation: delete one of concatenating >> tsvectors and then shift positions in resulting tsvector. So I also wrote >> one another small function: >> >> shift(tsin tsvector,offset int16) — Shift all positions in tsin by given >> offset > > That seems rather too low-level. Shouldn't it be really built into delete() > directly somehow? I think that is an ambiguous task to perform inside delete(). But if we are dropping support for delete(tsvector, tsvector), I don’t see any point in keeping that function either. >>> >>> 7) Some of the functions use intexterm that does not match the function >>> name. I see two such cases - to_tsvector and setweight. Is there a >>> reason for that? >>> >> >> Because sgml compiler wants unique indexterm. Both functions that >> youmentioned use overloading of arguments and have non-unique name. > > As Michael pointed out, that should probably be handled by using > and tags. Done. > On 19 Jan 2016, at 00:21, Alvaro Herrera wrote: > > > It's a bit funny that you reintroduce the "unrecognized weight: %d" > (instead of %c) in tsvector_setweight_by_filter. > Ah, I was thinking about moving it to a separate diff and messed that up. Fixed, and I'm attaching a diff with the same fix for the old tsvector_setweight. tsvector_ops-v2.1.diff Description: Binary data tsvector_ops-v2.2.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.
Hi. I’ve looked over the proposed patch and migrated the shell test scripts that I’ve used for testing two-phase commits on master/slave to this test framework. Everything looks mature, and I didn’t encounter any problems writing new tests using this infrastructure. From my point of view I don’t see any problem with committing these patches in their current state. Also, some things about the test suite that came to mind:
0) There are several routines that do the actual checking, like is/command_ok/command_fails. I think it would be very handy to have wrappers psql_ok/psql_fails that call psql through command_ok/command_fails.
1) It would be better to raise a more meaningful error when IPC::Run is absent.
2) --enable-tap-tests deserves a mention in test/recovery/README, and a more obvious error message when one tries to run make check in test/recovery without --enable-tap-tests.
3) Would it be hard to add the ability to run TAP tests in extensions?
4) It would be handy if make check wrote the path to the log files when a test fails.
5) psql() accepts a database name as its first argument, but everywhere in the tests it is ‘postgres’. Isn’t it simpler to store the dbname in the connstr and have a separate function to change the database?
6) Clean logs on prove restart? Clean up tmp installations?
7) make check sets PGPORT and PG_REGRESS for prove. Is that necessary?
> On 22 Jan 2016, at 09:17, Michael Paquier wrote: > > On Mon, Dec 21, 2015 at 4:45 PM, Michael Paquier > wrote: >> As this thread is stalling a bit, please find attached a series of >> patch gathering all the pending issues for this thread: >> - 0001, fix config_default.pl for MSVC builds to take into account TAP tests >> - 0002, append a node name in get_new_node (per Noah's request) >> - 0003, the actual recovery test suite >> Hopefully this facilitates future reviews. > > Patch 2 has been pushed as c8642d9 (thanks Alvaro). The remaining two > patches still apply and pass cleanly. 
> -- > Michael Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] tsearch_extras extension
> On 17 Feb 2016, at 08:14, Oleg Bartunov wrote: > > > > On Wed, Feb 17, 2016 at 6:57 AM, Tim Abbott wrote: > Just following up here since I haven't gotten a reply -- I'd love to work > with someone from the Postgres community on a plan to make the tsearch_extras > functionality available as part of mainline postgres. > > > -Tim Abbott > > On Wed, Feb 3, 2016 at 9:41 PM, Tim Abbott wrote: > Hello, > > I'm a maintainer of the Zulip open source group chat application. Zulip > depends on a small (~200 lines) postgres extension called tsearch_extras > (https://github.com/zbenjamin/tsearch_extras) that returns the actual > (offset, length) pairs of all the matches for a postgres full text search > query. This extension allows Zulip to do its own highlighting of the full > text search matches, using a more complicated method than what Postgres > supports natively. > > I think tsearch_extras is probably of value to others using postgres > full-text search (and I'd certainly love to get out of the business of > maintaining an out-of-tree postgres extension), so I'm wondering if this > feature (or a variant of it) would be of interest to upstream? > > Thanks! > > -Tim Abbott > > (See > http://www.postgresql.org/message-id/flat/52c7186d.8010...@strangersgate.com#52c7186d.8010...@strangersgate.com > for the discussion on postgres mailing lists that caused us to write this > module in the first place.) > > Tim, > > take a look on this patch (https://commitfest.postgresql.org/7/385/) and > contact author. It contains everything you need to your purposes. > > btw, Stas, check on status "Returned with feedback" ! > > > Regards, > Oleg > Hi, yes, this extension has some functionality in common with the patch. In particular, your tsvector_lexemes and my to_array do the same thing. I wasn’t aware of your patch when I started working on this, so I think we can ask the committer to mention you in the commit message (if it gets committed). 
And how do you use ts_match_locs_array? To highlight search results? There is a function called ts_headline that can mark matches with custom start/stop strings. --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Tsvector editing functions
> > On 02 Feb 2016, at 20:10, Teodor Sigaev wrote: > > Some notices: > > 1 tsin in documentation doesn't look like a good name. Changed to vector > similar to other places. > > 2 I did some editorization about freeing memory/forgotten names etc Thanks. > > 3 It seems to me that tsvector_unnest() could be seriously optimized for > large tsvectors: with current coding it detoasts/decompresses tsvector value > on each call. Much better to do it once in > multi_call_memory_ctx context at first call init Done, moved detoasting to first SRF call. > 4 It seems debatable returning empty array for position/weight if they are > absent: > =# select * from unnest('a:1 b'::tsvector); > lexeme | positions | weights > +---+- > a | {1} | {D} > b | {}| {} > I think, it's better to return NULL in this case > Okay, done. > 5 > array_to_tsvector/tsvector_setweight_by_filter/tsvector_delete_arr/tsvector_filter > doesn't check or pay attention to NULL elements in input arrays > Thanks! Fixed and added tests. > -- > Teodor Sigaev E-mail: teo...@sigaev.ru > WWW: http://www.sigaev.ru/ tsvector_ops-v4.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Logical decoding restart problems
> On 20 Aug 2016, at 15:59, Craig Ringer wrote: > > I'll wait for a test case or some more detail. Thanks for the clarification about how restart_lsn works. Digging slightly deeper into this topic revealed that the problem was in two-phase decoding, not in logical decoding itself. While I was writing DecodePrepare() I copied the call to SnapBuildCommitTxn() from DecodeCommit(), which removes the current transaction from the running list, and that is obviously the wrong thing to do for a prepared tx. So I ended up with the following workaround:

--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -950,7 +950,7 @@ SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
  */
 void
 SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
-                   int nsubxacts, TransactionId *subxacts)
+                   int nsubxacts, TransactionId *subxacts, bool isCommit)
 {
     int nxact;
@@ -1026,7 +1026,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
      * Make sure toplevel txn is not tracked in running txn's anymore, switch
      * state to consistent if possible.
      */
-    SnapBuildEndTxn(builder, lsn, xid);
+    if (isCommit)
+        SnapBuildEndTxn(builder, lsn, xid);

I call SnapBuildCommitTxn() with isCommit=true from commit and false from prepare. However, while I no longer observe partially decoded transactions, I’m not sure that this is the right way to build a proper snapshot, or that nothing else is broken. Also, while playing with a psycopg2 logical decoding client, I do see empty transactions containing only BEGIN/COMMIT (with the test_decoding output plugin, on current postgres master). -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Logical decoding restart problems
> On 31 Aug 2016, at 03:28, Craig Ringer wrote: > > On 25 Aug. 2016 20:03, "Stas Kelvich" wrote: > > > > Thanks for clarification about how restart_lsn is working. > > > > Digging slightly deeper into this topic revealed that problem was in two > > phase decoding, not it logical decoding itself. > > Good to know. The behavior didn't make much sense for logical decoding but > bugs usually don't. > > Do you plan to submit a patch for logical decoding of prepared transactions > to 10.0? > I do, probably in CF2. There is an issue with locks that Andres pointed me to; also, twophase.c was written before logical decoding existed, and it deserves some refactoring to avoid duplicated functionality between 2PC decoding and the restoring of old prepared xacts. I want to include that in the patch too. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On 06 Sep 2016, at 04:41, Michael Paquier wrote: > > On Sat, Sep 3, 2016 at 10:26 PM, Michael Paquier > wrote: >> On Fri, Sep 2, 2016 at 5:06 AM, Simon Riggs wrote: >>> On 13 April 2016 at 15:31, Stas Kelvich wrote: >>> >>>> Fixed patch attached. There already was infrastructure to skip currently >>>> held locks during replay of standby_redo() and I’ve extended that with >>>> check for >>>> prepared xids. >>> >>> Please confirm that everything still works on current HEAD for the new >>> CF, so review can start. >> >> The patch does not apply cleanly. Stas, could you rebase? I am >> switching the patch to "waiting on author" for now. > > So, I have just done the rebase myself and did an extra round of > reviews of the patch. Here are couple of comments after this extra > lookup. > > LockGXactByXid() is aimed to be used only in recovery, so it seems > adapted to be to add an assertion with RecoveryInProgress(). Using > this routine in non-recovery code paths is risky because it assumes > that a PGXACT could be missing, which is fine in recovery as prepared > transactions could be moved to twophase files because of a checkpoint, > but not in normal cases. We could also add an assertion of the kind > gxact->locking_backend == InvalidBackendId before locking the PGXACT > but as this routine is just normally used by the startup process (in > non-standalone mode!) that's fine without. > > The handling of invalidation messages and removal of relfilenodes done > in FinishGXact can be grouped together, checking only once for > !RecoveryInProgress(). > > + * > + * The same procedure happens during replication and crash recovery. > * > "during WAL replay" is more generic and applies here. > > + > +next_file: > + continue; > + > That can be removed and replaced by a "continue;". > > + /* > +* At first check prepared tx that wasn't yet moved to disk. 
> +*/ > + LWLockAcquire(TwoPhaseStateLock, LW_SHARED); > + for (i = 0; i < TwoPhaseState->numPrepXacts; i++) > + { > + GlobalTransaction gxact = TwoPhaseState->prepXacts[i]; > + PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno]; > + > + if (TransactionIdEquals(pgxact->xid, xid)) > + { > + LWLockRelease(TwoPhaseStateLock); > + return true; > + } > + } > + LWLockRelease(TwoPhaseStateLock); > This overlaps with TwoPhaseGetGXact but I'd rather keep both things > separated: it does not seem worth complicating the existing interface, > and putting that in cache during recovery has no meaning. Oh, I was preparing a new version of the patch after taking a fresh look at it. I probably should have said so in this thread. I’ve found a bug in subtransaction handling and am now working on a fix. > > I have also reworked the test format, and fixed many typos and grammar > problems in the patch as well as in the tests. Thanks! > > After review the result is attached. Perhaps a committer could get a look at > it? I'll check it against my failure scenario with subtransactions and post the results or an updated patch here. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On 06 Sep 2016, at 12:09, Simon Riggs wrote: > > On 6 September 2016 at 09:58, Stas Kelvich wrote: >> >> I'll check it against my failure scenario with subtransactions and post >> results or updated patch here. > > Make sure tests are added for that. It would have been better to say > you knew there were bugs in it earlier. I spotted that yesterday during the rebase. Sorry. Next time, in the same situation, I’ll write right away to save everyone’s time. > On 06 Sep 2016, at 12:03, Michael Paquier wrote: > > On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich wrote: >> Oh, I was preparing new version of patch, after fresh look on it. Probably, >> I should >> said that in this topic. I’ve found a bug in sub transaction handling and >> now working >> on fix. > > What's the problem actually? The handling of the xids_p array in PrescanPreparedTransactions() is wrong for prepared transactions in memory. I also want to double-check and add comments to the RecoveryInProgress() checks in FinishGXact. I’ll post a reworked patch in a few days. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On 07 Sep 2016, at 03:09, Michael Paquier wrote: > >>> On 06 Sep 2016, at 12:03, Michael Paquier wrote: >>> >>> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich >>> wrote: >>>> Oh, I was preparing new version of patch, after fresh look on it. >>>> Probably, I should >>>> said that in this topic. I’ve found a bug in sub transaction handling and >>>> now working >>>> on fix. >>> >>> What's the problem actually? >> >> Handling of xids_p array in PrescanPreparedTransactions() is wrong for >> prepared tx's in memory. >> Also I want to double-check and add comments to RecoveryInProgress() checks >> in FinishGXact. >> >> I’ll post reworked patch in several days. > > Could you use as a base the version I just sent yesterday then? I > noticed many mistakes in the comments, for example many s/it's/its/ > and did a couple of adjustments around the code, the goto next_file > was particularly ugly. That will be that much work not do to again > later. Yes, I have already merged my branch with your version. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
[HACKERS] Bug in two-phase transaction recovery
Hello. Some time ago the two-phase state file format was changed to have a variable-size GID, but several places that read those files were not updated to use the new offsets. The problem exists in master and 9.6 and can be reproduced with prepared transactions that contain savepoints. For example:

create table t(id int);
begin;
insert into t values (42);
savepoint s1;
insert into t values (43);
prepare transaction 'x';
begin;
insert into t values (142);
savepoint s1;
insert into t values (143);
prepare transaction 'y';

Then restart the server: the startup process will hang. A fix is attached. Also, while looking at StandbyRecoverPreparedTransactions() I noticed that the buffer for the 2PC file is allocated in TopMemoryContext but never freed. That has probably been there for a long time. gidlen_fixes.diff Description: Binary data standby_recover_pfree.diff Description: Binary data -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Suggestions for first contribution?
> On 05 Sep 2016, at 20:25, Christian Convey wrote: > > Hi guys, > > Can anyone suggest a project for my first PG contribution? > > My first two ideas didn't pan out: Yury doesn't seem to need help > with CMake, and the TODO list's "-Wcast-align" project (now deleted) > appeared impractical. There is also a list of projects for Google Summer of Code: https://wiki.postgresql.org/wiki/GSoC_2016 Those topics are expected to be doable by a newcomer within a few months. The list is slightly outdated, but you can always check the current state of things by searching this mailing list for a topic you find interesting. Postgres also has several internal APIs, like FDW [1] or GiST [2], that allow you to implement something useful without digging too deeply into postgres internals. [1] https://www.postgresql.org/docs/current/static/fdwhandler.html [2] https://www.postgresql.org/docs/current/static/gist-extensibility.html -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On 07 Sep 2016, at 11:07, Stas Kelvich wrote: > >> On 07 Sep 2016, at 03:09, Michael Paquier wrote: >> >>>> On 06 Sep 2016, at 12:03, Michael Paquier >>>> wrote: >>>> >>>> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich >>>> wrote: >>>>> Oh, I was preparing new version of patch, after fresh look on it. >>>>> Probably, I should >>>>> said that in this topic. I’ve found a bug in sub transaction handling and >>>>> now working >>>>> on fix. >>>> >>>> What's the problem actually? >>> >>> Handling of xids_p array in PrescanPreparedTransactions() is wrong for >>> prepared tx's in memory. >>> Also I want to double-check and add comments to RecoveryInProgress() checks >>> in FinishGXact. >>> >>> I’ll post reworked patch in several days. >> >> Could you use as a base the version I just sent yesterday then? I >> noticed many mistakes in the comments, for example many s/it's/its/ >> and did a couple of adjustments around the code, the goto next_file >> was particularly ugly. That will be that much work not do to again >> later. > > Yes. Already merged branch with your version. Here is an updated version of the patch. Looking through the old version I noted a few things that were bothering me:
* In case of crash replay, PREPARE redo accesses SUBTRANS, but StartupSUBTRANS() isn’t called yet at that point in StartupXLOG().
* Several functions in twophase.c have to walk through both the PGPROC array and the pg_twophase directory.
Now I have slightly changed the order of things in StartupXLOG(), so the two-phase state is loaded from file and StartupSUBTRANS() is called before actual recovery starts. So in all other functions we can be sure that the files were already loaded into memory. Also, since 2PC transactions are now dumped to files only at checkpoint time, we can get rid of PrescanPreparedTransactions(): all the necessary info can reside in the checkpoint itself. I’ve changed the behaviour of the oldestActiveXid write at checkpoint, and that’s probably a debatable topic, but ISTM it simplifies the recovery logic in both twophase.c and xlog.c a lot. 
More comments are in the patch itself. twophase_replay.v7.patch Description: Binary data -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
RecoveryRestartPoint(). We would need as well to tweak > PrescanPreparedTransactions accordingly than everything we are trying > to do here. > That would be way more simple than what's proposed here, and we'd > still keep all the performance benefits. So the file arrives at the replica through WAL, and instead of reading it directly you suggest copying it to DSM if it is small enough, and to the filesystem if not. That would probably let us stick to only reading and writing files, but:
* It will be measurably slower than the proposed approach because of unnecessary copying between WAL and DSM, not to mention prepare records of arbitrary length.
* It means different behaviour on replica and master, i.e. less-tested code paths for the replica.
* Almost the same behaviour can be achieved by delaying the fsync of the 2PC files till checkpoint.
But if I reword your comment as "it is better to avoid replaying the prepare record at all, as happens now in master", then I see the point. And that is possible even without DSM: we could allocate a statically sized array storing some info about each tx (whether it is in WAL or in a file, its xid, its gid), some sort of PGXACT doppelganger used only for replay purposes instead of the normal one. So, taking my comments into account, what do you think? Should we keep the current approach or try to avoid replaying prepare records at all? -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On 21 Sep 2016, at 10:32, Michael Paquier wrote: > > On Tue, Sep 20, 2016 at 11:13 PM, Stas Kelvich > wrote: >> >> Putting that before actual WAL replay is just following historical order of >> events. >> Prepared files are pieces of WAL that happened before checkpoint, so ISTM >> there is no conceptual problem in restoring their state before replay. > > For wal_level = minimal there is no need to call that to begin with.. > And I'd think that it is better to continue with things the same way. > Hm, why? >> >> With this patch we are reusing one infrastructure for normal work, recovery >> and replay. > > The mix between normal work and recovery is the scary part. Normal > work and recovery are entirely two different things. > Okay, agreed. A commit procedure that checks whether recovery is active or not is quite a hacky solution. >> So taking into account my comments what do you think? Should we keep current >> approach >> or try to avoid replaying prepare records at all? > > I'd really like to give a try to the idea you just mentioned, aka > delay the fsync of the 2PC files from RecreateTwoPhaseFile to > CreateRestartPoint, or get things into memory.. I could write one, or > both of those patches. I don't mind. I’m not giving up yet, I’ll write them. :) I still have several other patches to 2PC handling in mind for this release cycle: logical decoding, and a partitioned hash instead of the TwoPhaseState list. My bet is that the relative speed of those patches will depend on the filesystem used. As with the first patch in this mail thread, it is entirely possible to hit filesystem limits on file creation speed. Otherwise both approaches should be more or less equal, I suppose. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
[HACKERS] assert violation in logical messages serialization
Hello. During the processing of logical messages in ReorderBufferSerializeChange(), the pointer to the ondisk structure isn’t updated after a possible reorder buffer realloc by ReorderBufferSerializeReserve(). The actual cause was spotted by Konstantin Knizhnik. reorderbuffer.patch Description: Binary data -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Issues with logical replication
> On 2 Oct 2017, at 19:59, Petr Jelinek wrote: > >> >> Program terminated with signal SIGABRT, Aborted. >> #0 0x7f3608f8ec37 in __GI_raise (sig=sig@entry=6) at >> ../nptl/sysdeps/unix/sysv/linux/raise.c:56 >> 56../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. >> (gdb) bt >> #0 0x7f3608f8ec37 in __GI_raise (sig=sig@entry=6) at >> ../nptl/sysdeps/unix/sysv/linux/raise.c:56 >> #1 0x7f3608f92028 in __GI_abort () at abort.c:89 >> #2 0x009f5740 in ExceptionalCondition (conditionName=0xbf6b30 >> "!(((xid) != ((TransactionId) 0)))", >> errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c", >> lineNumber=582) at assert.c:54 >> #3 0x0086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0, >> oper=XLTW_None) at lmgr.c:582 >> #4 0x0081c9f7 in SnapBuildWaitSnapshot (running=0x2749198, >> cutoff=898498) at snapbuild.c:1400 >> #5 0x0081c7c7 in SnapBuildFindSnapshot (builder=0x2807910, >> lsn=798477760, running=0x2749198) at snapbuild.c:1311 >> #6 0x0081c339 in SnapBuildProcessRunningXacts >> (builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106 > > > Hmm this would suggest that XLOG_RUNNING_XACTS record contains invalid > transaction ids but we specifically skip those in > GetRunningTransactionData(). Can you check pg_waldump output for that > record (lsn is shown in the backtrace)? I investigated this case, and it seems that XactLockTableWait() in SnapBuildWaitSnapshot() does not always work as expected. XactLockTableWait() waits on LockAcquire() for the xid to complete, and if we finally get this lock while the transaction is still in progress, then such an xid is assumed to be a subxid. However, a LockAcquire/LockRelease cycle can happen after the transaction has set its xid, but before XactLockTableInsert(). 
Namely, the following history happened for xid 5225 and led to the crash:
[backend] LOG: AssignTransactionId: XactTopTransactionId = 5225
[walsender] LOG: LogCurrentRunningXacts: Wrote RUNNING_XACTS xctn=1, xid[0]=5225
[walsender] LOG: XactLockTableWait: LockAcquire 5225
[walsender] LOG: XactLockTableWait: LockRelease 5225
[backend] LOG: AssignTransactionId: LockAcquire ExclusiveLock 5225
[walsender] LOG: TransactionIdIsInProgress: SVC->latestCompletedXid=5224 < xid=5225 => true
[backend] LOG: CommitTransaction: ProcArrayEndTransaction xid=5225, ipw=0
[backend] LOG: CommitTransaction: ResourceOwnerRelease locks xid=5225
I quickly checked other usages of XactLockTableWait(), and it seems that it is mostly used with xids taken from the heap, so the tx has definitely set its lock somewhere in the past. I'm not sure what the best approach to handling this is. Maybe write running xacts only once they have already set their lock? I'm also attaching a pgbench script that can reproduce the failure. I run it over the usual pgbench database in 10 threads; it takes several minutes to crash. reload.pgb Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] WIP: About CMake v2
> On 17 Sep 2016, at 20:21, Yury Zhuravlev wrote: > > Michael Paquier wrote: >> On Sat, Sep 17, 2016 at 1:40 AM, Yury Zhuravlev >> wrote: >>> Michael Paquier wrote: >>> I merged master to my branch and I spent time to porting all changes. I hope >>> send patch in the weekend without terrible flaws. >> >> By the way, I noticed that you did not register this patch in the current >> CF.. > > Now, I published the first version of the patch. This patch not ready for > commit yet and all current task you can read here: > https://github.com/stalkerg/postgres_cmake/issues > > I hope we realy close to end. > In this patch I forbade in-source build for avoid overwriting current > Makefiles. We will can remove all Makefiles only after shall see in CMake. > You don't need support two system. During commitfests CMake build system will > be supported by me. > I need help with buildfarm because my knowledge of Perl is very bad (thought > about rewrite buildfarm to Python). > > I hope for your support. I tried to generate an Xcode project out of CMake; the build fails on genbki.pl, which can't locate Catalog.pm (the module lives in src/backend/catalog/Catalog.pm). -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
[HACKERS] Stats sender and 2pc minor problem
Hello. The statistics sender logic for a usual commit and for a two-phase commit do not strictly match each other, and that leads to delta_live_tuples being added to n_live_tup in case of a truncate inside a two-phase commit. That can be seen in the following example:

CREATE TABLE trunc_stats_test5(id serial);
INSERT INTO trunc_stats_test5 DEFAULT VALUES;
INSERT INTO trunc_stats_test5 DEFAULT VALUES;
INSERT INTO trunc_stats_test5 DEFAULT VALUES;
BEGIN;
TRUNCATE trunc_stats_test5;
PREPARE TRANSACTION 'twophase_stats';
COMMIT PREPARED 'twophase_stats';

After that, pg_stat_user_tables will show n_live_tup = 3 instead of 0. A fix along with a test is attached. 2pc-stats.patch Description: Binary data -- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
> On 2 Mar 2017, at 01:20, Petr Jelinek wrote: > > The info gets removed on COMMIT PREPARED but at that point > there is no real difference between replicating it as 2pc or 1pc since > the 2pc behavior is for all intents and purposes lost at that point. > If we are doing 2PC and a COMMIT PREPARED happens, then we should replicate it to the receiving servers without the transaction body, since the tx is already prepared on them with some GID. So we need a way to construct that GID. It seems that for the last ~10 messages I have been failing to explain some points about this topic. Or maybe I am failing to understand some points. Can we maybe set up a Skype call to discuss this and post a summary here? Craig? Peter? -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
> On 2 Mar 2017, at 11:00, Craig Ringer wrote:
>
> We already have it, because we just decoded the PREPARE TRANSACTION.
> I'm preparing a patch revision to demonstrate this.

Yes, we already have it, but if the server reboots between commit prepared (when all prepared state is gone) and the decoding of this commit prepared, then we lose that mapping, don't we?

> BTW, I've been reviewing the patch in more detail. Other than a bunch
> of copy-and-paste that I'm cleaning up, the main issue I've found is
> that in DecodePrepare, you call:
>
> SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
> parsed->nsubxacts, parsed->subxacts);
>
> but I am not convinced it is correct to call it at PREPARE TRANSACTION
> time, only at COMMIT PREPARED time. We want to see the 2pc prepared
> xact's state when decoding it, but there might be later commits that
> cannot yet see that state and shouldn't have it visible in their
> snapshots.

Agreed, that is a problem. The current call allows us to decode this PREPARE, but after that it would be better to mark this transaction as running in the snapshot, or to perform the prepare decoding with some kind of copied-and-edited snapshot. I'll have a look at this.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 16 Mar 2017, at 14:44, Craig Ringer wrote:
>
> I'm going to try to pick this patch up and amend its interface per our
> discussion earlier, see if I can get it committable.

I'm working right now on the issue of building snapshots for decoding a prepared tx. I hope I'll send an updated patch later today.

> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
>> On 2 Mar 2017, at 11:00, Craig Ringer wrote:
>>
>> BTW, I've been reviewing the patch in more detail. Other than a bunch
>> of copy-and-paste that I'm cleaning up, the main issue I've found is
>> that in DecodePrepare, you call:
>>
>> SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
>> parsed->nsubxacts, parsed->subxacts);
>>
>> but I am not convinced it is correct to call it at PREPARE TRANSACTION
>> time, only at COMMIT PREPARED time. We want to see the 2pc prepared
>> xact's state when decoding it, but there might be later commits that
>> cannot yet see that state and shouldn't have it visible in their
>> snapshots.
>
> Agree, that is problem. That allows to decode this PREPARE, but after that
> it is better to mark this transaction as running in snapshot or perform
> prepare
> decoding with some kind of copied-end-edited snapshot. I’ll have a look at
> this.
>

While working on this I've spotted quite a nasty corner case with an aborted prepared transaction. I have some not-that-great ideas how to fix it, but maybe I've blurred my view and missed something, so I want to ask here first.

Suppose we created a table, then in a 2pc tx we alter it, and after that we abort the tx. pg_class will then have something like this:

 xmin | xmax | relname
  100 |  200 | mytable
  200 |    0 | mytable

After that abort, tuple (100, 200, mytable) becomes visible again, and if we alter the table once more then the xmax of the first tuple will be set to the current xid, resulting in the following table:

 xmin | xmax | relname
  100 |  300 | mytable
  200 |    0 | mytable
  300 |    0 | mytable

At that moment we have lost the information that the first tuple was deleted by our prepared tx. From the POV of the historic snapshot that will be constructed to decode the prepare, the first tuple is visible, but actually the second tuple should be used. Moreover, such a snapshot could see both tuples, violating oid uniqueness, but a heap scan stops after finding the first one.
I see two possible workarounds here:
* First scan the catalog filtering out tuples with xmax bigger than snapshot->xmax, since those were possibly deleted by our tx. Then, if nothing is found, scan in the usual way.
* Do not decode such a transaction at all. If by the time we decode the prepare record we already know that it is aborted, then decoding it doesn't make much sense. IMO the intended usage of logical 2pc decoding is to decide about commit/abort based on the answers from logical subscribers/replicas. So there will be a barrier between prepare and commit/abort, and such situations shouldn't happen.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
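The first workaround could look roughly like the filter below, applied on an initial catalog pass. This is a deliberately simplified sketch under assumed names (MiniSnapshot, MiniTuple and keep_on_first_pass are hypothetical, and a real historic snapshot carries much more state, such as in-progress xids):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId 0

/* Hypothetical, stripped-down historic snapshot. */
typedef struct MiniSnapshot
{
    TransactionId xmin;
    TransactionId xmax;
} MiniSnapshot;

typedef struct MiniTuple
{
    TransactionId xmin;
    TransactionId xmax;         /* InvalidTransactionId = never deleted */
} MiniTuple;

/*
 * First pass: skip tuples whose xmax lies beyond the snapshot's xmax,
 * since our own prepared transaction may have deleted them (like tuple
 * (100, 300) in the pg_class example).  Only if this pass finds nothing
 * would we rescan the usual way.
 */
static bool
keep_on_first_pass(const MiniTuple *tup, const MiniSnapshot *snap)
{
    if (tup->xmax != InvalidTransactionId && tup->xmax > snap->xmax)
        return false;
    return true;
}
```

The point of the two passes is that the filtered scan prefers the tuple version our prepared tx created, and the fallback scan preserves today's behaviour for transactions that never touched the row.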
Re: [HACKERS] logical decoding of two-phase transactions
> On 20 Mar 2017, at 11:32, Craig Ringer wrote:
>
> On 19 March 2017 at 21:26, Petr Jelinek wrote:
>
>> I think only genam would need changes to do two-phase scan for this as
>> the catalog scans should ultimately go there. It's going to slow down
>> things but we could limit the impact by doing the two-phase scan only
>> when historical snapshot is in use and the tx being decoded changed
>> catalogs (we already have global knowledge of the first one, and it
>> would be trivial to add the second one as we have local knowledge of
>> that as well).
>
> TBH, I have no idea how to approach the genam changes for the proposed
> double-scan method. It sounds like Stas has some idea how to proceed
> though (right?)
>

I thought about having a special field (or reusing one of the existing fields) in the snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin, as Petr suggested. Then this logic can reside in ReorderBufferCommit(). However, this does not solve the problem with the catcache, so I'm looking into that right now.

> On 17 Mar 2017, at 05:38, Craig Ringer wrote:
>
> On 16 March 2017 at 19:52, Stas Kelvich wrote:
>
>> I’m working right now on issue with building snapshots for decoding prepared tx.
>> I hope I'll send updated patch later today.
>
> Great.
>
> What approach are you taking?

Just as before, I mark this transaction as committed in the snapshot builder, but after decoding I delete it from xip (which holds committed transactions in the case of a historic snapshot).

> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 20 Mar 2017, at 15:17, Craig Ringer wrote:
>
>> I thought about having special field (or reusing one of the existing fields)
>> in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin
>> as Petr suggested. Then this logic can reside in ReorderBufferCommit().
>> However this is not solving problem with catcache, so I'm looking into it
>> right now.
>
> OK, so this is only an issue if we have xacts that change the schema
> of tables and also insert/update/delete to their heaps. Right?
>
> So, given that this is CF3 for Pg10, should we take a step back and
> impose the limitation that we can decode 2PC with schema changes or
> data row changes, but not both?

Yep, time is tight. I'll try today/tomorrow to proceed with this two-scan approach. If I fail to do that in time, then I'll just update this patch to decode only non-DDL 2pc transactions, as you suggested.

>> Just as before I marking this transaction committed in snapbuilder, but after
>> decoding I delete this transaction from xip (which holds committed
>> transactions
>> in case of historic snapshot).
>
> That seems kind of hacky TBH. I didn't much like marking it as
> committed then un-committing it.
>
> I think it's mostly an interface issue though. I'd rather say
> SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or
> something, to make it clear what we're doing.

Yes, that will be less confusing. However, there is no queue of any kind, so SnapBuildStartPrepare / SnapBuildFinishPrepare should work too.

> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
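The SnapBuildStartPrepare / SnapBuildFinishPrepare pair boils down to temporarily adding the prepared xid to the builder's committed-xid array and removing it once decoding is done. A toy model of that bookkeeping (MiniBuilder and the fixed-size array are invented for illustration; the real builder keeps xip sorted and resizable):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define MAX_XIP 16

/* Toy snapshot-builder state: xip holds xids treated as committed. */
typedef struct MiniBuilder
{
    TransactionId xip[MAX_XIP];
    int         xcnt;
} MiniBuilder;

/* Before decoding the PREPARE: make the xact look committed, so the
 * historic snapshot sees its catalog changes. */
static void
snap_start_prepare(MiniBuilder *b, TransactionId xid)
{
    b->xip[b->xcnt++] = xid;
}

/* After decoding: forget it again, so later transactions don't get the
 * prepared-but-uncommitted changes into their snapshots. */
static void
snap_finish_prepare(MiniBuilder *b, TransactionId xid)
{
    for (int i = 0; i < b->xcnt; i++)
    {
        if (b->xip[i] == xid)
        {
            b->xip[i] = b->xip[--b->xcnt];  /* swap-remove */
            return;
        }
    }
}
```

Whatever the final function names, the important property is symmetry: every start must be paired with a finish before any other snapshot is built.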
Re: [HACKERS] logical decoding of two-phase transactions
> On 20 Mar 2017, at 16:39, Craig Ringer wrote: > > On 20 March 2017 at 20:57, Stas Kelvich wrote: >> >>> On 20 Mar 2017, at 15:17, Craig Ringer wrote: >>> >>>> I thought about having special field (or reusing one of the existing >>>> fields) >>>> in snapshot struct to force filtering xmax > snap->xmax or xmin = >>>> snap->xmin >>>> as Petr suggested. Then this logic can reside in ReorderBufferCommit(). >>>> However this is not solving problem with catcache, so I'm looking into it >>>> right now. >>> >>> OK, so this is only an issue if we have xacts that change the schema >>> of tables and also insert/update/delete to their heaps. Right? >>> >>> So, given that this is CF3 for Pg10, should we take a step back and >>> impose the limitation that we can decode 2PC with schema changes or >>> data row changes, but not both? >> >> Yep, time is tight. I’ll try today/tomorrow to proceed with this two scan >> approach. >> If I’ll fail to do that during this time then I’ll just update this patch to >> decode >> only non-ddl 2pc transactions as you suggested. > > I wasn't suggesting not decoding them, but giving the plugin the > option of whether to proceed with decoding or not. > > As Simon said, have a pre-decode-prepared callback that lets the > plugin get a lock on the 2pc xact if it wants, or say it doesn't want > to decode it until it commits. > > That'd be useful anyway, so we can filter and only do decoding at > prepare transaction time of xacts the downstream wants to know about > before they commit. Ah, got that. Okay. > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
> On 23 Mar 2017, at 15:53, Craig Ringer wrote:
>
> On 23 March 2017 at 19:33, Alexey Kondratov wrote:
>
>> (1) Add errors handling to COPY as a minimum program
>
> Huge +1 if you can do it in an efficient way.
>
> I think the main barrier to doing so is that the naïve approach
> creates a subtransaction for every row, which is pretty dire in
> performance terms and burns transaction IDs very rapidly.
>
> Most of our datatype I/O functions, etc, have no facility for being
> invoked in a mode where they fail nicely and clean up after
> themselves. We rely on unwinding the subtransaction's memory context
> for error handling, for releasing any LWLocks that were taken, etc.
> There's no try_timestamptz_in function or anything, just
> timestamptz_in, and it ERROR's if it doesn't like its input. You
> cannot safely PG_TRY / PG_CATCH such an exception and continue
> processing to, say, write another row.
>
> Currently we also don't have a way to differentiate between
>
> * "this row is structurally invalid" (wrong number of columns, etc)
> * "this row is structually valid but has fields we could not parse
> into their data types"
> * "this row looks structurally valid and has data types we could
> parse, but does not satisfy a constraint on the destination table"
>
> Nor do we have a way to write to any kind of failure-log table in the
> database, since a simple approach relies on aborting subtransactions
> to clean up failed inserts so it can't write anything for failed rows.
> Not without starting a 2nd subxact to record the failure, anyway.

If we are optimising COPY for the case with a small number of bad rows, then a second subtransaction seems like not a bad idea. We can try to apply a batch in a subtx; if it fails on some row N, then insert rows [1, N) in the next subtx, report an error, and commit that subtx. Row N + 1 can be treated as the beginning of the next batch.
But if handling everything with subtransactions turns out to be problematic, and since parallelism is mentioned anyway, what about starting a bgworker that performs the data insertion under the control of the backend? For example, the backend could do the following:
* Start a bgworker (or even a parallel worker).
* Get a chunk of rows out of the file and try to apply them as a batch in a subtransaction in the bgworker.
* If that fails, open a transaction in the backend itself and raise a notice / move the failed rows to a special errors table.

> So, having said why it's hard, I don't really have much for you in
> terms of suggestions for ways forward. User-defined data types,
> user-defined constraints and triggers, etc mean anything involving
> significant interface changes will be a hard sell, especially in
> something pretty performance-sensitive like COPY.
>
> I guess it'd be worth setting out your goals first. Do you want to
> handle all the kinds of problems above? Malformed rows, rows with
> malformed field values, and rows that fail to satisfy a constraint? or
> just some subset?
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
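The batch-retry idea can be sketched in miniature. Subtransactions are modelled implicitly here, and insert_row is a hypothetical stand-in for the real per-row work that returns false on a bad row; the skeleton only shows the control flow: try a batch, and when row N fails, redo rows [start, N) in a fresh subtx, record N, and resume at N + 1:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the per-row insert; false = bad row aborts the subtx. */
typedef bool (*InsertRowFn)(int row);

/*
 * Process nrows rows in batches of batch_size.  Records the indexes of
 * bad rows (up to max_bad of them) and returns the total bad-row count.
 */
static int
copy_with_batches(InsertRowFn insert_row, int nrows, int batch_size,
                  int *bad_rows, int max_bad)
{
    int         nbad = 0;
    int         row = 0;

    while (row < nrows)
    {
        int     end = row + batch_size < nrows ? row + batch_size : nrows;
        int     i;

        /* BEGIN subtransaction (implicit in this model) */
        for (i = row; i < end; i++)
        {
            if (!insert_row(i))
                break;          /* subtx aborts at row i */
        }

        if (i < end)
        {
            /* Redo the known-good prefix [row, i) in a fresh subtx. */
            for (int j = row; j < i; j++)
                (void) insert_row(j);
            if (nbad < max_bad)
                bad_rows[nbad] = i;
            nbad++;
            row = i + 1;        /* next batch starts after the bad row */
        }
        else
            row = end;          /* whole batch committed */
    }
    return nbad;
}

/* Demo input for the sketch: rows 3 and 7 are "bad". */
static bool
demo_insert(int row)
{
    return row != 3 && row != 7;
}
```

With a batch size much larger than 1 this burns one extra subtransaction per bad row rather than one per row, which is the whole attraction when bad rows are rare.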
Re: [HACKERS] logical decoding of two-phase transactions
> On 27 Mar 2017, at 12:26, Craig Ringer wrote:
>
> On 27 March 2017 at 09:31, Craig Ringer wrote:
>
>> We're in the last week of the CF. If you have a patch that's nearly
>> ready or getting there, now would be a good time to post it for help
>> and input from others.
>>
>> I would really like to get this in, but we're running out of time.
>>
>> Even if you just post your snapshot management work, with the cosmetic
>> changes discussed above, that would be a valuable start.
>
> I'm going to pick up the last patch and:

I heavily underestimated the amount of changes there, but I have almost finished and will send an updated patch in several hours.

> * Ensure we only add the GID to xact records for 2pc commits and aborts

And only when wal_level >= logical. Done. The patch also adds origin info to prepares and aborts.

> * Add separate callbacks for prepare, abort prepared, and commit
> prepared (of xacts already processed during prepare), so we aren't
> overloading the "commit" callback and don't have to create fake empty
> transactions to pass to the commit callback;

Done.

> * Add another callback to determine whether an xact should be
> processed at PREPARE TRANSACTION or COMMIT PREPARED time.

Also done.

> * Rename the snapshot builder faux-commit stuff in the current patch
> so it's clearer what's going on.

Hm. Okay, I'll leave that part to you.

> * Write tests covering DDL, abort-during-decode, etc

I've extended the tests, but it would be good to have some more.

> Some special care is needed for the callback that decides whether to
> process a given xact as 2PC or not. It's called before PREPARE
> TRANSACTION to decide whether to decode any given xact at prepare time
> or wait until it commits. It's called again at COMMIT PREPARED time if
> we crashed after we processed PREPARE TRANSACTION and advanced our
> confirmed_flush_lsn such that we won't re-process the PREPARE
> TRANSACTION again.
> Our restart_lsn might've advanced past it so we
> never even decode it, so we can't rely on seeing it at all. It has
> access to the xid, gid and invalidations, all of which we have at both
> prepare and commit time, to make its decision from. It must have the
> same result at prepare and commit time for any given xact. We can
> probably use a cache in the reorder buffer to avoid the 2nd call on
> commit prepared if we haven't crashed/reconnected between the two.

Good point. I hadn't thought about restart_lsn in the case when we are skipping this particular prepare (filter_prepared() -> true, in my terms). I think it should work properly, as it uses the same code path as before, but I'll look at it.

> This proposal does not provide a way to safely decode a 2pc xact that
> made catalog changes which may be aborted while being decoded. The
> plugin must lock such an xact so that it can't be aborted while being
> processed, or defer decoding until commit prepared. It can use the
> invalidations for the commit to decide.

I had played with that two-pass catalog scan and it seemed to be working, but after some time I realised that it is not useful for the main case, when the commit/abort is generated only after the receiving side answers the prepares. Also, that two-pass scan is a massive change in relcache.c and genam.c (FWIW there were no problems with the cache, but there were some problems with index scans and with handling one-to-many catalog queries, e.g. a table with its fields). Finally I decided to throw it away and switched to a filter_prepare callback, passing the txn structure there to allow access to the has_catalog_changes field.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 27 Mar 2017, at 16:29, Craig Ringer wrote:
>
> On 27 March 2017 at 17:53, Stas Kelvich wrote:
>
>> I’m heavily underestimated amount of changes there, but almost finished
>> and will send updated patch in several hours.
>
> Oh, brilliant! Please post whatever you have before you knock off for
> the day anyway, even if it's just a WIP, so I can pick it up tomorrow
> my time and poke at its tests etc.
>

Ok, here it is. Major differences compared to the previous version:

* GID is stored in commit/abort records only when wal_level >= logical.

* More consistency in storing and parsing origin info. Now it is stored in prepare and abort records when a replication session is active.

* Some cleanup and function renames to get rid of the xact_even/gid fields in ReorderBuffer, which I used only to copy them to ReorderBufferTXN.

* Changed the output plugin interface to the one suggested upthread. Now prepare/CP/AP are separate callbacks, and if none of them is set then a 2pc tx will be decoded as 1pc, to provide backward compatibility.

* New callback filter_prepare() that can be used to switch between 1pc/2pc styles of decoding a 2pc tx.

* test_decoding uses the new API and filters out aborted and running prepared transactions. It would actually be easy to move the unlock of 2PCState to the prepare callback to allow decoding of a running tx, but since that extension is an example, ISTM it is better not to hold that lock during the whole prepare decoding. However, I left enough information there about this, and about the case when those locks are not needed at all (when we are the ones coordinating this tx). Talking about locking a running prepared tx during decode, I think a better solution would be to use our own custom lock here and register an XACT_EVENT_PRE_ABORT callback in the extension to conflict with that lock. Decode should hold it in shared mode, and commit in exclusive mode. That would allow locking things granularly and blocking only the tx that is being decoded. However, we don't have XACT_EVENT_PRE_ABORT; it is only a few LOCs to add it. Should I?
* It actually doesn't pass one of my regression tests. I've added the expected output as it should be. I'll try to send a follow-up message with a fix, but right now I'm sending it as is, as you asked.

logical_twophase.diff
Description: Binary data

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
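The backward-compatibility rule described above (decode as 1pc unless the plugin supplies the new callbacks and does not filter the xact out) can be captured in a few lines. This is a hypothetical miniature, not the actual output-plugin structure from the patch; the callback names merely echo the ones discussed in the thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the new callback set. */
typedef struct TwoPhaseCallbacks
{
    void    (*prepare_cb)(const char *gid);
    void    (*commit_prepared_cb)(const char *gid);
    void    (*abort_prepared_cb)(const char *gid);
    bool    (*filter_prepare_cb)(const char *gid);  /* true = decode as 1pc */
} TwoPhaseCallbacks;

/*
 * Decide whether a PREPARE is decoded at prepare time (2pc style) or
 * deferred to the commit record (1pc style, the back-compatible path).
 */
static bool
decode_at_prepare_time(const TwoPhaseCallbacks *cb, const char *gid)
{
    if (cb->prepare_cb == NULL)
        return false;           /* plugin predates 2pc: 1pc fallback */
    if (cb->filter_prepare_cb && cb->filter_prepare_cb(gid))
        return false;           /* plugin opted out for this xact */
    return true;
}

/* Demo callbacks for the sketch. */
static void noop_cb(const char *gid) { (void) gid; }
static bool skip_all_cb(const char *gid) { (void) gid; return true; }
```

The key constraint from the thread applies here too: for any given xact the filter must answer the same way at prepare and at commit-prepared time, or the downstream sees the transaction twice (or not at all).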
Re: [HACKERS] logical decoding of two-phase transactions
> On 28 Mar 2017, at 00:19, Stas Kelvich wrote:
>
> * It is actually doesn’t pass one of mine regression tests. I’ve added
> expected output
> as it should be. I’ll try to send follow up message with fix, but right now
> sending it
> as is, as you asked.
>

Fixed. I forgot to postpone the ReorderBufferTxn cleanup in the case of prepare, so it passes the provided regression tests now. I'll give it more testing tomorrow, and I am going to write a TAP test to check the behaviour when we lose the info on whether a prepare was sent to the subscriber or not.

logical_twophase.diff
Description: Binary data

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 28 Mar 2017, at 00:25, Andres Freund wrote:
>
> Hi,
>
> On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
>> Ok, here it is.
>
> On a very quick skim, this doesn't seem to solve the issues around
> deadlocks of prepared transactions vs. catalog tables. What if the
> prepared transaction contains something like LOCK pg_class; (there's a
> lot more realistic examples)? Then decoding won't be able to continue,
> until that transaction is committed / aborted?

But why is that a deadlock? It seems like just a lock. In the case of a prepared lock on pg_class, decoding will wait until it is committed and then continue. The same goes for anything else in postgres that accesses pg_class, including the inability to connect to the database, and bricking the database if you accidentally disconnect before committing that tx (as you showed me a while ago :-). IMO it is an issue of being able to prepare such a lock, rather than an issue of decoding.

Are there any other scenarios where catalog readers are blocked, except an explicit lock on a catalog table? Alters on catalogs seem to be prohibited.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 28 Mar 2017, at 18:08, Andres Freund wrote:
>
> On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
>>
>> That assertion is obviously false... the plugin can resolve this in
>> various ways, if we allow it.
>
> Handling it by breaking replication isn't handling it (e.g. timeouts in
> decoding etc). Handling it by rolling back *prepared* transactions
> (which are supposed to be guaranteed to succeed!), isn't either.
>
>> You can say that in your opinion you prefer to see this handled in
>> some higher level way, though it would be good to hear why and how.
>
> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
> issues mentioned above.
>
>> Bottom line here is we shouldn't reject this patch on this point,
>
> I think it definitely has to be rejected because of that. And I didn't
> bring this up at the last minute, I repeatedly brought it up before.
> Both to Craig and Stas.

Okay. In order to find more realistic cases that block replication, I've created the following setup:

* In the backend: the test_decoding plugin hooks on xact events and utility statement hooks, transforms each commit into a prepare, and then sleeps on a latch. If the transaction contains DDL, the whole statement is pushed into the wal as a transactional message. If the DDL cannot be prepared, or disallows execution in a transaction block, then it goes as a nontransactional logical message without the prepare/decode injection. If the transaction didn't issue any DDL and didn't write anything to the wal, then it skips 2pc too.
* After the prepare is decoded, the output plugin in the walsender unlocks the backend, allowing it to proceed with the commit prepared. So in the case when decoding tries to access a blocked catalog, everything should stop.
* A small python script that consumes decoded wal from the walsender (thanks Craig and Petr).

After some acrobatics with those hooks, I've managed to run the whole regression suite in parallel mode through such a setup of test_decoding without any deadlocks.
I've added two xact_events to postgres and allowed preparing transactions that touched temp tables, since those are heavily used in the tests and create a lot of noise in the diffs. So it boils down to 3 failed regression tests out of 177, namely:

* transactions.sql — here the commit of a tx gets stuck obtaining a SafeSnapshot(). I didn't look at what is happening there specifically, but I did check that the walsender isn't blocked. I'm going to look more closely at this.
* prepared_xacts.sql — here select prepared_xacts() sees our prepared tx. It would be possible to filter them out, but obviously it works as expected.
* guc.sql — here pendingActions arrive on 'DISCARD ALL', preventing the tx from being prepared. I didn't find a way to check for the presence of pendingActions outside of async.c, so I decided to leave it as is.

It seems that, at least in the regression tests, nothing can block two-phase logical decoding. Is that a strong enough argument for the hypothesis that the current approach doesn't create deadlocks, except for locks on the catalog, which should be disallowed anyway?

Patches attached. logical_twophase_v5 is a slightly modified version of the previous patch merged with Craig's changes. The second file is a set of patches on top of the previous one that implements the logic I've just described. There is a runtest.sh script that sets up postgres, runs the python logical consumer in the background, and starts the regression test.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

logical_twophase_v5.diff
Description: Binary data

logical_twophase_regresstest.diff
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 4 Apr 2017, at 04:23, Masahiko Sawada wrote:
>
> I reviewed this patch but when I tried to build contrib/test_decoding
> I got the following error.
>

Thanks! Yes, it seems that 18ce3a4a changed the ProcessUtility_hook signature. Updated.

> There are still some unnecessary code in v5 patch.

Actually, the second diff isn't intended to be part of the patch. I've just shared the way I ran the regression test suite through the 2pc decoding, changing all commits into prepare/commit where the commit happens only after decoding of the prepare is finished (more details in my previous message in this thread).

That is just an argument against Andres' concern that a prepared transaction is able to deadlock with the decoding process — at least there are no such cases in the regression tests. And that concern is the main thing blocking this patch. Except for explicit catalog locks in a prepared tx, nobody has yet found such cases, and that is hard to address or argue about.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

logical_twophase_v6.diff
Description: Binary data

logical_twophase_regresstest.diff
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication worker and statistics
> On 27 Mar 2017, at 18:59, Robert Haas wrote:
>
> On Mon, Mar 27, 2017 at 11:14 AM, Fujii Masao wrote:
>> Logical replication worker should call pgstat_report_stat()?
>> Currently it doesn't seem to do that and no statistics about
>> table accesses by logical replication workers are collected.
>> For example, this can prevent autovacuum from working on
>> those tables properly.
>
> Yeah, that doesn't sound good.

It seems that nobody is working on this, so I'm going to create the patch.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication worker and statistics
> On 10 Apr 2017, at 05:20, Noah Misch wrote:
>
> On Wed, Apr 05, 2017 at 05:02:18PM +0300, Stas Kelvich wrote:
>>> On 27 Mar 2017, at 18:59, Robert Haas wrote:
>>> On Mon, Mar 27, 2017 at 11:14 AM, Fujii Masao wrote:
>>>> Logical replication worker should call pgstat_report_stat()?
>>>> Currently it doesn't seem to do that and no statistics about
>>>> table accesses by logical replication workers are collected.
>>>> For example, this can prevent autovacuum from working on
>>>> those tables properly.
>>>
>>> Yeah, that doesn't sound good.
>>
>> Seems that nobody is working on this, so i’m going to create the patch.
>
> [Action required within three days. This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item. Peter,
> since you committed the patch believed to have created it, you own this open
> item. If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know. Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message. Include a date for your subsequent status update. Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10. Consequently, I will appreciate your efforts
> toward speedy resolution. Thanks.
>
> [1]
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

Here is a small patch to call statistics reporting in the logical worker. Originally I thought that stats collection during logical replication would need to manually account for the amounts of changed tuples, but it seems that this is already smoothly handled at the relation level, so a call to pgstat_report_stat() is enough. Also, I've added statistics checks to the logrep TAP tests, but that is probably quite fragile without something like wait_for_stats() from the regression test stats.sql.
call_pgstat_report_stat.diff Description: Binary data logical_worker_stats_test.diff Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication worker and statistics
> On 10 Apr 2017, at 19:50, Peter Eisentraut > wrote: > > On 4/10/17 05:49, Stas Kelvich wrote: >> Here is small patch to call statistics in logical worker. Originally i >> thought that stat >> collection during logical replication should manually account amounts of >> changed tuples, >> but seems that it is already smoothly handled on relation level. So call to >> pgstat_report_stat() is enough. > > I wonder whether we need a similar call somewhere in tablesync.c. It > seems to work without it, though. I thought it spins up the same worker from worker.c. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
> On 12 Apr 2017, at 20:23, Robert Haas wrote:
>
> On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier wrote:
>> 2017-04-11 Robert Haas :
>>> If the data quality is poor (say, 50% of lines have errors) it's
>>> almost impossible to avoid runaway XID consumption.
>>
>> Yup, that seems difficult to work around with anything similar to the
>> proposed. So the docs might need to suggest not to insert a 300 GB
>> file with 50% erroneous lines :-).
>
> Yep. But it does seem reasonably likely that someone might shoot
> themselves in the foot anyway. Maybe we just live with that.
>

Moreover, if that file consists of one-byte lines (plus one byte for the newline char), then during its import the xid wraparound will happen 18 times =)

I think it's reasonable at least to have something like a max_errors parameter for COPY, set by default to 1000 for example. If the user hits that limit, then that is a good moment to print a warning about possible xid consumption in case of a bigger limit. However, I think it is worth a quick bit of research into whether it is possible to create a special code path for COPY in which errors don't cancel the transaction, at least when COPY is called outside of a transaction block.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
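For the record, the wraparound arithmetic behind the "18 times" remark works out as follows, assuming (as the per-row-subtransaction approach implies) one subtransaction xid is burned per failed row and decimal gigabytes:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Back-of-the-envelope check: a 300 GB file of one-byte lines plus a
 * newline (2 bytes per row), half the rows bad, one xid per failed row.
 */
static int64_t
xid_wraparounds(int64_t file_bytes, int64_t bytes_per_row, double error_rate)
{
    int64_t     rows = file_bytes / bytes_per_row;
    int64_t     failed = (int64_t) (rows * error_rate);
    int64_t     xid_space = INT64_C(1) << 32;   /* 2^32 transaction ids */

    return (failed + xid_space - 1) / xid_space;    /* round up */
}
```

150 billion rows at a 50% error rate gives 75 billion burned xids, and 75e9 / 2^32 ≈ 17.5, i.e. 18 trips around the xid space — which is why a max_errors cap that trips long before that is attractive.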
Re: [HACKERS] Failed recovery with new faster 2PC code
> On 17 Apr 2017, at 12:14, Simon Riggs wrote: > > On 15 April 2017 at 23:37, Jeff Janes wrote: >> After this commit, I get crash recovery failures when using prepared >> transactions. >> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71 >> Author: Simon Riggs >> Date: Tue Apr 4 15:56:56 2017 -0400 >> >>Speedup 2PC recovery by skipping two phase state files in normal path > > Thanks Jeff for your tests. > > So that's now two crash bugs in as many days and lack of clarity about > how to fix it. > > Stas, I thought this patch was very important to you, yet two releases > in a row we are too-late-and-buggy. I’m looking at the pgstat issue in the nearby thread right now and will switch to this shortly. If possible, I’m asking to delay the revert for several days. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c
> On 17 Apr 2017, at 10:30, Erik Rijkers wrote: > > On 2017-04-16 20:41, Andres Freund wrote: >> On 2017-04-16 10:46:21 +0200, Erik Rijkers wrote: >>> On 2017-04-15 04:47, Erik Rijkers wrote: >>> > >>> > 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch + >>> > 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+ >>> > 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch + >>> > 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch + >>> > 0005-Skip-unnecessary-snapshot-builds.patch >>> I am now using these newer patches: >>> https://www.postgresql.org/message-id/30242bc6-eca4-b7bb-670e-8d0458753a8c%402ndquadrant.com >>> > It builds fine, but when I run the old pbench-over-logical-replication >>> > test I get: >>> > >>> > TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: >>> > "pgstat.c", Line: 828) >>> To get that error: >> I presume this is the fault of >> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=139eb9673cb84c76f493af7e68301ae204199746 >> if you git revert that individual commit, do things work again? > > Yes, compiled from 67c2def11d4 with the above 4 patches, it runs flawlessly > again. (flawlessly= a few hours without any error) > I’ve reproduced the failure; it happens under the tablesync worker, and putting pgstat_report_stat() under the previous condition block should help. However, it took me about an hour of running this script to catch the original assert. Can you check with that patch applied? logical_worker.patch Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Logical replication ApplyContext bloat
Hi, currently the logical replication worker uses ApplyContext to decode received data, and that context is never freed during transaction processing. Hence, if the publication side performs something like 10M row inserts in a single transaction, the subscription worker will occupy more than 10 GB of RAM (and can be killed by the OOM killer). The attached patch resets ApplyContext after each insert/update/delete/commit. I’ve tried resetting the context only on each 100/1000/1 value of CommandCounter, but didn’t spot any measurable difference in speed. Problem spotted by Mikhail Shurutov. applycontext_bloat.patch Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication ApplyContext bloat
> On 19 Apr 2017, at 12:37, Petr Jelinek wrote: > > On 18/04/17 13:45, Stas Kelvich wrote: >> Hi, >> >> currently logical replication worker uses ApplyContext to decode received >> data >> and that context is never freed during transaction processing. Hence if >> publication >> side is performing something like 10M row inserts in single transaction, then >> subscription worker will occupy more than 10G of ram (and can be killed by >> OOM). >> >> Attached patch resets ApplyContext after each insert/update/delete/commit. >> I’ve tried to reset context only on each 100/1000/1 value of >> CommandCounter, >> but didn’t spotted any measurable difference in speed. >> >> Problem spotted by Mikhail Shurutov. >> > > Hmm we also do replication protocol handling inside the ApplyContext > (LogicalRepApplyLoop), are you sure this change is safe in that regard? Memory is bloated by the logicalrep_read_* functions when palloc happens inside them. I’ve inserted a context reset in ensure_transaction(), which is only called at the beginning of handle_insert/update/delete/commit, when data from the previous call of that function isn’t needed. So it is probably safe to clear memory there. At least I don’t see any state that could be accessed later in these functions. Do you have any specific concerns? Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication ApplyContext bloat
> On 19 Apr 2017, at 13:34, Simon Riggs wrote: > > On 19 April 2017 at 11:24, Petr Jelinek wrote: > >> I'd still like you to add comment to the ApplyContext saying that it's >> cleaned after handling each message so nothing that needs to survive for >> example whole transaction can be allocated in it as that's not very well >> visible IMHO (and I know I will forget about it when writing next patch >> in that area). > > It would be better to invent the contexts we want now, so we get the > architecture right for future work. Otherwise we have problems with > "can't backpatch this fix because that infrastructure doesn't exist in > PG10" in a couple of years. > > So I suggest we have > > ApplyMessageContext - cleaned after every message > ApplyTransactionContext - cleaned at EOXact > ApplyContext - persistent Right now ApplyContext is cleaned after each transaction, and with this patch I basically suggest cleaning it after each command counter increment. So in your definitions that is ApplyMessageContext, the most short-lived one. We can rename ApplyContext -> ApplyMessageContext now to make that clear and avoid a possible name conflict when somebody decides to keep something for a whole transaction or the worker lifespan. P.S. It also seems to me now that resetting the context in ensure_transaction() isn’t that good an idea; it is probably better to just call reset explicitly at the end of each function involved. > > -- > Simon Riggs http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication ApplyContext bloat
> On 19 Apr 2017, at 14:30, Petr Jelinek wrote: > > On 19/04/17 12:46, Stas Kelvich wrote: >> >> Right now ApplyContext cleaned after each transaction and by this patch I >> basically >> suggest to clean it after each command counter increment. >> >> So in your definitions that is ApplyMessageContext, most short-lived one. We >> can >> rename now ApplyContext -> ApplyMessageContext to make that clear and avoid >> possible name conflict when somebody will decide to keep something during >> whole >> transaction or worker lifespan. > > Yeah we can rename, and then rename the ApplyCacheContext to > ApplyContext. That would leave the ApplyTransactionContext from Simon's > proposal which is not really need it anywhere right now and I am not > sure it will be needed given there is still TopTransactionContext > present. If we really want ApplyTransactionContext we can probably just > assign it to TopTransactionContext in ensure_transaction. > >> >> P.S. It also seems to me now that resetting context in ensure_transaction() >> isn’t that >> good idea, probably better just explicitly call reset at the end of each >> function involved. >> > > Well it would definitely improve clarity. > Okay, done. With the patch, MemoryContextStats() shows the following hierarchy during slot operations in the apply worker: TopMemoryContext: 83824 total in 5 blocks; 9224 free (8 chunks); 74600 used ApplyContext: 8192 total in 1 blocks; 6520 free (4 chunks); 1672 used ApplyMessageContext: 8192 total in 1 blocks; 6632 free (11 chunks); 1560 used ExecutorState: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used apply_contexts.patch Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication ApplyContext bloat
> On 19 Apr 2017, at 16:07, Alvaro Herrera wrote: > > Stas Kelvich wrote: > >> With patch MemoryContextStats() shows following hierarchy during slot >> operations in >> apply worker: >> >> TopMemoryContext: 83824 total in 5 blocks; 9224 free (8 chunks); 74600 used >> ApplyContext: 8192 total in 1 blocks; 6520 free (4 chunks); 1672 used >>ApplyMessageContext: 8192 total in 1 blocks; 6632 free (11 chunks); 1560 >> used >> ExecutorState: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used >>ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used > > Please update src/backend/utils/mmgr/README to list the new contexts. Thanks for noting that. Added a short description of ApplyContext and ApplyMessageContext to the README. apply_contexts_2.patch Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Dynamic instrumentation of lwlock wait times (lwlock flamegraphs)
> On 23 Jun 2017, at 00:08, Andres Freund wrote: > > Hi, > > At pgcon some people were talking about the difficulty of instrumenting > the time actually spent waiting for lwlocks and related measurements. > I'd mentioned that linux these days provides infrastructure to measure > such things in unmodified binaries. > > Attached is a prototype of a script that measures the time spent inside > PGSemaphoreLock(), aggregates that inside the kernel, grouped by pid and > stacktrace. That allows one to generate nice flame graphs showing which > part of the code waits how long for lwlocks. > > The attached script clearly needs improvements, but I thought I'd show > some of the results it can get. To run it you need the the python > library of the 'bcc' project [1], and a sufficiently new kernel. Some > distributions, e.g. newer debian versions, package this as python-bpfcc > and similar. > > The output of the script can be turned into a flamegraph with > https://github.com/brendangregg/FlameGraph 's flamegraph.pl. > > Here's a few examples of a pgbench run. The number is the number of > clients, sync/async indicates synchronous_commit on/off. All the > numbers here were generated with the libpq & pgbench batch patches > applied and in use, but that's just because that was the state of my > working tree. > > http://anarazel.de/t/2017-06-22/pgsemwait_8_sync.svg > http://anarazel.de/t/2017-06-22/pgsemwait_8_async.svg > http://anarazel.de/t/2017-06-22/pgsemwait_64_sync.svg > http://anarazel.de/t/2017-06-22/pgsemwait_64_async.svg > > A bonus, not that representative one is the wait for a pgbench readonly > run after the above, with autovacuum previously disabled: > http://anarazel.de/t/2017-06-22/pgsemwait_64_select.svg > interesting to see how the backends themselves never end up having to > flush WAL themselves, or at least not in a manner triggering contention. > > I plan to write a few more of these, because they're hugely useful to > understand actual locking behaviour. 
Among them: > - time beteen Acquire/Release of lwlocks, grouped by lwlock > - time beteen Acquire/Release of lwlocks, grouped by stack > - callstack of acquirer and waker of lwlocks, grouped by caller stack, waiter > stack > - ... > > I think it might be interesting to collect a few of these somewhere > centrally once halfway mature. Maybe in src/tools or such. Wow, that’s extremely helpful, thanks a lot. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transactions involving multiple postgres foreign servers
> On 27 Jul 2017, at 04:28, Robert Haas wrote: > > However, you would not > be guaranteed that all of those commits or rollbacks happen at > anything like the same time. So, you would have a sort of eventual > consistency. As far as I understand, any solution that provides proper isolation for distributed transactions in postgres will require distributed 2PC somewhere under the hood. That is just a consequence of the parallelism that the database allows: transactions can abort due to concurrent operations. So the dichotomy is simple: either we need 2PC, or we restrict write transactions to be physically serial. In particular, both Postgres-XL/XC and postgrespro multimaster use 2PC to commit distributed transactions. Some years ago we created patches to implement a transaction manager API, and that is just a way to inject consistent snapshots on different nodes; atomic commit itself is out of scope of the TM API (hmm, maybe it is better to call it a snapshot manager API?). That allows us to use it in quite different environments like fdw and logical replication, and both use 2PC. I want to submit the TM API again during this release cycle, along with an implementation for fdw. And I planned to base it on top of this patch. So I have already rebased Masahiko’s patch to current -master and started writing a long list of nitpicks, but haven't finished yet. Also, I see quite a big value in this patch even without tm/snapshots/whatever. Right now fdw guarantees neither isolation nor atomicity. And if one isn’t doing cross-node analytical transactions, it will be safe to live without isolation. But living without atomicity means that some parts of the data can be lost without a simple way to detect and fix that. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transactions involving multiple postgres foreign servers
> On 31 Jul 2017, at 20:03, Robert Haas wrote: > > Regardless of whether we share XIDs or DXIDs, we need a more complex > concept of transaction state than we have now. It seems that the discussion has shifted from 2PC itself to general issues with distributed transactions. So it is probably appropriate to share here a summary of the things we have done in the area of distributed visibility. During the last two years we tried three quite different approaches and finally settled on Clock-SI. At first, to test the different approaches, we did a small patch that wraps calls to visibility-related functions (SetTransactionStatus, GetSnapshot, etc. Described in detail at the wiki [1]) in order to allow overloading them from an extension. Such an approach allows implementing almost anything related to distributed visibility, since you have full control over how local visibility is done. That API isn’t a hard prerequisite, and if one wants to create some concrete implementation it can be done just in place. However, I think it is good to have such an API in some form. So, the three approaches that we tried: 1) Postgres-XL-like: That is the most straightforward way. Basically, we need a separate network service (GTM/DTM) that is responsible for xid generation and for managing the running list of transactions. So acquiring an xid and a snapshot is done by network calls. Because of the shared xid space it is possible to compare xids in the ordinary way and get the right order. The gap between non-simultaneous commits by 2PC is covered by the fact that we get our snapshots from the GTM, and it removes an xid from the running list only when the transaction is committed on both nodes. Such an approach is okay for OLAP-style transactions where tps isn’t high. But for OLTP with a high transaction rate the GTM immediately becomes a bottleneck, since even write transactions need to get a snapshot from the GTM, even if they access only one node. 
2) Incremental SI [2]: An approach with a central coordinator that can allow local reads without network communication by slightly altering the visibility rules. Apart from the fact that it is kind of patented, we also failed to achieve proper visibility by implementing the algorithms from that paper. It always showed some inconsistencies. Maybe because of bugs in our implementation, maybe because of some typos/mistakes in the algorithm description itself. The reasoning in the paper wasn’t very clear to us, and there are the patent issues as well, so we just left it. 3) Clock-SI [3]: This is an MS research paper that describes an algorithm similar to the ones used in Spanner and CockroachDB, without a central GTM and with reads that do not require a network roundtrip. There are two ideas behind it: * Assuming snapshot isolation where visibility on a node is based on CSN, use local time as the CSN; then, when you are doing 2PC, collect the prepare time from all participating nodes and commit the transaction everywhere with the maximum of those times. If during a read a node faces tuples committed by a tx with a CSN greater than its snapshot CSN (that can happen due to time desynchronisation between nodes), then it just waits until that time comes. So time desynchronisation can affect performance, but can’t affect correctness. * During a distributed commit the transaction is neither running (if it commits, then the tuple should already be visible) nor committed/aborted (it still can be aborted, so it is illegal to read). So here an IN-DOUBT transaction state appears, where readers should wait for writers. We managed to implement that using the mentioned XTM API. The XID<->CSN mapping is accounted for by the extension itself. Speed/scalability are also good. I want to resubmit an implementation of that algorithm for FDW later in August, along with some isolation tests based on the set of queries in [4]. 
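The first Clock-SI idea above (commit everywhere with the maximum of the collected prepare times, and let readers wait out clock skew) can be sketched roughly as follows. This is a toy illustration, not the XTM API or any of the mentioned patches: the Node class, its methods, and the flat CSN bookkeeping are all invented here.

```python
class Node:
    """Toy participant with a skewed local clock; its CSN is just local time."""
    def __init__(self, name, skew):
        self.name = name
        self.skew = skew
        self.now = 0  # "true" time, driven manually in this sketch

    def local_time(self):
        return self.now + self.skew

    def prepare(self, xid):
        # On PREPARE, each participant reports its local prepare time.
        return self.local_time()

def distributed_commit(nodes, xid):
    # Collect prepare times from all participants; the commit CSN is their
    # maximum, so the commit never lands "in the past" of any participant.
    prepare_times = [n.prepare(xid) for n in nodes]
    return max(prepare_times)

def read_tuple(node, snapshot_csn, tuple_csn):
    # A tuple is visible iff its commit CSN is <= the reader's snapshot CSN.
    # If the tuple's CSN is ahead of this node's clock (skew), the reader
    # must wait until local time passes it -- modeled as a flag here.
    visible = tuple_csn <= snapshot_csn
    must_wait = tuple_csn > node.local_time()
    return visible, must_wait
```

For example, three nodes with clock skews 0, +5 and -3 preparing at true time 100 report 100, 105 and 97, so the transaction commits everywhere with CSN 105; a reader on the slow node then has to wait for its clock to reach 105, which is the performance-not-correctness cost mentioned above.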
[1] https://wiki.postgresql.org/wiki/DTM#eXtensible_Transaction_Manager_API [2] http://pi3.informatik.uni-mannheim.de/~norman/dsi_jour_2014.pdf [3] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/samehe-clocksi.srds2013.pdf [4] https://github.com/ept/hermitage Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication ApplyContext bloat
> On 20 Apr 2017, at 17:01, Dilip Kumar wrote: > > On Thu, Apr 20, 2017 at 7:04 PM, Stas Kelvich > wrote: >> Thanks for noting. >> >> Added short description of ApplyContext and ApplyMessageContext to README. > > Typo > > /analysys/analysis > Fixed, thanks. Added this to the open items list. apply_contexts_3.patch Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication ApplyContext bloat
> On 9 May 2017, at 21:53, Peter Eisentraut > wrote: > > On 5/8/17 11:26, Peter Eisentraut wrote: >> I think it would make more sense to reset ApplyMessageContext in >> apply_dispatch(), so that it is done consistently after every message >> (as the name implies), not only some messages. > > Committed that. > >> Also, perhaps ApplyMessageContext should be a child of >> TopTransactionContext. (You have it as a child of ApplyContext, which >> is under TopMemoryContext.) > > Left that as is. Thanks! Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] snapbuild woes
> On 10 May 2017, at 11:43, Petr Jelinek wrote: > > On 09/05/17 22:11, Erik Rijkers wrote: >> On 2017-05-09 21:00, Petr Jelinek wrote: >>> On 09/05/17 19:54, Erik Rijkers wrote: >>>> On 2017-05-09 11:50, Petr Jelinek wrote: >>>> >>> >>> Ah okay, so this is same issue that's reported by both Masahiko Sawada >>> [1] and Jeff Janes [2]. >>> >>> [1] >>> https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com >>> >>> [2] >>> https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com >>> >> >> I don't understand why you come to that conclusion: both Masahiko Sawada >> and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't. >> Isn't that a real difference? >> > Just noted one more issue/feature of the snapshot builder: when logical decoding is in progress and a vacuum full command is issued, quite a big number of files appears in pg_logical/mappings and resides there until the checkpoint. Also, consecutive vacuum fulls before a checkpoint yield even more files. Having two pgbench-filled instances with logical replication between them: for i in {1..100}; do psql -c 'vacuum full' && ls -la pg_logical/mappings | wc -l; done; VACUUM 73 VACUUM 454 VACUUM 1146 VACUUM 2149 VACUUM 3463 VACUUM 5088 <...> VACUUM 44708 <…> // checkpoint happens somewhere here VACUUM 20921 WARNING: could not truncate file "base/16384/61773": Too many open files in system ERROR: could not fsync file "pg_logical/mappings/map-4000-4df-0_A4EA29F8-5aa5-5ae6": Too many open files in system I’m not sure whether this boils down to some of the previous issues mentioned here or not, so posting it here as an observation. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 7 Sep 2017, at 18:58, Nikhil Sontakke wrote: > > Hi, > > FYI all, wanted to mention that I am working on an updated version of > the latest patch that I plan to submit to a later CF. > Cool! So what kind of architecture do you have in mind? The same way as it was implemented before? As far as I remember there were two main issues: * Decoding of an aborted prepared transaction. If such a transaction modified the catalog then we can’t read reliable info with our historic snapshot: since the clog already has the aborted bit for our tx, it will break the visibility logic. There are some ways to deal with that: by doing a catalog seq scan twice and counting the number of tuples (details upthread), or by hijacking clog values in the historic visibility function. But ISTM it is better not to solve this issue at all =) In most cases the intended usage of decoding a 2PC transaction is to do some form of distributed commit, so naturally decoding will happen only with in-progress transactions, and commit/abort will happen only after the transaction is decoded, sent, and a response is received. So we can just have an atomic flag that prevents commit/abort of a tx currently being decoded. And we can filter interesting prepared transactions based on the GID, to avoid holding this lock for ordinary 2PC. * Possible deadlocks that Andres was talking about. I spent some time trying to find one, but didn’t find any. If locking pg_class in a prepared tx is the only example, then (imho) it is better to just forbid preparing such transactions. Otherwise, if realistic examples that can block decoding actually exist, then we probably need to reconsider the way a tx is decoded. Anyway, this part probably needs Andres' blessing. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Cube extension kNN support
Hello, hackers. Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances. The following distance operators are introduced: <#> taxicab distance <-> euclidean distance <=> chebyshev distance For example: SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 10; Also there is an operator "->" for selecting ordered rows directly from the index. This query selects rows ordered ascending by the 3rd coordinate: SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10; For descending ordering, the suggested syntax is a minus before the coordinate. This query selects rows ordered descending by the 4th coordinate: SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10; Stas Kelvich. distances.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
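For reference, the three metrics behind the new operators can be written down in plain Python for point cubes (this is a sketch of the formulas only, not the extension's C code, which also handles cube-to-cube distances):

```python
from math import sqrt

def taxicab(a, b):
    """<#> operator: L1 distance, the sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """<-> operator: L2 distance, the square root of the sum of squares."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    """<=> operator: L-infinity distance, the maximum absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))
```

So the example query above, ORDER BY objects.coord <-> '(137,42,314)'::cube, orders rows by euclidean((x, y, z), (137, 42, 314)), with the GiST index supplying the rows in that order.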
Re: [HACKERS] Cube extension point support // GSoC'13
Hello There is a new version of the patch. I have separated the ordering operators into a different patch (https://commitfest.postgresql.org/action/patch_view?id=1243), fixed formatting issues and implemented backward compatibility with old-style points in cube_is_point() and cube_out(). Also, comparing the output files I've discovered that these four files are a combination of two types of differing behavior: 1) SELECT '-1e-700'::cube AS cube; can be (0) or (-0) 2) The number of zeros in the exponent of a floating point value, i.e. SELECT '1e27'::cube AS cube; can be (1e+027) or (1e+27) On my system (OSX) it is the second option in both cases. I've also tested it on FreeBSD 9.0 and Ubuntu 12.04 with the same results. So are there any ideas on how I can reproduce such results? Stas. points.diff Description: Binary data On Sep 16, 2013, at 10:48 AM, Heikki Linnakangas wrote: > On 12.07.2013 14:57, Stas Kelvich wrote: >> Hello. >> >> here is a patch adding to cube extension support for compressed >> representation of point cubes. If cube is a point, i.e. has coincident lower >> left and upper right corners, than only one corner is stored. First bit of >> the cube header indicates whether the cube is point or not. Few moments: >> >> * Patch preserves binary compatibility with old indices >> * All functions that create cubes from user input, check whether it is a >> point or not >> * All internal functions that can return cubes takes care of all cases where >> a cube might become a point > > Great! > > cube_is_point() needs to still handle old-style points. An NDBOX without the > point-flag set, where the ll and ur coordinates for each dimension are the > same, still needs to be considered a point. Even if you are careful to never > construct such structs in the code, they can be present on-disk if you have > upgraded from an earlier version with pg_upgrade. Same in cube_out(). > >> * Added tests for checking correct point behavior > > You'll need to adjust all the expected output files, not only cube_1.out. 
> >> Also this patch includes adapted Alexander Korotkov's patch with kNN-based >> ordering operator, which he wrote for postgresql-9.0beta1 with knngist >> patch. More info there >> http://www.postgresql.org/message-id/aanlktimhfaq6hcibrnk0tlcqmiyhywhwaq2zd87wb...@mail.gmail.com > > To make review easier, it would be better to keep that as a separate patch, > actually. Could you split it up again, please? > > - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Cube extension types support
Hello, hackers. In this patch I've implemented support for different storage types for cubes. Now it supports float8, float4, int4, int2, int1. The type is stored in the header of each cube, one for all coordinates. So for cubes with int1 coordinates it can save up to 8x disk space. Typed cubes can be created in two ways: 1) Via the cube_suffix() functions, where the suffix can be "f4", "i4", "i2", "i1", and the arguments are one or two numbers or arrays, i.e. # select cube_i1(1,2) as c; c (1),(2):i1 # select cube_f4(array[1,2,3], array[5,6,7]) as c; c (1, 2, 3),(5, 6, 7):f4 2) Via a modifier at the end of the string that will be cast to cube, i.e. # select '(1,2,3):i2'::cube as c; c -- (1, 2, 3):i2 When no modifier is given, float8 is used by default. Old-style cubes without a type in the header are also treated as float8 for backward compatibility. This patch changes a lot of things in the code, so it interferes with other patches to the cube extension. I will update this patch if other patches are accepted for commit. types_cumulative.diff Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
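The claimed up-to-8x saving is simply the per-coordinate element width: 8 bytes for float8 versus 1 byte for int1. A quick sketch of the packing arithmetic (the layout here is purely illustrative, not the extension's actual NDBOX on-disk format):

```python
import struct

# Pack the coordinates of a 3-D point cube with two different element types.
coords = (1, 2, 3)
as_float8 = struct.pack('<3d', *coords)  # 8 bytes per coordinate (the default)
as_int1   = struct.pack('<3b', *coords)  # 1 byte per coordinate

print(len(as_float8), len(as_int1))      # 24 vs 3 bytes for the coordinate data
```

The per-cube header (which now also carries the type tag) is a fixed overhead on top of this, so real savings approach 8x only as the number of dimensions grows.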
[HACKERS] Cube extension split algorithm fix
Hello, hackers. I've fixed the split algorithm that was implemented in the cube extension. I've changed it according to the original Guttman paper (the old version was a simpler algorithm) and also ported Alexander Korotkov's algorithm from box datatype indexing, which works faster and better on low dimensions. On my test dataset (1M records, 7 dimensions, a real-world database of goods) the numbers were the following: Building an index over the table (on an expression): old: 67.296058 seconds new: 48.842391 seconds Cube point search, mean, 100 queries: old: 0.001025 seconds new: 0.000427 seconds Size of the index on the field: old: 562 MB new: 283 MB Stas. split.diff Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Tsvector editing functions
Hello. Here is a patch that adds some editing routines for the tsvector type. tsvector delete(tsvector, text): removes an entry from a tsvector by lexeme name. set unnest(tsvector): expands a tsvector into a set of rows; each row has the following columns: lexeme, postings, weights. text[] to_array(tsvector): converts a tsvector to an array of lexemes. tsvector to_tsvector(text[]): converts an array of lexemes to a tsvector. tsvector_funcs.diff Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
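The intended semantics of the four routines can be sketched over a toy model of a tsvector as a dict mapping lexemes to (position, weight) pairs. This is purely illustrative pseudocode for what the patch's SQL functions do, not the C implementation:

```python
# Toy tsvector model: {lexeme: [(position, weight), ...]}.

def delete(tsv, lexeme):
    """delete(tsvector, text): drop the entry for one lexeme."""
    return {lex: pos for lex, pos in tsv.items() if lex != lexeme}

def unnest(tsv):
    """unnest(tsvector): one row per lexeme as (lexeme, postings, weights)."""
    return [(lex, [p for p, _ in pos], [w for _, w in pos])
            for lex, pos in sorted(tsv.items())]

def to_array(tsv):
    """to_array(tsvector): just the lexemes, as a sorted array."""
    return sorted(tsv)

def to_tsvector(lexemes):
    """to_tsvector(text[]): a tsvector of bare lexemes, no positions/weights."""
    return {lex: [] for lex in lexemes}
```

For example, deleting 'dog' from a vector containing 'cat' and 'dog' leaves only the 'cat' entry, and unnest turns each remaining lexeme into a row of its positions and weights.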
Re: [HACKERS] Cube extension kNN support
Hello. Here is an updated version of the patch with proper update scripts. Also I’ve noted that the documentation states the wrong thing: “It does not matter which order the opposite corners of a cube are entered in. The cube functions automatically swap values if needed to create a uniform "lower left — upper right" internal representation." But in practice cubes are stored "as is", and that leads to problems with getting cubes sorted along a specific dimension directly from the index. As the simplest workaround I’ve deleted that sentence from the docs and implemented two coordinate getters, -> and ~>. The first one returns the coordinate of the cube as it is stored, and the second returns the coordinate of the cube normalised to (LL,UR) form. Another way to fix this is to force ’normalization’ while creating the cube. But that can produce wrong sorts with already existing data. > On 09 Jul 2015, at 16:40, Alexander Korotkov wrote: > > Hi! > > On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich wrote: > Patch is pretty ready, last issue was about changed extension interface, so > there should be migration script and version bump. > Attaching a version with all migration stuff. > > I can't see cube--1.0--1.1.sql in the patch. Did forget to include it? > > -- > Alexander Korotkov > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company distances.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cube extension kNN support
Documentation along with style fix. distances2r3.patch Description: Binary data > On 08 Feb 2015, at 00:32, Alexander Korotkov wrote: > > Hi! > > On Sat, Feb 7, 2015 at 12:45 PM, Stas Kelvich wrote: > I had updated old patch with kNN operators for cube data structures. Copying > description from old message: > > Following distance operators introduced: > > <#> taxicab distance > <-> euclidean distance > <=> chebyshev distance > > For example: > SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT > 10; > > Also there is operator "->" for selecting ordered rows directly from index. > This request selects rows ordered ascending by 3rd coordinate: > > SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10; > > For descendent ordering suggested syntax with minus before coordinate. > This request selects rows ordered descending by 4th coordinate: > > SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10; > > I've checked the patch. The first notes are so: > 1) Check coding style, in particular braces. Postgres coding style require > using it for multiline statements. > 2) Update documentation according to new features. > > -- > With best regards, > Alexander Korotkov. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
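The three distance operators quoted above correspond to the standard L1, L2, and L-infinity metrics. A minimal Python sketch of the math over point cubes (treating each as a coordinate tuple) makes the mapping explicit; this models only the metrics, not the contrib/cube operator implementations.

```python
# Distance metrics behind the proposed cube operators.

def taxicab(a, b):
    """<#> : L1 (taxicab) distance, sum of per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """<-> : L2 (euclidean) distance, square root of summed squares."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def chebyshev(a, b):
    """<=> : L-infinity (chebyshev) distance, largest per-dimension difference."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A kNN GiST search with any of these returns rows in increasing distance from the query point, which is what makes `ORDER BY coord <-> '(137,42,314)'::cube LIMIT 10` an index-only ordered scan.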
Re: [HACKERS] Cube extension kNN support
Hi! Patch is pretty ready, last issue was about changed extension interface, so there should be migration script and version bump. Attaching a version with all migration stuff. distances2r4.patch Description: Binary data Stas. > On 09 May 2015, at 05:20, Alvaro Herrera wrote: > > Sergey Konoplev wrote: >> On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev wrote: >>> On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich >>> wrote: >>>> Here is the patch that introduces kNN search for cubes with euclidean, >>>> taxicab and chebyshev distances. >>> >>> What is the status of this patch? >> >> Referring to our private conversation with Alexander Korotkov, the >> patch is in WIP state currently, and, hopefully, will be ready by 9.5. >> I'm ready to actively participate in its testing on a real world >> production set of data. > > This patch doesn't seem to have received an updated version. Should we > just punt on it? The assumption would be that Stas or Alexander will be > re-submitting this for 9.6. > > -- > Álvaro Herrerahttp://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
Hello. Done with the list of suggestions. I've also heavily edited the delete function. tsvector_ops.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company > On 27 Nov 2015, at 15:13, Teodor Sigaev wrote: > > Hmm, seems, it will be useful to add two fuctions: > > tsvector filter(tsvector, array_of_weigths) - returns tsvector contains > lexemes with given weights > > tsvector setweight(tsvector, weigth, array_of_lexemes) - sets given weight > for given lexemes > > Stas Kelvich wrote: >> Hello. >> >> There is patch that adds some editing routines for tsvector type. >> >> tsvector delete(tsvector, text) >> removes entry from tsvector by lexeme name >> set unnest(tsvector) >> expands a tsvector to a set of rows. Each row has following columns: >> lexeme, postings, weights. >> text[] to_array(tsvector) >> converts tsvector to array of lexemes >> tsvector to_tsvector(text[]) >> converts array of lexemes to tsvector >> >> Stas Kelvich >> Postgres Professional: http://www.postgrespro.com >> The Russian Postgres Company > > -- > Teodor Sigaev E-mail: teo...@sigaev.ru > WWW: http://www.sigaev.ru/
Re: [HACKERS] Cube extension kNN support
Hello, fixed. cube_distances.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company > On 01 Dec 2015, at 17:52, Teodor Sigaev wrote: > > Patch looks good, but there ara some review notices: > 1 gmake installcheck fails: > *** /.../pgsql/contrib/cube/expected/cube_1.out 2015-12-01 > 17:49:01.768764000 +0300 > --- /.../pgsql/contrib/cube/results/cube.out 2015-12-01 > 17:49:12.190818000 +0300 > *** > *** 1382,1388 > (1 row) > > -- Test of distances > ! -- > SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube); > cube_distance > --- > --- 1382,1388 > (1 row) > > -- Test of distances > ! -- > SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube); > cube_distance > > Seems, there a extra space at the end of string > > 2 Pls, don use in C-code magick constants like 'case 16:'. Use macros to > define some human-readable name (g_cube_distance()) > > 3 Switch in g_cube_distance(): default switch path should generate a error. > It just simplifies a degbug process, may be in future. > > 4 Docs: pls, don't use a strings with unlimited length. > > > Stas Kelvich wrote: >> Hello. >> >> That is updated version of the patch with proper update scripts. >> >> Also i’ve noted that documentation states the wrong thing: >> >> “It does not matter which order the opposite corners of a cube are entered >> in. The cube functions automatically swap values if needed to create a >> uniform "lower left — upper right" internal representation." >> >> But in practice cubes stored "as is" and that leads to problems with getting >> cubes sorted along specific dimension directly from index. >> As a simplest workaround i’ve deleted that sentence from docs and >> implemented two coordinate getters -> and ~>. First one returns >> coordinate of cube as it stored, and second returns coordinate of cube >> normalised to (LL,UR)-form. >> >> Other way to fix thing is to force ’normalization’ while creating cube. 
But >> that can produce wrong sorts with already existing data. >> >>> On 09 Jul 2015, at 16:40, Alexander Korotkov wrote: >>> >>> Hi! >>> >>> On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich wrote: >>> Patch is pretty ready, last issue was about changed extension interface, so >>> there should be migration script and version bump. >>> Attaching a version with all migration stuff. >>> >>> I can't see cube--1.0--1.1.sql in the patch. Did forget to include it? >>> >>> -- >>> Alexander Korotkov >>> Postgres Professional: http://www.postgrespro.com >>> The Russian Postgres Company >> >> Stas Kelvich >> Postgres Professional: http://www.postgrespro.com >> The Russian Postgres Company >> >> >> >> > > -- > Teodor Sigaev E-mail: teo...@sigaev.ru > WWW: http://www.sigaev.ru/ > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Speedup twophase transactions
Hello. While working on cluster stuff (DTM, tsDTM) we noticed that postgres 2PC transactions are approximately two times slower than an ordinary commit on a workload with fast transactions — a few single-row updates and COMMIT or PREPARE/COMMIT. Perf top showed that a lot of time is spent in the kernel on fopen/fclose, so it was worth trying to reduce file operations for 2PC transactions.

Now 2PC in postgres does the following:
* on prepare, 2PC data (subxacts, commitrels, abortrels, invalmsgs) is saved to xlog and to a file, but the file is not fsynced
* on commit, the backend reads the data from the file
* if a checkpoint occurs before commit, the files are fsynced during the checkpoint
* in case of a crash, replay will move the data from xlog to files

In this patch I've changed this procedure to the following:
* on prepare, the backend writes the data only to xlog and stores a pointer to the start of the xlog record
* if commit occurs before a checkpoint, the backend reads the data back from xlog by this pointer
* on checkpoint, 2PC data is copied to files and fsynced
* if commit happens after a checkpoint, the backend reads the files
* in case of a crash, replay will move the data from xlog to files (as it was before the patch)

Most of these ideas were already mentioned in a 2009 thread by Michael Paquier http://www.postgresql.org/message-id/c64c5f8b0908062031k3ff48428j824a9a46f2818...@mail.gmail.com where he suggested storing 2PC data in shared memory. At that time the patch was declined because no significant speedup was observed. Now I see a performance improvement of about 60% from my patch. Probably the old benchmark's overall tps was lower, so it was harder to hit filesystem fopen/fclose limits.
Now the results of the benchmark are as follows (dual 6-core Xeon server):

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps

The benchmark was done with the following script:

\set naccounts 10 * :scale
\setrandom from_aid 1 :naccounts
\setrandom to_aid 1 :naccounts
\setrandom delta 1 100
\set scale :scale+1
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';

2pc_xlog.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
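The patched prepare/checkpoint/commit flow described above can be sketched as a toy state machine. This is purely illustrative Python (all names invented, no relation to the postgres source): PREPARE appends the state to a WAL-like log and keeps only its start LSN in memory; COMMIT before a checkpoint reads it back from the log, while a checkpoint copies still-pending state to "files" so a later COMMIT reads the file instead.

```python
# Toy model of the patched 2PC state flow (not postgres code).

class TwoPhaseState:
    def __init__(self):
        self.wal = []        # append-only log; index doubles as the LSN
        self.files = {}      # gid -> payload, written+fsynced at checkpoint
        self.pending = {}    # gid -> LSN of its prepare record

    def prepare(self, gid, payload):
        """Write 2PC data only to the log; remember just the start LSN."""
        lsn = len(self.wal)
        self.wal.append(payload)
        self.pending[gid] = lsn

    def checkpoint(self):
        """Copy state of still-pending transactions to files."""
        for gid, lsn in self.pending.items():
            self.files[gid] = self.wal[lsn]

    def commit(self, gid):
        """Read from a file if a checkpoint already flushed it, else from the log."""
        payload = self.files.pop(gid, None)
        if payload is None:
            payload = self.wal[self.pending[gid]]
        del self.pending[gid]
        return payload
```

The fast path (commit before any checkpoint) never touches a file, which is where the claimed reduction in fopen/fclose traffic comes from.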
Re: [HACKERS] Speedup twophase transactions
Thanks, Kevin.

> I assume that last one should have been *Patched master with 2PC”?

Yes, the list should look like this:

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Patched master with 2PC: ~36 ktps

I've also created a CommitFest entry for this patch.

-- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

> On 10 Dec 2015, at 00:37, Kevin Grittner wrote: > > On Wed, Dec 9, 2015 at 12:44 PM, Stas Kelvich > wrote: > >> Now 2PC in postgres does following: >> * on prepare 2pc data (subxacts, commitrels, abortrels, invalmsgs) saved to >> xlog and to file, but file not is not fsynced >> * on commit backend reads data from file >> * if checkpoint occurs before commit, then files are fsynced during >> checkpoint >> * if case of crash replay will move data from xlog to files >> >> In this patch I’ve changed this procedures to following: >> * on prepare backend writes data only to xlog and store pointer to the start >> of the xlog record >> * if commit occurs before checkpoint then backend reads data from xlog by >> this pointer >> * on checkpoint 2pc data copied to files and fsynced >> * if commit happens after checkpoint then backend reads files >> * in case of crash replay will move data from xlog to files (as it was >> before patch) > > That sounds like a very good plan to me. > >> Now results of benchmark are following (dual 6-core xeon server): >> >> Current master without 2PC: ~42 ktps >> Current master with 2PC: ~22 ktps >> Current master with 2PC: ~36 ktps > > I assume that last one should have been *Patched master with 2PC"? > > Please add this to the January CommitFest. > > -- > Kevin Grittner > EDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Re: [HACKERS] Speedup twophase transactions
Michael, Jeff, thanks for reviewing and testing.

> On 10 Dec 2015, at 02:16, Michael Paquier wrote: > > This has better be InvalidXLogRecPtr if unused.

Yes, that's better. Changed.

> On 10 Dec 2015, at 02:16, Michael Paquier wrote: > +if (gxact->prepare_lsn) > +{ > +XlogReadTwoPhaseData(gxact->prepare_xlogptr, &buf, NULL); > +} > Perhaps you mean prepare_xlogptr here?

Yes, my bad. But funnily enough, I made this error an even number of times: the code in CheckPointTwoPhase also uses prepare_lsn instead of prepare_xlogptr, so overall this was working correctly, which is why it survived my own tests and probably Jeff's tests. I think that's just bad variable naming; for example, lsn in pg_xlogdump points to the start of the record, but here the start was used as the xlogptr and the end as the lsn. So I've renamed both variables to prepare_start_lsn and prepare_end_lsn.

> On 10 Dec 2015, at 09:48, Jeff Janes wrote: > I've tested this through my testing harness which forces the database > to go through endless runs of crash recovery and checks for > consistency, and so far it has survived perfectly.

Cool! I think the patch is most vulnerable to the following type of workload: prepare a transaction, do a lot of stuff with the database to force checkpoints (or even recovery cycles), and then commit it.

> On 10 Dec 2015, at 09:48, Jeff Janes wrote: > Can you give the full command line? -j, -c, etc.

pgbench -h testhost -i && pgbench -h testhost -f 2pc.pgb -T 300 -P 1 -c 64 -j 16 -r

where 2pc.pgb is as in the previous message. Also, all this applies to hosts with uniform memory access. I tried running the patched postgres on a NUMA machine with 60 physical cores and the patch didn't change anything there. Perf top shows that the main bottleneck is access to gxact, but on an ordinary host with one or two CPUs that access isn't even among the top ten heaviest routines.

> On 10 Dec 2015, at 09:48, Jeff Janes wrote: > Why are you incrementing :scale ?

That's a funny part: overall 2PC speed depends on how you name your prepared transactions. Concretely, I tried using random numbers for GIDs and it was slower than a constantly incrementing GID. That probably happens due to the linear search by GID in the gxact array on commit. So I used :scale just as a counter, because it is initialised on pgbench start and a line like “\set scale :scale+1” works well. (Maybe there is a different way to do this in pgbench.)

> I very rapidly reach a point where most of the updates are against > tuples that don't exist, and then get integer overflow problems.

Hmm, that's strange. Probably you set scale to a value so big that 10*:scale is bigger than int4 allows? But I thought that pgbench would change the aid columns to bigint if scale is more than 2. 2pc_xlog.v2.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Cube extension kNN support
Hi, thanks for the review.

> 1) (nitpicking) There seem to be some minor whitespace issues, i.e. > trailing spaces, empty lines being added/removed, etc.

Fixed, I think.

> 2) one of the regression tests started to fail
>
> SELECT '-1e-700'::cube AS cube;
>
> This used to return (0) but now I get (-0).

Actually that problem emerged because of the first problem. I had extra whitespace in the sql file and removed that whitespace from one of the answer files (cube_1.sql), so the diff against both cube.sql and cube_1.sql was one line long, and you saw the diff against cube.sql. On all systems available to me (osx/linux/freebsd) the matching answer file is cube_1.sql. But on other OSes you can get +/- 0 or e27/e027. I edited those answer files manually, so there may be some other typos.

> 3) I wonder why the docs were changed like this:
>
> - It does not matter which order the opposite corners of a cube are
> - entered in. The cube functions
> - automatically swap values if needed to create a uniform
> - lower left — upper right internal representation.
> + When corners coincide cube stores only one corner along with a
>   special flag in order to reduce size wasted.
>
> Was the old behavior removed? I don't think so - it seems to behave
> as before, so why to remove this information? Maybe it's not useful?
> But then why add the bit about optimizing storage of points?

I've edited it because the statement was misleading (or at least ambiguous) — the cube_in function doesn't swap coordinates. A simple way to see it:

> select '(1,3),(3,1)'::cube; cube --- (1, 3),(3, 1)

But the LowerLeft-UpperRight representation should be (1,1),(3,3).

Updated patch attached. cube_distances.patch Description: Binary data > On 15 Dec 2015, at 21:46, Tomas Vondra wrote: > > Hi, > > On 12/07/2015 03:47 PM, Stas Kelvich wrote: >> Hello, fixed. > > I've looked at the patch today, seems mostly fine to me. > > Three comments though: > > 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
> trailing spaces, empty lines being added/removed, etc. > > 2) one of the regression tests started to fail > > SELECT '-1e-700'::cube AS cube; > > This used to return (0) but now I get (-0). As this query existed in > 1.0, it's probably due to change in the C code. Now sure where. > > 3) I wonder why the docs were changed like this: > > > - It does not matter which order the opposite corners of a cube are > - entered in. The cube functions > - automatically swap values if needed to create a uniform > - lower left — upper right internal representation. > + When corners coincide cube stores only one corner along with a >special flag in order to reduce size wasted. > > > Was the old behavior removed? I don't think so - it seems to behave > as before, so why to remove this information? Maybe it's not useful? > But then why add the bit about optimizing storage of points? > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cube extension kNN support
> I don't think that's what the comment says, actually. It rather refers to > code like this: > >result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1)); > > i.e. if you specifically ask for a particular corner (ll, in this case), > you'll get the proper value.

Hmm, I was confused by the phrase "create a uniform _internal_ representation"; internally, cubes are actually stored "as is". But probably I just misinterpreted that. So here is the updated version with the old documentation restored.

cube_distances.patch Description: Binary data > On 16 Dec 2015, at 16:46, Tomas Vondra wrote: > > Hi, > > On 12/16/2015 01:26 PM, Stas Kelvich wrote: >> Hi, thanks for the review. >> >>> 1) (nitpicking) There seem to be some minor whitespace issues, i.e. >>> trailing spaces, empty lines being added/removed, etc. >> >> >> Fixed, I think >> >>> 2) one of the regression tests started to fail >>> >>> SELECT '-1e-700'::cube AS cube; >>> >>> This used to return (0) but now I get (-0). >> >> Actually that problem emerged because of the first problem. I had > extra whitespace in sql file and removed that whitespace from one of the > answers file (cube_1.sql), so diff with both cube.sql and cube_1.sql was > one line length and you saw diff with cube.sql. >> In all systems that available to me (osx/linux/freebsd) I saw that > right answers file is cube_1.sql. But in other OS’es you can get +/- 0 > or e27/e027. I edited that answers files manually, so there probably can > be some other typos. > > Ah! So that's why I couldn't quickly find the issue in the C code ... > >> >>> 3) I wonder why the docs were changed like this: >>> >>> >>> - It does not matter which order the opposite corners of a cube are >>> - entered in. The cube functions >>> - automatically swap values if needed to create a uniform >>> - lower left — upper right internal representation. >>> + When corners coincide cube stores only one corner along with a >>>special flag in order to reduce size wasted.
>>> >>> >>> Was the old behavior removed? I don't think so - it seems to behave >>> as before, so why to remove this information? Maybe it's not useful? >>> But then why add the bit about optimizing storage of points? >> >> I’ve edited it because the statement was mislead (or at least ambiguous) — >> cube_in function doesn’t swap coordinates. >> Simple way to see it: >>> select '(1,3),(3,1)'::cube; >> cube >> --- >> (1, 3),(3, 1) >> >> But LowerLeft-UpperRight representation should be (1,1),(3,3) > > I don't think that's what the comment says, actually. It rather refers to > code like this: > >result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1)); > > i.e. if you specifically ask for a particular corner (ll, in this case), > you'll get the proper value. > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
Hi, Tomáš! Thanks for the comprehensive review.

> On 15 Dec 2015, at 06:07, Tomas Vondra wrote: > > 1) It's a bit difficult to judge the usefulness of the API, as I've > always been a mere user of full-text search, and I never had a need > (or courage) to mess with the tsvectors. OTOH I don't see a good > reason no to have such API, when there's a need for it. > > The API seems to be reasonably complete, with one exception - when > looking at editing function we have for 'hstore', we do have these > variants for delete() > > delete(hstore,text) > delete(hstore,text[]) > delete(hstore,hstore) > > while this patch only adds delete(tsvector,text). Would it make > sense to add variants similar to hstore? It probably does not make > much sense to add delete(tsvector,tsvector), right? But being able > to delete a bunch of lexemes in one go seems like a good thing. > > What do you think?

That's a good idea, and deleting a tsvector from a tsvector actually makes perfect sense. In the delete function I used an exact string match between the string and the lexemes in the tsvector, but if somebody wants to delete, for example, “Cats” from a tsvector, then they should downcase and singularize that word first. The easiest way to do that is to just use the to_tsvector() function. We can also use this function to delete specific positions, e.g. delete('cat:3 fat:2,4'::tsvector, 'fat:2'::tsvector) -> 'cat:3 fat:4'::tsvector. So in the attached patch I've implemented the following:

delete(tsin tsvector, lexarr text[]) — removes any occurrence of the lexemes in lexarr from tsin

delete(tsin tsvector, tsv_filter tsvector) — deletes lexemes and/or positions of tsv_filter from tsin. When a lexeme in tsv_filter has no positions, the function deletes any occurrence of the same lexeme in tsin. When a tsv_filter lexeme has positions, the function deletes them from the positions of the matching lexeme in tsin. If the resulting position set is empty after such removal, the function deletes that lexeme from the resulting tsvector.

Also, if we want some level of API completeness, and taking into account that the concat() function shifts the positions of its second argument, I thought it could be useful to also add a function that shifts all positions by a specific value. This helps to undo a concatenation: delete one of the concatenated tsvectors and then shift the positions in the resulting tsvector. So I also wrote one more small function:

shift(tsin tsvector, offset int16) — shifts all positions in tsin by the given offset

> > 2) tsvector_op.c needs a bit of love, to eliminate the two warnings it > currently triggers: > >tsvector_op.c:211:2: warning: ISO C90 forbids mixed ... >tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but …

Fixed.

> 3) the patch also touches tsvector_setweight(), only to do change: > > elog(ERROR, "unrecognized weight: %d", cw); > > to > > elog(ERROR, "unrecognized weight: %c", cw); > > That should probably go get committed separately, as a bugfix.

Okay, I'll submit that as a separate patch.

> > 4) I find it rather annoying that there are pretty much no comments in > the code. Granted, there are pretty much no comments in the > surrounding code, but I doubt that's a good reason for not having > any comments in new code. It makes reviews unnecessarily difficult.

Fixed, I think.

> > 5) tsvector_concat() is not mentioned in docs at all

Concat is mentioned in the docs as the || operator.

> > 6) Docs don't mention names of the new parameters in function > signatures, just data types. The functions with a single parameter > probably don't need to do that, but multi-parameter ones should.

Fixed.

> 7) Some of the functions use intexterm that does not match the function > name. I see two such cases - to_tsvector and setweight. Is there a > reason for that?

Because the SGML compiler wants unique indexterms. Both functions that you mentioned use argument overloading and have non-unique names.
tsvector_ops.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
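The position-aware delete described in this mail can be sketched concisely. Below is a hypothetical Python model of the delete(tsin, tsv_filter) semantics only, representing a tsvector as `{lexeme: [positions]}`; the real patch implements this in C over the on-disk tsvector format.

```python
# Toy model of delete(tsin tsvector, tsv_filter tsvector):
# a filter lexeme with no positions drops the lexeme entirely; one with
# positions removes just those positions, dropping the lexeme if none remain.

def tsvector_delete(tsin, tsv_filter):
    result = {}
    for lexeme, positions in tsin.items():
        if lexeme not in tsv_filter:
            result[lexeme] = positions          # untouched by the filter
            continue
        filter_positions = tsv_filter[lexeme]
        if not filter_positions:                # no positions: drop any occurrence
            continue
        kept = [p for p in positions if p not in filter_positions]
        if kept:                                # drop lexeme if position set is empty
            result[lexeme] = kept
    return result
```

This reproduces the example from the mail: deleting 'fat:2' from 'cat:3 fat:2,4' leaves 'cat:3 fat:4'.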
Re: [HACKERS] Speedup twophase transactions
Hi. I've updated the patch and written a description of what happens with the 2PC state data at the beginning of the file. I think the patch is now well documented, but if somebody points me to places that require a more detailed description, I'm ready to extend it. 2pc_xlog.diff Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company > On 08 Jan 2016, at 19:29, Alvaro Herrera wrote: > > Simon Riggs wrote: > >> I think we could do better still, but this looks like the easiest 80% and >> actually removes code. >> >> The lack of substantial comments on the patch is a problem though - the >> details above should go in the patch. I'll have a go at reworking this for >> you, this time. > > Is someone submitting an updated patch here? > > -- > Álvaro Herrera http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Speedup twophase transactions
Thanks a lot for your edits; the patch is now much cleaner.

> Your comments say
>
> "In case of crash replay will move data from xlog to files, if that hasn't
> happened before."
>
> but I don't see that in code. Can you show me where that happens?

xact.c calls RecreateTwoPhaseFile in the xact_redo() function (xact.c:5596)

> On 09 Jan 2016, at 18:29, Simon Riggs wrote: > > Hmm, I was just preparing this for commit. > > Please have a look at my mild edits and extended comments.

One concern that came to mind while reading the updated patch is the extra bool field in the GlobalTransactionData structure. While this improves readability, it also increases the size of that structure, and that size has an impact on performance on systems with many cores (say 60-80). One byte probably won't make a measurable difference, but I think it is a good idea to keep GXact as small as possible. As far as I understand, the same logic was behind the split of PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166)

Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On 10 Jan 2016, at 12:15, Simon Riggs wrote: > > So we've only optimized half the usage? We're still going to cause > replication delays.

Yes, the replica will still go through the old procedure of moving data to and from files.

> We can either > > 1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during restartpoints

From what I've seen with the old 2PC code, the main performance bottleneck was caused by frequent file creation, so it is better to avoid files if possible.

> > 2) Copy the contents to shmem and then write them at restartpoint as we do > for checkpoint > (preferred)

The problem with shared memory is that we can't really predict the size of the state data, and anyway it isn't faster than reading the data from WAL (I tested that while preparing the original patch). We can just apply the same logic on the replica as on the master: do nothing special on prepare and just read the data from WAL. If a checkpoint occurs during recovery/replay, the existing code will probably handle moving the data to files. I will update the patch to address this issue.

> I think padding will negate the effects of the additional bool. > > If we want to reduce the size of the array GIDSIZE is currently 200, but XA > says maximum 128 bytes. > > Anybody know why that is set to 200?

Good catch about the GID size. If we talk about further optimisations, I see two ways:

1) Optimising access to GXACT. Here we can try to shrink it; introduce more granular locks, e.g. move the GIDs out of GXACT and lock the GIDs array only once while checking a new GID's uniqueness; try to lock only part of GXACT by hash; etc.

2) Be optimistic about the subsequent COMMIT PREPARED. In a normal workload the next command after PREPARE will be COMMIT/ROLLBACK, so we can save the transaction context and release it only if the next command isn't our designated COMMIT/ROLLBACK. But that is a big amount of work and requires changes to the whole transaction pipeline in postgres. Anyway, I suggest we consider that as a separate task.
--- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On 11 Jan 2016, at 21:40, Jesper Pedersen wrote: > > I have done a run with the patch and it looks really great. > > Attached is the TPS graph - with a 1pc run too - and the perf profile as a > flame graph (28C/56T w/ 256Gb mem, 2 x RAID10 SSD).

Thanks for testing, and especially for the flame graph. That is somewhere in between the cases I have tested. On a commodity server with dual Xeons (6 cores each) the 2PC speed is about 80% of the 1PC speed, but on a 60C/120T system the patch didn't make a significant difference, because there the main bottleneck changes from file access to locks on the array of running global transactions. How did you generate the names for your PREPAREs? One funny thing I've spotted is that the tx rate increased when I used an incrementing counter as the GID instead of a random string. And can you also share the flame graph for the 1PC workload?

> On 11 Jan 2016, at 21:43, Simon Riggs wrote: > > Have you measured lwlocking as a problem?

Yes. GXACT locks, which weren't even in the perf top 10 on a dual Xeon, move to the first places when running on a 60-core system. But Jesper's flame graph on a 24-core system shows a different picture.

> On 12 Jan 2016, at 01:24, Andres Freund wrote: > > Currently recovery of 2pc often already is a bigger bottleneck than the > workload on the master, because replay has to execute the fsyncs implied by > statefile re-creation serially, whereas on the master they'll usually be > executed in parallel.

That's an interesting observation. Simon had already pointed me to this problem in 2PC replay, but I didn't think it was so slow. I'm now working on that.

Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Cube extension improvement, GSoC
Hello,

Some time ago I started working on a data search system (about 100-200M records) with queries consisting of several range and equality conditions, e.g.:

WHERE dim1 BETWEEN 128 AND 137
  AND dim2 BETWEEN 4815 AND 162342
  AND dim3 = 42
ORDER BY dim1 ASC

There are 6 or 7 search criteria in my task. To avoid a full table scan I started using an R-tree over the cube data type:

CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))

For fast sorting I used Alexander Korotkov's patch for knngist (http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html), which changes the metric for nearest-neighbor search and allows obtaining cubes ordered by a specific coordinate. Having changed some data types and function/operator definitions, I ported this to 9.2 (https://github.com/kelvich/postgres/commit/bb372). After that I could select ordered records right from the index:

SELECT * FROM my_table
WHERE cube(array[dim1, dim2, dim3]) <@ cube '(128,4815,42),(137,162342,42)'
ORDER BY cube(array[dim1, dim2, dim3]) <*> 5;

The main issue with this approach is the index size. In my case it was about 100GB for 100M records, so index building becomes a very expensive disk-bound operation. For the same reason, reading a large number of records from the index is slow too.

I turned to Oleg Bartunov, Teodor Sigaev and, after a while, Alexander Korotkov for help. (I'm very thankful to them and glad that they agreed to meet, listen to me and give useful advice.) Having discussed it, we decided that there are several possible improvements for the cube extension:

* Adding point data type support to the cube extension, to avoid storing coincident upper-left and lower-right vertices; this could almost halve the space occupied by leaf nodes.
* Checking the split algorithm with big datasets and, if possible, improving it.
* Teaching the cube extension to store dimensions with different data types.
  Such an index would be a good alternative to a compound-key B-tree multicolumn index (suitable for range queries and data ordering).
* Providing support for KNN with metrics induced by different norms: euclidean, taxicab, p-norm. This can be useful in fields where we can extract a signature: similar-image search, similar-audio search.

I'd like to participate in GSoC with these improvements, and I'm very interested in any comments or suggestions about this feature list.
Re: [HACKERS] Cube extension improvement, GSoC
Hello.

>> * Learning cube extension to store dimensions with different data types.
>> Such index would be good alternative to compound key B-Tree multi-index
>> (suitable for range queries and data ordering).
>
> You mean, a cube containing something else than floats? I don't think we want to explode the number of datatypes by doing that, casting between them could be problematic.
>
> At least an option for having a float4 cube instead of a float8 cube seems reasonable to me, because of the space economy paid for by less accuracy.

The main idea was to reduce the size of the index where possible. This can be very important when we have many dimensions with a small number of distinct values.

> But I wonder if you could add cube-like operators for arrays. We already have support for arrays of any datatype, and any number of dimensions. That seems awfully lot like what the cube extension does.

All the cube machinery assumes a fixed number of dimensions, which is not very useful for arrays, where the number of elements can easily change. One can define an index on an expression over an array, i.e.

CREATE INDEX ON table USING GIST(cube(array[a,b,c]))

and cube-like operations will work, but only for a fixed-size array. And again, if the array elements are smallints, then such a cube uses 8 times more space than actually needed: four times from converting to float8, and two times from storing coincident cube bounds.

> I think we have at least 3 data types more or less similar to cube.
> 1) array of ranges
> 2) range of arrays
> 3) 2d arrays
> Semantically cube is most close to array of ranges. However arrays of ranges have huge storage overhead.
> Also we can declare cube as domain on 2d arrays and declare operations of that domain.

But what should we do when arrays in different records have different numbers of elements?

Stas Kelvich.
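The "8 times more space" figure above decomposes into two independent factors, which can be sanity-checked (a sketch; the byte widths are the on-disk sizes of the PostgreSQL types involved):

```python
# Space taken by a smallint value vs. the space it occupies inside a float8 cube.
SMALLINT_BYTES = 2   # PostgreSQL smallint
FLOAT8_BYTES = 8     # cube coordinates are stored as float8
CORNERS = 2          # a cube stores two corners, coincident for a "point"

# float8 widening costs 4x; duplicated coincident bounds cost another 2x.
per_dim_in_cube = FLOAT8_BYTES * CORNERS          # 16 bytes per dimension
overhead = per_dim_in_cube // SMALLINT_BYTES      # 16 / 2 = 8

print(overhead)  # -> 8
```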
Re: [HACKERS] Cube extension improvement, GSoC
Hi.

Thanks, Heikki, for the answer on google-melange. For some reason I didn't receive an email notification, so I only saw this answer today.

> Do you have access to a server you can use to perform those tests? (...)

Yes, I have. I maintain an MPI cluster at my university, so that is not a problem. In fact, the tests for this proposal were made on servers from that cluster. But, anyway, thanks for offering help.

There is an open question about supporting different datatypes in the cube extension. As I understand it, we have the following ideas:

* Add cube-like operators for arrays. We already have support for arrays of any datatype, and any number of dimensions. If we want to use tree-like data structures for these operators, we will run into the same problems with trees and types. And we can always cast an array to a cube and use the existing operators. Or perhaps I am misunderstanding this.

* Create support for storing cube coordinates with different data types (2-, 4-, 8-byte integers; 4-, 8-byte floats). The main goal is to reduce the index size, so in order not to break a large amount of code we can store data according to the data type size (i.e. | smallint | real | real | double | double |), and when we load data from disk or cache we can cast it to float8, so the existing code will keep working. To achieve this behavior two steps should be performed:
1) Store information about coordinate types when the index is created. A good question is where to store this data structure, but I believe it can be done.
2) Change the functions that read and write data to disk, so they can cast types to/from float8 using the information from the previous step.

* Don't add type support to cube at all. There are other ways of reducing R-tree size. For example, we can store relative coordinates with a dynamic size of MBR (VRMBR), instead of absolute coordinates with fixed-size MBRs. There is some evidence that this can significantly reduce the size:
http://link.springer.com/chapter/10.1007/11427865_13

On May 8, 2013, at 2:35 PM, Alexander Korotkov wrote:

> On Sat, May 4, 2013 at 11:19 PM, Stas Kelvich wrote:
>>> I think we have at least 3 data types more or less similar to cube.
>>> 1) array of ranges
>>> 2) range of arrays
>>> 3) 2d arrays
>>> Semantically cube is most close to array of ranges. However arrays of ranges have huge storage overhead.
>>> Also we can declare cube as domain on 2d arrays and declare operations of that domain.
>>
>> But what should we do when arrays in different records have different numbers of elements?
>
> We can be faced with absolutely the same situation with cube.
>
> test=# create table cube_test (v cube);
> CREATE TABLE
>
> test=# insert into cube_test values (cube(array[1,2])), (cube(array[1,2,3]));
> INSERT 0 2
>
> In order to force all cubes to have the same number of dimensions an explicit CHECK on the table is required.
> As I remember, cube treats absent dimensions as zeros.
>
> --
> With best regards,
> Alexander Korotkov.
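The mixed-width coordinate layout proposed above (| smallint | real | real | double | double |, widened to float8 on read) can be illustrated with Python's struct module. The format strings here are only an illustration of the idea, not the actual cube on-disk format:

```python
import struct

# Hypothetical per-point layout: one smallint, two float4s, two float8s.
mixed_fmt = "=hffdd"    # '=' disables padding, mirroring tight packing
float8_fmt = "=ddddd"   # current cube: every coordinate is a float8

packed = struct.pack(mixed_fmt, 42, 1.5, 2.5, 3.25, 4.25)
print(struct.calcsize(mixed_fmt), struct.calcsize(float8_fmt))  # 26 40

# Read path: unpack, then widen everything to float8 so the
# existing cube code keeps working unchanged.
coords = [float(v) for v in struct.unpack(mixed_fmt, packed)]
print(coords)  # [42.0, 1.5, 2.5, 3.25, 4.25]
```

For this 5-dimensional point, tight packing takes 26 bytes instead of 40, and the widening cast on load keeps all downstream float8 arithmetic unchanged.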
Re: [HACKERS] Cube extension improvement, GSoC
On May 14, 2013, at 4:53 PM, Alexander Korotkov wrote:

> Sounds promising. Did you examine how this technique can fit into GiST? In the current GiST interface, methods don't have access to parent entries.

No, I haven't examined it yet. Anyway, this technique requires a lot of changes to both the cube GiST methods and the storage methods. We can start by hacking on datatype support, which only affects the storage methods, and after that investigate VRMBR support.
Re: [HACKERS] Cube extension kNN support
Hi.

> cube and earthdistance regression tests fail.

Code updated to work with current HEAD. Also added tests to cover the new functionality.

> Do you have any benchmarks ?

This patch just introduces the functionality of calculating distances between cubes, so this code doesn't interfere much with kNN search speed. I think it's better to publish such benchmarks with the neighboring patch about the split algorithm. Anyway, we can compare kNN with a B-tree and a full scan:

create table test(a1 float, a2 float, a3 float);
insert into test (select 100*random(), 100*random(), 100*random() from generate_series(1,100) as s(a));
create index on test using gist(cube(array[a1,a2,a3]));

select * from test order by a1 limit 15;
-- 227.658 ms

select * from test order by cube(array[a1,a2,a3])->1 limit 15;
-- 1.275 ms

create index on test(a1);
select * from test order by a1 limit 15;
-- 0.103 ms

As we can see, kNN ordering is about 10 times slower than a B-tree (on a silly query for an R-tree, just as an example), but still 100+ times faster than a full scan on this table.

Stas.

On Sep 25, 2013, at 5:25 PM, Peter Eisentraut wrote:

> On 9/22/13 7:38 PM, Stas Kelvich wrote:
>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>
> cube and earthdistance regression tests fail.

distances.patch
Description: Binary data
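The three distances named in the quoted summary have simple definitions; a minimal sketch of their semantics on plain point coordinates (the function names here are illustrative, not the patch's C function names):

```python
import math

def euclidean(a, b):
    # L2 norm: straight-line distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def taxicab(a, b):
    # L1 norm: sum of per-coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # L-infinity norm: largest per-coordinate difference
    return max(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)
print(euclidean(p, q), taxicab(p, q), chebyshev(p, q))  # 5.0 7.0 4.0
```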
Re: [HACKERS] Speedup twophase transactions
> On 24 Jan 2017, at 09:42, Michael Paquier wrote:
>
> On Mon, Jan 23, 2017 at 9:00 PM, Nikhil Sontakke wrote:
>> Speeding up recovery or failover activity via a faster promote is a desirable thing. So, maybe, we should look at teaching the relevant code about using "KnownPreparedList"? I know that would increase the size of this patch and would mean more testing, but this seems to be the last remaining optimization in this code path.
>
> That's a good idea, worth having in this patch. Actually we may not want to call KnownPreparedRecreateFiles() here as promotion is not synonym of end-of-recovery checkpoint for a couple of releases now.

Thanks for the review, Nikhil and Michael.

I don't follow here. We move data away from the WAL to files on checkpoint because, after a checkpoint, there is no guarantee that the WAL segment with our prepared tx will still be available.

> The difference between those two is likely noise.
>
> By the way, in those measurements, the OS cache is still filled with the past WAL segments, which is a rather best case, no? What happens if you do the same kind of tests on a box where memory is busy doing something else and replayed WAL segments get evicted from the OS cache more aggressively once the startup process switches to a new segment? This could be tested for example on a VM with few memory (say 386MB or less) so as the startup process needs to access again the past WAL segments to recover the 2PC information it needs to get them back directly from disk... One trick that you could use here would be to tweak the startup process so as it drops the OS cache once a segment is finished replaying, and see the effects of an aggressive OS cache eviction. This patch is showing really nice improvements with the OS cache backing up the data, still it would make sense to test things with a worse test case and see if things could be done better. The startup process now only reads records sequentially, not randomly, which is a concept that this patch introduces.
>
> Anyway, perhaps this does not matter much, the non-recovery code path does the same thing as this patch, and the improvement is too much to be ignored. So for consistency's sake we could go with the approach proposed which has the advantage to not put any restriction on the size of the 2PC file contrary to what an implementation saving the contents of the 2PC files into memory would need to do.

Maybe I'm missing something, but I don't see how the OS cache can affect anything here. Total WAL size was 0x44 * 16 = 1088 MB, and recovery time is about 20s. Sequentially reading 1GB of data is an order of magnitude faster even on an old HDD, not to speak of an SSD. Also, you can take a look at the flame graphs attached to the previous message: the majority of recovery time is spent in pg_qsort while replaying PageRepairFragmentation, while the whole xact_redo_commit() takes about 1% of the time. That share can grow in the case of uncached disk reads, but taking total recovery time into account, it should not matter much.

If you are talking about uncached access only during checkpoint, then we are restricted by max_prepared_transactions, so at most we will read about a hundred small files (usually fitting into one filesystem page), which will also be barely noticeable compared to the recovery time between checkpoints.

Also, WAL segment cache eviction during replay doesn't seem like a standard scenario to me. Anyway, I took a machine with an HDD to slow down read speed and ran the tests again. During one of the runs I launched in parallel a bash loop that dropped the OS cache each second (one WAL segment replay also takes about one second).
1.5M transactions
start segment: 0x06
last segment: 0x47

patched, with constant cache_drop:
  total recovery time: 86s

patched, without constant cache_drop:
  total recovery time: 68s

(while the difference is significant, I bet that happens mostly because database file segments have to be re-read after each cache drop)

master, without constant cache_drop:
  time to recover 35 segments: 2h 25m (after that I got tired of waiting)
  expected total recovery time: ~4.5 hours

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
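The "expected total recovery time" above follows from linear extrapolation over the reported segment range; a back-of-the-envelope check:

```python
# Segments to replay: 0x06 through 0x47 inclusive.
total_segments = 0x47 - 0x06 + 1          # 66 segments
measured_segments = 35
measured_minutes = 2 * 60 + 25            # 2h 25m on master

expected_minutes = measured_minutes * total_segments / measured_segments
print(round(expected_minutes / 60, 1))    # -> 4.6, i.e. about 4.5 hours
```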
Re: [HACKERS] logical decoding of two-phase transactions
>> Yes, that's also possible, but it seems to be less flexible, restricting us to some specific GID format.
>>
>> Anyway, I can measure the WAL space overhead introduced by the GIDs inside commit records to know exactly what the cost of such an approach will be.
>
> Stas,
>
> Have you had a chance to look at this further?

Generally I'm okay with Simon's approach and will send an updated patch. Anyway, I want to perform some tests to estimate how much disk space is actually wasted by the extra WAL records.

> I think the approach of storing just the xid and fetching the GID during logical decoding of the PREPARE TRANSACTION is probably the best way forward, per my prior mail.

I don't think that's possible this way. If we do not put the GID in the commit record, then by the time logical decoding happens the transaction will already be committed/aborted and there will be no easy way to get that GID. I thought about several possibilities:

* Tracking an xid/gid map in memory doesn't help much: if the server reboots between prepare and commit, we'll lose that mapping.
* We can provide some hooks on prepared tx recovery during startup, but that approach also fails if the reboot happens between the commit and the decoding of that commit.
* Logical messages are WAL-logged, but they don't have any redo function, so they don't help much.

So to support user-accessible 2PC over replication based on 2PC decoding, we would have to invent something nastier, like writing them into a table.

> That should eliminate Simon's objection re the cost of tracking GIDs and still let us have access to them when we want them, which is the best of both worlds really.

Having 2PC decoding in core is a good thing anyway, even without GID tracking =)

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Speedup twophase transactions
> On 26 Jan 2017, at 10:34, Michael Paquier wrote:
>
> On Thu, Jan 26, 2017 at 4:09 PM, Nikhil Sontakke wrote:
>>> I look at this patch from you and that's present for me:
>>> https://www.postgresql.org/message-id/CAMGcDxf8Bn9ZPBBJZba9wiyQq->Qk5uqq=vjomnrnw5s+fks...@mail.gmail.com
>>>
>>> --- a/src/backend/access/transam/xlog.c
>>> +++ b/src/backend/access/transam/xlog.c
>>> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
>>>                 (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
>>>                         checkPoint.ThisTimeLineID, ThisTimeLineID)));
>>>
>>> +        KnownPreparedRecreateFiles(checkPoint.redo);
>>>         RecoveryRestartPoint(&checkPoint);
>>> }
>>
>> Oh, sorry. I was asking about CheckpointTwoPhase(). I don't see a function by this name. And now I see, the name is CheckPointTwoPhase() :-)
>
> My mistake then :D
>
>>> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC files are not flushed to disk with this patch. This is a problem as a new restart point is created... Having the flush in CheckPointTwoPhase really makes the most sense.
>>
>> Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby promote" code path.
>
> CreateRestartPoint() calls it via CheckPointGuts() while in recovery.

Huh, glad that this thread received a lot of attention.

> On 24 Jan 2017, at 17:26, Nikhil Sontakke wrote:
>
> We are talking about the recovery/promote code path. Specifically this call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().
>
> We write the files to disk and they get immediately read up in the following code. We could not write the files to disk and read KnownPreparedList in the code path that follows as well as elsewhere.

Thanks Nikhil, now I get it. Since we are talking about promotion, we are on a different timescale, and a 1-10 second lag matters a lot.

I have in mind a realistic scenario where the proposed recovery code path will hit the worst case: Google Cloud. They have quite fast storage, but fsync time is really big and can go up to 10-100ms (I suppose the storage is network-attached). With, say, 300 prepared transactions, we could delay promotion by up to half a minute. So I think it is worth examining.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
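The half-minute estimate is just the worst-case fsync latency multiplied by the number of prepared transactions (using the figures assumed in the discussion):

```python
prepared_tx = 300
worst_fsync_ms = 100   # upper end of the quoted 10-100ms fsync range

# Serial statefile fsyncs during promotion
delay_seconds = prepared_tx * worst_fsync_ms / 1000
print(delay_seconds)   # -> 30.0, i.e. half a minute
```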
Re: [HACKERS] logical decoding of two-phase transactions
> On 26 Jan 2017, at 12:51, Craig Ringer wrote:
>
>> * Tracking an xid/gid map in memory doesn't help much: if the server reboots between prepare and commit, we'll lose that mapping.
>
> Er what? That's why I suggested using the prepared xacts shmem state. It's persistent as you know from your work on prepared transaction files. It has all the required info.

Imagine the following scenario:

1. PREPARE happens
2. PREPARE is decoded and sent where it should be sent
3. We get all responses from participating nodes and issue COMMIT/ABORT
4. COMMIT/ABORT is decoded and sent

After step 3 there is no more memory state associated with that prepared tx, so if we fail between steps 3 and 4, then we cannot know the GID unless we wrote it to the commit record (or a table).

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
> On 31 Jan 2017, at 12:22, Craig Ringer wrote:
>
> Personally I don't think lack of access to the GID justifies blocking 2PC logical decoding. It can be added separately. But it'd be nice to have, especially if it's cheap.

Agreed.

> On 2 Feb 2017, at 00:35, Craig Ringer wrote:
>
> Stas was concerned about what happens in logical decoding if we crash between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back and decode the whole txn again anyway so it doesn't matter.

Not exactly. It seems that in previous discussions we were not on the same page, probably due to unclear arguments on my part.

From my point of view there are no problems (or at least no new problems compared to ordinary 2PC) with preparing transactions on slave servers with something like "#{xid}#{node_id}" instead of a GID, if the issuing node is the coordinator of that transaction. In case of failure, restart, or crash we have the same options for deciding what to do with uncommitted transactions.

My concern is about the situation with an external coordinator. That scenario is quite important for users of Postgres native 2PC, notably J2EE users. Suppose a user (or his framework) issues "prepare transaction 'mytxname';" to servers with ordinary synchronous physical replication. If the master crashes and a replica is promoted, the user can reconnect to it and commit/abort that transaction using his GID. It is unclear to me how to achieve the same behaviour with logical replication of 2PC without the GID in the commit record. If we prepare with "#{xid}#{node_id}" on acceptor nodes, then if the donor node crashes we lose the mapping between the user's GID and our internal GID; conversely, we can prepare with the user's GID on acceptors, but then we will not know that GID on the donor during commit decoding (by the time decoding happens, all memory state is already gone and we can't exchange our xid for the GID).

I performed some tests to understand the real impact on WAL size.
I've compared postgres master with wal_level = logical, after 3M 2PC transactions, against patched postgres where GIDs are stored inside commit records too. Tested with 194-byte and 6-byte GIDs (the GID max size is 200 bytes).

-master, 6-byte GIDs, after 3M transactions:
  pg_current_xlog_location = 0/9572CB28
-patched, 6-byte GIDs, after 3M transactions:
  pg_current_xlog_location = 0/96C442E0

So with 6-byte GIDs the difference in WAL size is less than 1%.

-master, 194-byte GIDs, after 3M transactions:
  pg_current_xlog_location = 0/B7501578
-patched, 194-byte GIDs, after 3M transactions:
  pg_current_xlog_location = 0/D8B43E28

And with 194-byte GIDs the difference in WAL size is about 18%.

So using big GIDs (as J2EE does) can cause notable WAL bloat, while small GIDs add almost unnoticeable overhead. Maybe we can introduce a configuration option track_commit_gid, by analogy with track_commit_timestamp, and make that behaviour optional? Any objections to that?

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
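The percentages can be reproduced from the reported pg_current_xlog_location values; an LSN of the form X/Y denotes (X << 32) + Y bytes:

```python
def lsn_bytes(lsn):
    # Parse a PostgreSQL LSN string like "0/9572CB28" into a byte offset.
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def overhead(master, patched):
    # Relative WAL growth of the patched run over master.
    m, p = lsn_bytes(master), lsn_bytes(patched)
    return (p - m) / m

small = overhead("0/9572CB28", "0/96C442E0")   # 6-byte GIDs
big = overhead("0/B7501578", "0/D8B43E28")     # 194-byte GIDs
print(f"{small:.1%} {big:.1%}")  # -> 0.9% 18.2%
```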