Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
> - NetBSD: crashes under load; this could have been fixed but when I ran the benchmarks in 2012 none of the developers seemed to care.

do you mean this? https://mail-index.netbsd.org/tech-kern/2012/08/29/msg013918.html

YAMAMOTO Takashi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Securing make check (CVE-2014-0067)
> On Fri, Apr 04, 2014 at 02:36:05AM +, YAMAMOTO Takashi wrote:
>>> Thanks. To avoid socket path length limitations, I lean toward placing the socket temporary directory under /tmp rather than placing it under the CWD: http://www.postgresql.org/message-id/flat/20121129223632.ga15...@tornado.leadboat.com
>> openvswitch has some tricks to overcome the socket path length limitation using symlinks (or procfs where available). iirc these were introduced for debian builds, which use a deep CWD.
> That's another reasonable approach. Does it have a notable advantage over placing the socket in a subdirectory of /tmp? Offhand, the security and compatibility consequences look similar.

an advantage is that the socket can be placed under the CWD and thus automatically obeys its directory permissions etc.

YAMAMOTO Takashi

> --
> Noah Misch
> EnterpriseDB  http://www.enterprisedb.com
Re: [HACKERS] Securing make check (CVE-2014-0067)
> Thanks. To avoid socket path length limitations, I lean toward placing the socket temporary directory under /tmp rather than placing it under the CWD: http://www.postgresql.org/message-id/flat/20121129223632.ga15...@tornado.leadboat.com

openvswitch has some tricks to overcome the socket path length limitation using symlinks (or procfs where available). iirc these were introduced for debian builds, which use a deep CWD.

http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=lib/socket-util.c;h=aa0c7196da9926de38b7388b8e28ead12e12913e;hb=HEAD

look at shorten_name_via_symlink and shorten_name_via_proc.

YAMAMOTO Takashi
Re: [HACKERS] SYSV shared memory vs mmap performance
hi,

> Hi,
> On Mon, Jan 28, 2013 at 03:27:28PM +, YAMAMOTO Takashi wrote:
>> can you give me a pointer?
> This bug report for a start: http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=46833

thanks!

YAMAMOTO Takashi

> This is the only one I've filed; I also remember having irc discussions with some netbsd developers about the system getting unresponsive under load.
> --
> Francois Tigeot
Re: [HACKERS] SYSV shared memory vs mmap performance
hi,

> I'm less optimistic on the NetBSD front: even though I reported major show-stopper bugs (system died under load and was unable to complete a pgbench run), no one seemed to care.

can you give me a pointer?

YAMAMOTO Takashi
Re: [HACKERS] huge tlb support
> Robert Haas robertmh...@gmail.com writes:
>> On Fri, Jun 29, 2012 at 3:52 PM, Andres Freund and...@2ndquadrant.com wrote:
>>> In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed memory.
>> So, considering that there is required setup, it seems that the obvious thing to do here is add a GUC: huge_tlb_pages (boolean). The other alternative is to try with MAP_HUGETLB and, if it fails, try again without MAP_HUGETLB.
> +1 for not making people configure this manually. Also, I was under the impression that recent Linux kernels use hugepages automatically if they can, so I wonder exactly what Andres was testing on ...

if you mean the transparent hugepage feature, iirc it doesn't affect MAP_SHARED mappings like this.

YAMAMOTO Takashi

> regards, tom lane
Re: [HACKERS] huge tlb support
> y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
>>> Also, I was under the impression that recent Linux kernels use hugepages automatically if they can, so I wonder exactly what Andres was testing on ...
>> if you mean the transparent hugepage feature, iirc it doesn't affect MAP_SHARED mappings like this.
> Oh! That would explain some things. It seems like a pretty nasty restriction though ... do you know why they did that?

i don't know. simply because it wasn't trivial, i guess. the feature was implemented for kvm's guest memory, which is non-shared anonymous memory from the POV of the host kernel.

YAMAMOTO Takashi

> regards, tom lane
Re: [HACKERS] gistVacuumUpdate
hi,

> On 13.01.2012 06:24, YAMAMOTO Takashi wrote:
>> hi, gistVacuumUpdate was removed when old-style VACUUM FULL was removed. i wonder why. was it not practical, and REINDEX is preferred? anyway, the removal seems incomplete and there are some leftovers: F_TUPLES_DELETED, F_DELETED, XLOG_GIST_PAGE_DELETE
> Hmm, in theory we might bring back support for deleting pages in the future; I'm guessing F_DELETED and the WAL record type were left in place because of that. Either that, or it was an oversight. It's also good to have F_DELETED/F_TUPLES_DELETED around, so that new versions don't get confused if they see those set in GiST indexes that originate from an old cluster, upgraded to a new version with pg_upgrade. For that purpose, a comment explaining what those used to be would've been enough, though.

the loop in gistvacuumcleanup to search for F_DELETED pages seems too expensive for the pg_upgrade purpose. (while it also checks PageIsNew, is that alone worth the loop?) i'm wondering because what gistVacuumUpdate used to do does not seem to be necessarily tied to the old-style VACUUM FULL. currently, no one will re-union keys after tuple removals, right?

YAMAMOTO Takashi

> --
> Heikki Linnakangas
> EnterpriseDB  http://www.enterprisedb.com
Re: [HACKERS] VACUUM in SP-GiST
hi,

> I started reading the spgist vacuum code last night, and didn't like it at all. I found a number of smaller issues, but it seems to me that the design is just fundamentally wrong. Unless I'm misunderstanding it, the approach is to recursively traverse the tree in sort of the same way that a full-index search would do, except that it continues to hold lock on the immediate parent page of whichever page is currently being visited, so that it can update the downlink if it has to delete the first leaf tuple of a chain. I've got several complaints about that:
>
> 1. Holding exclusive locks on upper pages while we visit possibly-hundreds of leaf pages seems pretty bad from a concurrency standpoint. It doesn't hold locks all the way up to the root, so it's not a complete loss, but even a first-level inner page can have a lot of children.
>
> 2. I do not trust this code to visit every leaf page (which, if it failed to do so and thereby missed deleting a heap reference, would result in a silently corrupt index). Unlike a regular index search, which makes a list of lower-level targets while examining an inner tuple just once, this code depends on the assumption that it can reacquire lock on an upper page and then resynchronize itself with whatever's been done meanwhile to the inner tuple it was looking at. I don't trust that. The mechanism that's being used relies on the page LSN having been changed if anything was changed on the page, which does not work in unlogged tables. Furthermore, if the inner tuple did change, it has to rescan *everything* that the inner tuple leads to. That's horrendously inefficient, and it's not even clear that it would ever terminate in the face of a constant stream of concurrent updates.
>
> 3. Even if it all worked, it would be slow as can be, because it requires a whole lot of nonsequential accesses to the index. For instance, the same leaf page might be visited many times (once for each leaf chain on the page), and not necessarily close together either.
> Also, the code doesn't seem to perform any cleanup at all on inner pages. I am not expecting it to try to get rid of childless inner tuples, but surely it ought to at least convert old redirects to dead and get rid of dead tuples at the end of the page, much as for leaf pages?
>
> What I'm envisioning doing instead is making a single sequential scan over the entire index. On both leaf and inner pages we could clean redirects and remove end dead tuples. On leaf pages we'd also check for tuples to be deleted. There's no real difficulty in removing deleted tuples as long as they are not at the heads of their chains. (The code would have to reverse-engineer which tuples are at the heads of their chains, but that's easily done while scanning the page; we just make a reverse-lookup map for the nextOffset pointers.) If we have to delete a tuple at the head of its chain, but there's still at least one live tuple in its chain, we could reshuffle the tuples to bring another live tuple to that same offset number, thereby not invalidating the parent downlink.
>
> The only hard part is what to do when we delete all members of a chain: we can't reset the parent downlink because we do not know where it is. What I propose to do in that case is install a variant form of dead tuple that just indicates this chain is empty. One way to represent that would be a redirect tuple pointing nowhere, but I think it would be cleaner to repurpose the isDead and isRedirect bits as a two-bit field with four states. We'd have LIVE, DEAD, REDIRECT, and this fourth state for a dead tuple that is not recyclable because we know there's a link to it somewhere. A scan would ignore such a tuple. An insertion that arrives at such a tuple could either replace it (if there's enough room on the page) or convert it to a redirect (if not).
> Last night I was thinking the fourth state could be named TOMBSTONE, but maybe it'd be better to use DEAD for this case and rename the existing removable dead tuple state to PLACEHOLDER, since that case only exists to preserve the offset numbers of other tuples on the page. Comments?

with a concurrent split moving leaf tuples between pages, can't it make bulkdelete fail to report some TIDs, and make CREATE INDEX CONCURRENTLY create duplicates?

YAMAMOTO Takashi

> regards, tom lane
[HACKERS] gistVacuumUpdate
hi,

gistVacuumUpdate was removed when old-style VACUUM FULL was removed. i wonder why. was it not practical, and REINDEX is preferred? anyway, the removal seems incomplete and there are some leftovers: F_TUPLES_DELETED, F_DELETED, XLOG_GIST_PAGE_DELETE

YAMAMOTO Takashi
[HACKERS] ordering op for WHERE
hi,

does it make sense to teach the planner (and the executor?) to use an ordering op to optimize queries like the following?

select * from t where a <-> 1000 < 10

YAMAMOTO Takashi
Re: [HACKERS] synchronous commit vs. hint bits
hi,

> On Wed, Nov 30, 2011 at 1:37 AM, YAMAMOTO Takashi y...@mwd.biglobe.ne.jp wrote:
>>> Yes, I would expect that. What kind of increase are you seeing? Is it causing a problem for you, or are you just making an observation?
>> i was curious because my application uses async commits mainly to avoid frequent fsync. i have no numbers right now.
> Oh, that's interesting. Why do you want to avoid frequent fsyncs?

simply because it was expensive in my environment.

> I thought the point of synchronous_commit=off was to move the fsyncs to the background, but not necessarily to decrease the frequency.

it makes sense. but it's normal for users to abuse features. :)

YAMAMOTO Takashi

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: [HACKERS] synchronous commit vs. hint bits
hi,

> On Tue, Nov 29, 2011 at 1:42 AM, YAMAMOTO Takashi y...@mwd.biglobe.ne.jp wrote:
>>>> On Mon, Nov 7, 2011 at 5:26 PM, Simon Riggs si...@2ndquadrant.com wrote:
>>>>> 5. Make the WAL writer more responsive, maybe using latches, so that it doesn't take as long for the commit record to make it out to disk.
>>>> I'm working on this already as part of the update for power reduction/group commit/replication performance. I extracted this from my current patch for you to test.
>> is it expected to produce more frequent fsync?
> Yes, I would expect that. What kind of increase are you seeing? Is it causing a problem for you, or are you just making an observation?

i was curious because my application uses async commits mainly to avoid frequent fsync. i have no numbers right now.

YAMAMOTO Takashi

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: [HACKERS] reduce null bitmap size
hi,

> y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
>> how about making heap_form_tuple and its variants use smaller natts for tuples whose trailing columns are NULL?
> This idea has been proposed before, and rejected on the basis that it's unlikely to save enough to be worth the cycles needed to check for the case. Keep in mind that you aren't going to save anything at all unless the bitmap length decreases across a MAXALIGN boundary.

but it also means that, if a one-bit decrease happens to cross the boundary, you can save a MAXALIGN, right?

btw, while playing with this, i noticed that pg_column_size+rowtype+aggregate produces unstable results. i guess it's because when nodeAgg handles grp_firstTuple, the tuple is converted to a short varlena. i don't know if it can be a real problem or just weird-looking.

YAMAMOTO Takashi

test=# create table t (i int);
CREATE TABLE
test=# insert into t values (1),(2),(3);
INSERT 0 3
test=# select pg_column_size(t.*) from t;
 pg_column_size
----------------
             28
             28
             28
(3 rows)

test=# select string_agg(pg_column_size(t.*)::text,',') from t;
 string_agg
------------
 25,28,28
(1 row)

> regards, tom lane
[HACKERS] reduce null bitmap size
hi,

how about making heap_form_tuple and its variants use smaller natts for tuples whose trailing columns are NULL? depending on the schema, it can save considerable space. most of the code is ready to deal with such tuples for ALTER TABLE ADD COLUMN anyway. (except system catalogs?)

YAMAMOTO Takashi
Re: [HACKERS] ProcArrayLock contention
hi,

> I've been playing with the attached patch, which adds an additional light-weight lock mode, LW_SHARED2. LW_SHARED2 conflicts with LW_SHARED and LW_EXCLUSIVE, but not with itself. The patch changes ProcArrayEndTransaction() to use this new mode. IOW, multiple processes can commit at the same time, and multiple processes can take snapshots at the same time, but nobody can take a snapshot while someone else is committing.
>
> Needless to say, I don't think we'd really want to apply this, because adding a LW_SHARED2 mode that's probably only useful for ProcArrayLock would be a pretty ugly wart. But the results are interesting. pgbench, scale factor 100, unlogged tables, Nate Boley's 32-core AMD box, shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms, results are median of three five-minute runs:
>
> #clients  tps(master)    tps(lwshared2)
>  1          657.984859      683.251582
>  8         4748.906750     4946.069238
> 32        10695.160555    17530.390578
> 80         7727.563437    16099.549506
>
> That's a pretty impressive speedup, but there's trouble in paradise. With 80 clients (but not 32 or fewer), I occasionally get the following error:
>
> ERROR: t_xmin is uncommitted in tuple to be updated
>
> So it seems that there's some way in which this locking is actually incorrect, though I'm not seeing what it is at the moment. Either that, or there's some bug in the existing code that happens to be exposed by this change. The patch also produces a (much smaller) speedup with regular tables, but it's hard to know how seriously to take that until the locking issue is debugged. Any ideas?

latestCompletedXid got backward due to concurrent updates and it fooled TransactionIdIsInProgress?

YAMAMOTO Takashi

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: [HACKERS] tab stop in README
> On Sun, Aug 28, 2011 at 8:28 PM, YAMAMOTO Takashi y...@mwd.biglobe.ne.jp wrote:
>>>> On men, 2011-08-22 at 04:09 +, YAMAMOTO Takashi wrote:
>>>>> i know that postgresql uses ts=4 for C source code. but how about documentation?
>>>> I'd say ideally don't use any tabs at all.
>>> i agree.
>>>> It appears to be geared for ts=4. Could you send a patch or other indication for what you think needs changing?
>>> attached.
> I'm confused by this patch, because it doesn't seem to get rid of all the tabs in the file. Nor does it seem to replace tabs with spaces. It looks like it's just randomly removing and adding tabs in various places.

the patch just fixes the indentation for ts=4, keeping tabs. should i run expand -t4 and send the result?

YAMAMOTO Takashi

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: [HACKERS] tab stop in README
hi,

> On men, 2011-08-22 at 04:09 +, YAMAMOTO Takashi wrote:
>> i know that postgresql uses ts=4 for C source code. but how about documentation?
> I'd say ideally don't use any tabs at all.

i agree.

>> src/backend/access/transam/README seems to have both ts=4 and ts=8 mixed.
> It appears to be geared for ts=4. Could you send a patch or other indication for what you think needs changing?

attached.

YAMAMOTO Takashi

(Note: the patch below only changes whitespace; the original tab characters were lost in the archive, so the indentation shown is approximate.)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eaac139..e866d9e 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -61,23 +61,23 @@ sequence:
     /  StartTransactionCommand;
     /  StartTransaction;
-1)     ProcessUtility;         BEGIN
+1)  ProcessUtility;            BEGIN
     \  BeginTransactionBlock;
     \  CommitTransactionCommand;
     /  StartTransactionCommand;
-2)     /  ProcessQuery;        SELECT ...
-       \  CommitTransactionCommand;
+2)  /  ProcessQuery;           SELECT ...
+    \  CommitTransactionCommand;
     \  CommandCounterIncrement;
     /  StartTransactionCommand;
-3)     /  ProcessQuery;        INSERT ...
-       \  CommitTransactionCommand;
+3)  /  ProcessQuery;           INSERT ...
+    \  CommitTransactionCommand;
     \  CommandCounterIncrement;
     /  StartTransactionCommand;
     /  ProcessUtility;         COMMIT
-4)     EndTransactionBlock;
+4)  EndTransactionBlock;
     \  CommitTransactionCommand;
     \  CommitTransaction;
@@ -100,9 +100,9 @@ Transaction aborts can occur in two ways:
 The reason we have to distinguish them is illustrated by the following two
 situations:
-       case 1                          case 2
-       ------                          ------
-1) user types BEGIN            1) user types BEGIN
+    case 1                     case 2
+    ------                     ------
+1) user types BEGIN            1) user types BEGIN
 2) user does something         2) user does something
 3) user does not like what     3) system aborts for some reason
    she sees and types ABORT       (syntax error, etc)
[HACKERS] tab stop in README
hi,

i know that postgresql uses ts=4 for C source code. but how about documentation? src/backend/access/transam/README seems to have both ts=4 and ts=8 mixed.

YAMAMOTO Takashi
Re: [HACKERS] Large Objects versus transactional behavior
hi,

> On Sat, Apr 30, 2011 at 2:58 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
>> This is related to the SIREAD lock versus ACCESS EXCLUSIVE lock thread, but seemed different enough to merit spinning off a new thread. Our shop hasn't used large objects so far because of the lack of security (until 9.1), so I never noticed the rather unusual transactional semantics of large objects. From the devel documentation: http://developer.postgresql.org/pgdocs/postgres/lo-interfaces.html#LO-OPEN
>>
>> | [...] with INV_READ you cannot write on the descriptor, and the
>> | data read from it will reflect the contents of the large object at
>> | the time of the transaction snapshot that was active when lo_open
>> | was executed, regardless of later writes by this or other
>> | transactions. Reading from a descriptor opened with INV_WRITE
>> | returns data that reflects all writes of other committed
>> | transactions as well as writes of the current transaction. This is
>> | similar to the behavior of REPEATABLE READ versus READ COMMITTED
>> | transaction modes for ordinary SQL SELECT commands.

as a novice user who has been annoyed by them, i'm curious about the rationale for the unusual semantics. is there any chance of just making large objects obey the normal semantics in the future?

YAMAMOTO Takashi
Re: [HACKERS] SSI bug?
hi,

>> hi,
> I think I see what is going on now. We are sometimes failing to set the commitSeqNo correctly on the lock. In particular, if a lock assigned to OldCommittedSxact is marked with InvalidSerCommitSeqNo, it will never be cleared.
>
> The attached patch corrects this: TransferPredicateLocksToNewTarget should initialize a new lock entry's commitSeqNo to that of the old one being transferred, or take the minimum commitSeqNo if it is merging two lock entries. Also, CreatePredicateLock should initialize commitSeqNo to InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would actually affect anything, but we should be consistent.) I also added a couple of assertions I used to track this down: a lock's commitSeqNo should never be zero, and it should be InvalidSerCommitSeqNo if and only if the lock is not held by OldCommittedSxact.
>
> Takashi, does this patch fix your problem with leaked SIReadLocks?

>> i'm currently running bf6848bc8c82e82f857d48185554bc3e6dcf1013 with this patch applied. i haven't seen the symptom yet. i'll keep it running for a while.

i haven't seen the symptom since then. so i guess it was fixed by your patch. thanks!

YAMAMOTO Takashi

>> btw, i've noticed the following message in the server log. is it normal?
>>
>> LOG: could not truncate directory pg_serial: apparent wraparound

> Dan
> --
> Dan R. K. Ports  MIT CSAIL  http://drkp.net/
Re: [HACKERS] SSI bug?
hi,

> YAMAMOTO Takashi y...@mwd.biglobe.ne.jp wrote:
>> LOG: could not truncate directory pg_serial: apparent wraparound
> Did you get a warning with this text?: memory for serializable conflict tracking is nearly exhausted

there is no such warning near the above apparent wraparound record. not sure if there was one far before the record, as i've lost the older log files.

YAMAMOTO Takashi

> If not, there's some sort of cleanup bug to fix in the predicate locking's use of SLRU. It may be benign, but we won't really know until we find it. I'm investigating.
>
> -Kevin
Re: [HACKERS] SSI bug?
hi,

> I think I see what is going on now. We are sometimes failing to set the commitSeqNo correctly on the lock. In particular, if a lock assigned to OldCommittedSxact is marked with InvalidSerCommitSeqNo, it will never be cleared.
>
> The attached patch corrects this: TransferPredicateLocksToNewTarget should initialize a new lock entry's commitSeqNo to that of the old one being transferred, or take the minimum commitSeqNo if it is merging two lock entries. Also, CreatePredicateLock should initialize commitSeqNo to InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would actually affect anything, but we should be consistent.) I also added a couple of assertions I used to track this down: a lock's commitSeqNo should never be zero, and it should be InvalidSerCommitSeqNo if and only if the lock is not held by OldCommittedSxact.
>
> Takashi, does this patch fix your problem with leaked SIReadLocks?

i'm currently running bf6848bc8c82e82f857d48185554bc3e6dcf1013 with this patch applied. i haven't seen the symptom yet. i'll keep it running for a while.

btw, i've noticed the following message in the server log. is it normal?

LOG: could not truncate directory pg_serial: apparent wraparound

YAMAMOTO Takashi

> Dan
> --
> Dan R. K. Ports  MIT CSAIL  http://drkp.net/
Re: [HACKERS] SSI bug?
hi,

> [no residual SIReadLock]

i read it as there are many (7057) SIReadLocks somehow leaked. am i wrong?

YAMAMOTO Takashi
Re: [HACKERS] SSI bug?
hi,

>>> (6) Does the application continue to run relatively sanely, or does it fall over at this point?
>> my application just exits on the error. if i re-run the application without rebooting postgres, it seems that i will get the error sooner than on the first run. (but it might be just a matter of luck)
> If your application hits this again, could you check pg_stat_activity and pg_locks and see if any SIReadLock entries are lingering after the owning transaction and all overlapping transactions are completed? If anything is lingering between runs of your application, it *should* show up in one or the other of these.

this is 71ac48fd9cebd3d2a873635a04df64096c981f73 with your two patches. this psql session was the only activity on the server at this point.

hoge=# select * from pg_stat_activity;
-[ RECORD 1 ]----+------------------------------
datid            | 16384
datname          | hoge
procpid          | 7336
usesysid         | 10
usename          | takashi
application_name | psql
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2011-03-26 12:28:21.882226+09
xact_start       | 2011-03-28 11:55:19.300027+09
query_start      | 2011-03-28 11:55:19.300027+09
waiting          | f
current_query    | select * from pg_stat_activity;

hoge=# select count(*) from pg_locks where mode='SIReadLock';
-[ RECORD 1 ]
count | 7057

hoge=# select locktype,count(*) from pg_locks group by locktype;
-[ RECORD 1 ]---------
locktype | virtualxid
count    | 1
-[ RECORD 2 ]---------
locktype | relation
count    | 1
-[ RECORD 3 ]---------
locktype | tuple
count    | 7061

>>> (7) The message hint would help pin it down, or a stack trace at the point of the error would help more. Is it possible to get either? Looking over the code, it appears that the only places that SSI could generate that error, it would cancel that transaction with the hint You might need to increase max_pred_locks_per_transaction. and otherwise allow normal processing.
>> no message hints. these errors are not generated by SSI code, at least directly.
> Right, that's because we were using HASH_ENTER instead of HASH_ENTER_NULL. I've posted a patch which should correct that.

sure, with your patch it seems that they turned into different ones.

PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc

PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_MESSAGE_HINT: You might need to increase max_pred_locks_per_transaction.
PG_DIAG_SOURCE_FILE: predicate.c
PG_DIAG_SOURCE_LINE: 2049
PG_DIAG_SOURCE_FUNCTION: CreatePredicateLock

> Even with the above information it may be far from clear where allocations are going past their maximum, since one HTAB could grab more than its share and starve another which is staying below its maximum. I'll take a look at the possibility of adding a warning or some such when an HTAB expands past its maximum size. I see from your later post that you are running with this patch. Has that reported anything yet?

i got nothing except the following one. (in the server log)

WARNING: hash table ShmemIndex has more entries than expected
DETAIL: The maximum was set to 32 on creation.

YAMAMOTO Takashi

> Thanks,
> -Kevin
Re: [HACKERS] SSI bug?
hi, YAMAMOTO Takashi y...@mwd.biglobe.ne.jp wrote: thanks for quickly fixing problems. Thanks for the rigorous testing. :-) i tested the later version (a2eb9e0c08ee73208b5419f5a53a6eba55809b92) and only errors i got was out of shared memory. i'm not sure if it was caused by SSI activities or not. PG_DIAG_SEVERITY: WARNING PG_DIAG_SQLSTATE: 53200 PG_DIAG_MESSAGE_PRIMARY: out of shared memory PG_DIAG_SOURCE_FILE: shmem.c PG_DIAG_SOURCE_LINE: 190 PG_DIAG_SOURCE_FUNCTION: ShmemAlloc PG_DIAG_SEVERITY: ERROR PG_DIAG_SQLSTATE: 53200 PG_DIAG_MESSAGE_PRIMARY: out of shared memory PG_DIAG_SOURCE_FILE: dynahash.c PG_DIAG_SOURCE_LINE: 925 PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value Nor am I. Some additional information would help. (1) Could you post the non-default configuration settings? none. it can happen with just initdb+createdb'ed database. (2) How many connections are in use in your testing? 4. (3) Can you give a rough categorization of how many of what types of transactions are in the mix? all transactions are SERIALIZABLE. no transactions are with READ ONLY. (but some of them are actually select-only.) (4) Are there any long-running transactions? no. (5) How many of these errors do you get in what amount of time? once it start happening, i see them somehow frequently. (6) Does the application continue to run relatively sanely, or does it fall over at this point? my application just exits on the error. if i re-run the application without rebooting postgres, it seems that i will get the error sooner than the first run. (but it might be just a matter of luck) (7) The message hint would help pin it down, or a stack trace at the point of the error would help more. Is it possible to get either? Looking over the code, it appears that the only places that SSI could generate that error, it would cancel that transaction with the hint You might need to increase max_pred_locks_per_transaction. and otherwise allow normal processing. no message hints. 
> these errors are not generated by SSI code, at least directly.
> (please look at PG_DIAG_SOURCE_xxx in the above log.)
>
> YAMAMOTO Takashi

Even with the above information it may be far from clear where
allocations are going past their maximum, since one HTAB could grab
more than its share and starve another which is staying below its
maximum. I'll take a look at the possibility of adding a warning or
some such when an HTAB expands past its maximum size.

-Kevin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
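For readers hitting the hint Kevin mentions ("You might need to increase
max_pred_locks_per_transaction."), that setting lives in postgresql.conf.
A minimal illustration follows; the value 128 is an arbitrary example,
not a recommendation from this thread:

```ini
# postgresql.conf -- capacity for SSI predicate locks.  Shared memory is
# sized from this setting at server start, so changing it requires a
# server restart.
max_pred_locks_per_transaction = 128    # default is 64
```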
Re: [HACKERS] SSI bug?
hi,

thanks for quickly fixing problems.

i tested the later version (a2eb9e0c08ee73208b5419f5a53a6eba55809b92) and
the only errors i got were "out of shared memory". i'm not sure if they
were caused by SSI activity or not.

YAMAMOTO Takashi

the following is a snippet from my application log:

PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc

PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value

PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc

PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value

PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc

PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value
Re: [HACKERS] SSI bug?
hi,

Kevin Grittner wrote:

> I'm proceeding on this basis. Result attached. I found myself passing
> around the tuple xmin value just about everywhere that the predicate
> lock target tag was being passed, so it finally dawned on me that this
> logically belonged as part of the target tag. That simplified the
> changes, and the net result of following Heikki's suggestion here is
> the reduction of total lines of code by 178 while adding coverage for
> missed corner cases and fixing bugs. Thanks again, Heikki!
>
> I will test this some more tomorrow. So far I haven't done more than
> ensure it passes the standard regression tests and the isolation tests
> added for SSI. The latter took a while because the hash_any function
> was including uninitialized bytes past the length of the tag in its
> calculations. We should probably either fix that or document it. I
> had to add another couple bytes to the tag to get it to a four-byte
> boundary to fix it. Easy once you know that's how it works...
>
> The attached patch can also be viewed here:
>
> http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=46fd5ea6728b566c521ec83048bc00a207289dd9
>
> If this stands up to further testing, the only issue I know of for the
> 9.1 release is to update the documentation of shared memory usage to
> include the new structures.
>
> -Kevin

i tested ede45e90dd1992bfd3e1e61ce87bad494b81f54d + ssi-multi-update-1.patch
with my application and got the following assertion failure.

YAMAMOTO Takashi

Core was generated by `postgres'.
Program terminated with signal 6, Aborted.
#0  0xbbba4cc7 in _lwp_kill () from /usr/lib/libc.so.12
(gdb) bt
#0  0xbbba4cc7 in _lwp_kill () from /usr/lib/libc.so.12
#1  0xbbba4c85 in raise (s=6) at /siro/nbsd/src/lib/libc/gen/raise.c:48
#2  0xbbba445a in abort () at /siro/nbsd/src/lib/libc/stdlib/abort.c:74
#3  0x0833d9f2 in ExceptionalCondition (
    conditionName=0x8493f6d "!(locallock != ((void *)0))",
    errorType=0x8370451 "FailedAssertion", fileName=0x8493ec3 "predicate.c",
    lineNumber=3657) at assert.c:57
#4  0x0827977e in CheckTargetForConflictsIn (targettag=0xbfbfce78)
    at predicate.c:3657
#5  0x0827bd74 in CheckForSerializableConflictIn (relation=0x99b5bad4,
    tuple=0xbfbfcf78, buffer=234) at predicate.c:3772
#6  0x080a5be8 in heap_update (relation=0x99b5bad4, otid=0xbfbfd038,
    newtup=0x99a0aa08, ctid=0xbfbfd03e, update_xmax=0xbfbfd044, cid=1,
    crosscheck=0x0, wait=1 '\001') at heapam.c:2593
#7  0x081c81ca in ExecModifyTable (node=0x99a0566c) at nodeModifyTable.c:624
#8  0x081b2153 in ExecProcNode (node=0x99a0566c) at execProcnode.c:371
#9  0x081b0f82 in standard_ExecutorRun (queryDesc=0x99b378f0,
    direction=ForwardScanDirection, count=0) at execMain.c:1247
#10 0xbbaaf352 in pgss_ExecutorRun (queryDesc=0x99b378f0,
    direction=ForwardScanDirection, count=0) at pg_stat_statements.c:541
#11 0xbbab5ee5 in explain_ExecutorRun (queryDesc=0x99b378f0,
    direction=ForwardScanDirection, count=0) at auto_explain.c:201
#12 0x08288ae3 in ProcessQuery (plan=0x99bb404c,
    sourceText=0x99b3781c "UPDATE file SET ctime = $1 WHERE fileid = $2",
    params=<value optimized out>, dest=0x84d1f38,
    completionTag=0xbfbfd2f0 "") at pquery.c:197
#13 0x08288d0a in PortalRunMulti (portal=0x99b3f01c,
    isTopLevel=<value optimized out>,
    dest=<dwarf2_read_address: Corrupted DWARF expression.>)
    at pquery.c:1268
#14 0x0828984a in PortalRun (portal=0x99b3f01c, count=2147483647,
    isTopLevel=1 '\001', dest=0x99b07428, altdest=0x99b07428,
    completionTag=0xbfbfd2f0 "") at pquery.c:822
#15 0x08286c22 in PostgresMain (argc=2, argv=0xbb9196a4,
    username=0xbb9195f8 "takashi") at postgres.c:2004
#16 0x082413f6 in ServerLoop () at postmaster.c:3590
#17 0x082421a8 in PostmasterMain (argc=3, argv=0xbfbfe594)
    at postmaster.c:1110
#18 0x081e0d09 in main (argc=3, argv=0xbfbfe594) at main.c:199
(gdb) fr 4
#4  0x0827977e in CheckTargetForConflictsIn (targettag=0xbfbfce78)
    at predicate.c:3657
3657		Assert(locallock != NULL);
(gdb) list
3652
3653		locallock = (LOCALPREDICATELOCK *)
3654			hash_search_with_hash_value(LocalPredicateLockHash,
3655										targettag, targettaghash,
3656										HASH_FIND, NULL);
3657		Assert(locallock != NULL);
3658		Assert(locallock->held);
3659		locallock->held = false;
3660
3661		if (locallock->childLocks == 0)
(gdb)
Re: [HACKERS] SSI bug?
hi,

YAMAMOTO Takashi wrote:

>> with your previous patch or not?
>
> With, thanks.

i tried. unfortunately i can still reproduce the original loop problem.

WARNING: [0] target 0xbb51ef18 tag 4000:4017:7e3:78:0 prior 0xbb51f148 next 0xbb51edb0
WARNING: [1] target 0xbb51f148 tag 4000:4017:7e3:77:0 prior 0xbb51ef90 next 0xbb51ef18
WARNING: [2] target 0xbb51ef90 tag 4000:4017:7e3:74:0 prior 0xbb51edb0 next 0xbb51f148
WARNING: [3] target 0xbb51edb0 tag 4000:4017:7e3:71:0 prior 0xbb51ef18 next 0xbb51ef90
WARNING: found a loop

YAMAMOTO Takashi

> -Kevin
Re: [HACKERS] SSI bug?
hi,

> YAMAMOTO Takashi wrote:
>> might be unrelated to the loop problem, but...
>
> Aha! I think it *is* related. There were several places where data was
> uninitialized here; mostly because Dan was working on this piece while
> I was working on separate issues which added the new fields. I missed
> the interaction on integrating the two efforts. :-( The uninitialized
> fields could lead to all the symptoms you saw.
>
> I've reviewed, looking for other similar issues and didn't find any.
> Could you try the attached patch and see if this fixes the issues
> you've seen?

with your previous patch or not?

YAMAMOTO Takashi

> -Kevin
Re: [HACKERS] SSI bug?
hi,

might be unrelated to the loop problem, but...

i got the following SEGV when running vacuum on a table. (the line
numbers in predicate.c are different as i have local modifications.)
oldlocktag.myTarget was NULL. it seems that
TransferPredicateLocksToNewTarget sometimes uses stack garbage for
newpredlocktag.myTarget. vacuum on the table succeeded with the
attached patch. the latter part of the patch was necessary to avoid
targetList corruption, which later seems to make DeleteChildTargetLocks
loop infinitely.

YAMAMOTO Takashi

#0  0x0823cf6c in PredicateLockAcquire (targettag=0xbfbfa734)
    at predicate.c:1835
#1  0x0823f18a in PredicateLockPage (relation=0x99b4dcf0, blkno=1259)
    at predicate.c:2206
#2  0x080ac978 in _bt_search (rel=0x99b4dcf0, keysz=2, scankey=0x99a05040,
    nextkey=0 '\0', bufP=0xbfbfa894, access=1) at nbtsearch.c:97
#3  0x080a996d in _bt_pagedel (rel=0x99b4dcf0, buf=<value optimized out>,
    stack=0x0) at nbtpage.c:1059
#4  0x080aacc2 in btvacuumscan (info=0xbfbfbcc4, stats=0x99a01328,
    callback=0x8184d50 <lazy_tid_reaped>, callback_state=0x99a012e0,
    cycleid=13675) at nbtree.c:981
#5  0x080ab15c in btbulkdelete (fcinfo=0xbfbfb9e0) at nbtree.c:573
#6  0x082fde74 in FunctionCall4 (flinfo=0x99b86958, arg1=3217013956,
    arg2=0, arg3=135810384, arg4=2577404640) at fmgr.c:1437
#7  0x080a4fd0 in index_bulk_delete (info=0xbfbfbcc4, stats=0x0,
    callback=0x8184d50 <lazy_tid_reaped>, callback_state=0x99a012e0)
    at indexam.c:738
#8  0x08184cd4 in lazy_vacuum_index (indrel=0x99b4dcf0, stats=0x99a023e0,
    vacrelstats=0x99a012e0) at vacuumlazy.c:938
#9  0x081854b6 in lazy_vacuum_rel (onerel=0x99b47650, vacstmt=0x99b059d0,
    bstrategy=0x99a07018, scanned_all=0xbfbfcfd8 "") at vacuumlazy.c:762
#10 0x08184265 in vacuum_rel (relid=16424, vacstmt=0x99b059d0,
    do_toast=1 '\001', for_wraparound=0 '\0', scanned_all=0xbfbfcfd8 "")
    at vacuum.c:978
#11 0x081845ea in vacuum (vacstmt=0x99b059d0, relid=0, do_toast=1 '\001',
    bstrategy=0x0, for_wraparound=0 '\0', isTopLevel=1 '\001')
    at vacuum.c:230
#12 0xbbab50c3 in pgss_ProcessUtility (parsetree=0x99b059d0,
    queryString=0x99b05018 "vacuum (verbose,analyze) pgfs.dirent;",
    params=0x0, isTopLevel=1 '\001', dest=0x99b05b80,
    completionTag=0xbfbfd21a "") at pg_stat_statements.c:603
#13 0x082499ea in PortalRunUtility (portal=0x99b33018,
    utilityStmt=0x99b059d0, isTopLevel=1 '\001', dest=0x99b05b80,
    completionTag=0xbfbfd21a "") at pquery.c:1191
#14 0x0824a79e in PortalRunMulti (portal=0x99b33018, isTopLevel=4 '\004',
    dest=0x99b05b80, altdest=0x99b05b80, completionTag=0xbfbfd21a "")
    at pquery.c:1298
#15 0x0824b21a in PortalRun (portal=0x99b33018, count=2147483647,
    isTopLevel=1 '\001', dest=0x99b05b80, altdest=0x99b05b80,
    completionTag=0xbfbfd21a "") at pquery.c:822
#16 0x08247dc7 in exec_simple_query (
    query_string=0x99b05018 "vacuum (verbose,analyze) pgfs.dirent;")
    at postgres.c:1059
#17 0x08248a79 in PostgresMain (argc=2, argv=0xbb912650,
    username=0xbb9125c0 "takashi") at postgres.c:3943
#18 0x0820e231 in ServerLoop () at postmaster.c:3590
#19 0x0820ef88 in PostmasterMain (argc=3, argv=0xbfbfe59c)
    at postmaster.c:1110
#20 0x081b6439 in main (argc=3, argv=0xbfbfe59c) at main.c:199
(gdb) list
1830						 offsetof(PREDICATELOCK, xactLink));
1831
1832		oldlocktag = predlock->tag;
1833		Assert(oldlocktag.myXact == sxact);
1834		oldtarget = oldlocktag.myTarget;
1835		oldtargettag = oldtarget->tag;
1836
1837		if (TargetTagIsCoveredBy(oldtargettag, *newtargettag))
1838		{
1839			uint32		oldtargettaghash;
(gdb)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 722d0f8..4dde6ae 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2537,8 +2558,8 @@ TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
 	if (!found)
 	{
 		SHMQueueInit(&(newtarget->predicateLocks));
-		newpredlocktag.myTarget = newtarget;
 	}
+	newpredlocktag.myTarget = newtarget;
 
 	oldpredlock = (PREDICATELOCK *)
 		SHMQueueNext(&(oldtarget->predicateLocks),
@@ -2588,10 +2609,12 @@ TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
 			outOfShmem = true;
 			goto exit;
 		}
-		SHMQueueInsertBefore(&(newtarget->predicateLocks),
-							 &(newpredlock->targetLink));
-		SHMQueueInsertBefore(&(newpredlocktag.myXact->predicateLocks),
-							 &(newpredlock
Re: [HACKERS] SSI bug?
hi,

> I wrote:
>> it seems likely that such a cycle might be related to this new code
>> not properly allowing for some aspect of tuple cleanup.
>
> I found a couple places where cleanup could let these fall through the
> cracks long enough to get stale and still be around when a tuple ID is
> re-used, causing problems. Please try the attached patch and see if it
> fixes the problem for you. If it does, then there's no need to try to
> track the other things I was asking about.

thanks. unfortunately the problem still happens with the patch.

YAMAMOTO Takashi

> Thanks!
>
> -Kevin
Re: [HACKERS] SSI bug?
hi,

all of the following answers are with the patch you provided in the
other mail applied.

> YAMAMOTO Takashi <y...@mwd.biglobe.ne.jp> wrote:
>> i have seen this actually happen. i've confirmed the creation of the
>> loop with the attached patch. it's easily reproducible with my
>> application. i can provide the full source code of my application if
>> you want. (but it isn't easy to run unless you are familiar with the
>> recent version of NetBSD) i haven't found a smaller reproducible test
>> case yet.
>
> I've never used NetBSD, so maybe a few details will help point me in
> the right direction faster than the source code. Has your application
> ever triggered any of the assertions in the code? (In particular, it
> would be interesting if it ever hit the one right above where you
> patched.)

the assertion right above is sometimes triggered. sometimes not.

> How long was the loop?

see below.

> Did you notice whether the loop involved multiple tuples within a
> single page?

if i understand correctly, yes.

the following is a snippet of my debug code (dump targets when
CheckTargetForConflictsIn loops 1000 times) and its output. the same
locktag_field3 value means the same page, right?

+	for (t = target, i = 0; t != NULL; i++) {
+		elog(WARNING, "[%u] target %p tag %" PRIx32 ":%" PRIx32 ":%" PRIx32
+		    ":%" PRIx16 ":%" PRIx16 " prior %p next %p", i, t,
+		    t->tag.locktag_field1,
+		    t->tag.locktag_field2,
+		    t->tag.locktag_field3,
+		    t->tag.locktag_field4,
+		    t->tag.locktag_field5,
+		    t->priorVersionOfRow,
+		    t->nextVersionOfRow);
+		t = t->priorVersionOfRow;
+		if (t == target) {
+			elog(WARNING, "found a loop");
+			break;
+		}
+	}

WARNING: [0] target 0xbb51f238 tag 4000:4017:53b:6c:0 prior 0xbb51f350 next 0xbb51f350
WARNING: [1] target 0xbb51f350 tag 4000:4017:53b:69:0 prior 0xbb51f238 next 0xbb51f238
WARNING: found a loop

another sample:

WARNING: [0] target 0xbb51f530 tag 4000:4017:565:ae:0 prior 0xbb51f1e8 next 0xbb51f300
WARNING: [1] target 0xbb51f1e8 tag 4000:4017:565:ad:0 prior 0xbb51f580 next 0xbb51f530
WARNING: [2] target 0xbb51f580 tag 4000:4017:565:ac:0 prior 0xbb51f300 next 0xbb51f1e8
WARNING: [3] target 0xbb51f300 tag 4000:4017:565:ab:0 prior 0xbb51f530 next 0xbb51f580
WARNING: found a loop

the table seems mostly hot-updated, if it matters.

hoge=# select * from pg_stat_user_tables where relid=16407;
-[ RECORD 1 ]-----+--------------------
relid             | 16407
schemaname        | pgfs
relname           | file
seq_scan          | 0
seq_tup_read      | 0
idx_scan          | 53681
idx_tup_fetch     | 52253
n_tup_ins         | 569
n_tup_upd         | 12054
n_tup_del         | 476
n_tup_hot_upd     | 12041
n_live_tup        | 93
n_dead_tup        | 559
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze  |
vacuum_count      | 0
autovacuum_count  | 0
analyze_count     | 4922528128875102208
autoanalyze_count | 7598807461784802080

(values in the last two columns seem bogus. i don't know if it's
related or not.)

> Did this coincide with an autovacuum of the table?

no. (assuming that autovacuum=off in postgresql.conf is enough to
exclude the possibility.)

> These last two are of interest because it seems likely that such a
> cycle might be related to this new code not properly allowing for some
> aspect of tuple cleanup.
>
> Thanks for finding this and reporting it, and thanks in advance for
> any further detail you can provide.

thanks for looking.

YAMAMOTO Takashi

> -Kevin
Re: [HACKERS] SSI bug?
hi,

> YAMAMOTO Takashi <y...@mwd.biglobe.ne.jp> wrote:
>> it seems that PredicateLockTupleRowVersionLink sometimes creates a
>> loop of targets (it finds an existing 'newtarget' whose
>> nextVersionOfRow chain points to the 'oldtarget') and it later causes
>> CheckTargetForConflictsIn to loop forever.
>
> Is this a hypothetical risk based on looking at the code, or have you
> seen this actually happen? Either way, could you provide more details?
> (A reproducible test case would be ideal.)

i have seen this actually happen. i've confirmed the creation of the
loop with the attached patch. it's easily reproducible with my
application. i can provide the full source code of my application if
you want. (but it isn't easy to run unless you are familiar with the
recent version of NetBSD) i haven't found a smaller reproducible test
case yet.

YAMAMOTO Takashi

> This being the newest part of the code, I'll grant that it is the most
> likely to have an unidentified bug; but given that the pointers are
> from one predicate lock target structure identified by a tuple ID to
> one identified by the tuple ID of the next version of the row, it
> isn't obvious to me how a cycle could develop.
> -Kevin

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 722d0f8..3e1a3e2 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2350,7 +2350,25 @@ PredicateLockTupleRowVersionLink(const Relation relation,
 		newtarget->nextVersionOfRow = NULL;
 	}
 	else
+	{
 		Assert(newtarget->priorVersionOfRow == NULL);
+#if 0
+		Assert(newtarget->nextVersionOfRow == NULL);
+#endif
+		if (newtarget->nextVersionOfRow != NULL) {
+			PREDICATELOCKTARGET *t;
+
+			elog(WARNING, "new %p has next %p\n",
+			    newtarget, newtarget->nextVersionOfRow);
+			for (t = newtarget->nextVersionOfRow; t != NULL;
+			    t = t->nextVersionOfRow) {
+				if (oldtarget != t) {
+					elog(WARNING, "creating a loop new=%p old=%p\n",
+					    newtarget, oldtarget);
+				}
+			}
+		}
+	}
 
 	newtarget->priorVersionOfRow = oldtarget;
 	oldtarget->nextVersionOfRow = newtarget;
[HACKERS] SSI bug?
hi,

it seems that PredicateLockTupleRowVersionLink sometimes creates a loop
of targets (it finds an existing 'newtarget' whose nextVersionOfRow
chain points to the 'oldtarget') and it later causes
CheckTargetForConflictsIn to loop forever.

YAMAMOTO Takashi
Re: [HACKERS] [NOVICE] systable_getnext_ordered
hi,

> I wrote:
>> y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
>>> after systable_getnext_ordered returned NULL, is it ok to call it
>>> again?
>> I wouldn't rely on it working.
>>> i'm wondering because inv_truncate seems to do it and expects NULL.
>> Hmm, that may well be a bug. Have you tested it?
>
> I looked at this a bit more closely, and basically the answer is that
> inv_truncate is accidentally failing to fail. What will actually
> happen if systable_getnext_ordered is called another time, at least
> when using a btree index, is that it will start the same requested
> scan over again, ie deliver all the tuples it did the first time
> (which is none, in this case). That's an implementation artifact, but
> since the behavior is undefined in the first place, it's not wrong.
>
> Now, if inv_truncate's initial call on systable_getnext_ordered
> returns NULL (ie, the truncation point is past the current EOF page),
> it will fall through to the "write a brand new page" code, which will
> create and insert a partial page at the truncation point. It then
> goes to the delete-all-remaining-pages loop. Because that starts a
> fresh scan with the very same scan key conditions, you might expect
> that it would find and delete the page it just inserted --- causing
> the apparent EOF of the blob to be wrong afterwards. It accidentally
> fails to do that because the new tuple postdates the snapshot it's
> scanning with. So the loop terminates having found no matching
> tuples, and all is well.
>
> So this code is confusing, inefficient (performing a useless search of
> the index), only works because of an obscure consideration not
> explained in the comments, and sets a bad precedent for people to
> follow. I'm going to go change it to explicitly not do the final loop
> if the initial search failed. It's not a bug, exactly, but it's sure
> lousy coding. Thanks for pointing it out.

thanks for the quick investigation and fix.

the attached patch is to avoid unnecessary detoast'ing and EOF marker
pages when possible. does it make sense?
YAMAMOTO Takashi

> regards, tom lane

diff --git a/src/backend/storage/large_object/inv_api.c b/src/backend/storage/large_object/inv_api.c
index 01e3492..60a9c69 100644
--- a/src/backend/storage/large_object/inv_api.c
+++ b/src/backend/storage/large_object/inv_api.c
@@ -178,10 +178,14 @@ myLargeObjectExists(Oid loid, Snapshot snapshot)
 static int32
 getbytealen(bytea *data)
 {
-	Assert(!VARATT_IS_EXTENDED(data));
-	if (VARSIZE(data) < VARHDRSZ)
-		elog(ERROR, "invalid VARSIZE(data)");
-	return (VARSIZE(data) - VARHDRSZ);
+	Size		size;
+
+	size = toast_raw_datum_size(PointerGetDatum(data));
+	if (size < VARHDRSZ)
+		elog(ERROR, "invalid size");
+	if (size > VARHDRSZ + LOBLKSIZE)
+		elog(ERROR, "too large page");
+	return (size - VARHDRSZ);
 }
 
@@ -359,22 +363,12 @@ inv_getsize(LargeObjectDesc *obj_desc)
 	{
 		Form_pg_largeobject data;
 		bytea	   *datafield;
-		bool		pfreeit;
 
 		if (HeapTupleHasNulls(tuple))	/* paranoia */
 			elog(ERROR, "null field found in pg_largeobject");
 		data = (Form_pg_largeobject) GETSTRUCT(tuple);
 		datafield = &(data->data);	/* see note at top of file */
-		pfreeit = false;
-		if (VARATT_IS_EXTENDED(datafield))
-		{
-			datafield = (bytea *)
-				heap_tuple_untoast_attr((struct varlena *) datafield);
-			pfreeit = true;
-		}
 		lastbyte = data->pageno * LOBLKSIZE + getbytealen(datafield);
-		if (pfreeit)
-			pfree(datafield);
 	}
 	systable_endscan_ordered(sd);
@@ -724,8 +718,9 @@ inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
 void
 inv_truncate(LargeObjectDesc *obj_desc, int len)
 {
-	int32		pageno = (int32) (len / LOBLKSIZE);
-	int			off;
+	const int32 pageno = (int32) (len / LOBLKSIZE);
+	int32		reqpageno;
+	const int	off = len % LOBLKSIZE;	/* offset in the page */
 	ScanKeyData skey[2];
 	SysScanDesc sd;
 	HeapTuple	oldtuple;
@@ -741,6 +736,7 @@ inv_truncate(LargeObjectDesc *obj_desc, int len)
 	Datum		values[Natts_pg_largeobject];
 	bool		nulls[Natts_pg_largeobject];
 	bool		replace[Natts_pg_largeobject];
+	bool		prevpagefull;
 	CatalogIndexState indstate;
 
 	Assert(PointerIsValid(obj_desc));
@@ -770,10 +766,20 @@ inv_truncate(LargeObjectDesc *obj_desc, int len
Re: [HACKERS] [NOVICE] systable_getnext_ordered
hi,

thanks for taking a look.

> y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
>> the attached patch is to avoid unnecessary detoast'ing and EOF marker
>> pages when possible. does it make sense?
>
> The blob page size is already chosen not to allow for out-of-line
> storage, not to mention that pg_largeobject doesn't have a TOAST
> table. So I think avoiding detoasting is largely a waste of time.

doesn't detoasting involve decompression?

> I'm unexcited about the other consideration too --- it looks to me
> like it just makes truncation slower, more complicated, and hence more
> bug-prone, in return for a possible speedup that probably nobody will
> ever notice.

slower? it depends, i guess. my primary motivation for that part of the
patch was to save some space for certain workloads. (besides that,
leaving unnecessary rows isn't neat, but that might be a matter of
taste.)

YAMAMOTO Takashi

> regards, tom lane
>
> --
> Sent via pgsql-novice mailing list (pgsql-nov...@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-novice