Re: [PATCH] Exorcise "zero-dimensional" arrays (Was: Re: [HACKERS] Should array_length() Return NULL)
On 2013-04-03 20:58, Gavin Flower wrote: On 04/04/13 05:36, David E. Wheeler wrote: On Apr 3, 2013, at 9:30 AM, Tom Lane wrote: Fortran ... Basic ... actually I'd have thought that zero was a minority position. Fashions change I guess. I say we turn the default lower bound up to 11. David In keeping with the level of irrationality in this thread, maybe we should set it to an irrational number like the square root of 2, or transcend ourselves and make it a transcendental number like pi! :-) I suppose using the square root of minus one would be considered too imaginative??? :-) Nah, that would make arrays have 2 dimensions as a minimum... :-) Cheers, Gavin -- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review]
On Thursday, April 04, 2013 2:52 AM Robert Haas wrote: > On Wed, Apr 3, 2013 at 2:54 PM, Tom Lane wrote: > > Robert Haas writes: > >> On Tue, Apr 2, 2013 at 12:19 PM, Peter Eisentraut > wrote: > >>> It's weird that SET LOCAL and SET SESSION actually *set* the value, > and > >>> the second key word determines how long the setting will last. SET > >>> PERSISTENT doesn't actually set the value. I predict that this > will be > >>> a new favorite help-it-doesn't-work FAQ. > > > >> I think this is another argument against this particular syntax. I > >> have always thought that something along the lines of ALTER SYSTEM > >> would be more appropriate. ALTER DATABASE .. SET and ALTER ROLE .. > >> SET don't change the value immediately either, and nobody gets > >> confused about that to my knowledge. But I can see where SET > >> PERSISTENT could cause that sort of confusion. > > > > Yeah, I think I argued for using the SET syntax to start with, but > > I'm coming around to the position that SET PERSISTENT is too much > > unlike the behavior of other varieties of SET. ALTER is sounding > > more attractive to me now. Not sure about "ALTER SYSTEM" in > particular > > though --- it's not clear that that has any real merit other than > > already existing as a keyword. (Not that that's negligible.) > > ALTER CONFIGURATION is another alternative using an existing keyword > > that might be worth considering. > > Yeah, I thought about something like that. Aside from saving on > keywords, the reason I like ALTER SYSTEM or similar is that I suspect > there will be other system-wide things that we may want to let people > ALTER in the future, so I think that route might avoid an unnecessary > proliferation of top-level commands. I am not, however, deadly > attached to the idea, if someone's got a good reason for preferring > something else. I think a second parameter in the SET command telling the scope should be fine. As far as I can see, Oracle also has similar syntax for its ALTER SYSTEM command (Alter System Scope [Memory|Spfile|Both]). Description in short: SPFILE indicates that the change is made in the server parameter file. The new setting takes effect when the database is next shut down and started up again. MEMORY indicates that the change is made in memory, takes effect immediately, and persists until the database is shut down. The only reason to show the above example is that a second parameter specifying the scope exists in other databases as well. However, if you are not convinced by the above reasoning, then the ALTER syntax could be as follows: ALTER SYSTEM SET configuration_parameter {TO | =} {value | 'value'}; With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page replacement algorithm in buffer cache
On Thursday, April 04, 2013 7:19 AM Greg Smith wrote: > On 4/2/13 11:54 AM, Robert Haas wrote: > > But, having said that, I still think the best idea is what Andres > > proposed, which pretty much matches my own thoughts: the bgwriter > > needs to populate the free list, so that buffer allocations don't > have > > to wait for linear scans of the buffer array. > > I was hoping this one would make it to a full six years of being on the > TODO list before it came up again, missed it by a few weeks. The > funniest part is that Amit even submitted a patch on this theme a few > months ago without much feedback: > http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs > That stalled where a few things have, on a) needing more regression > test workloads, and b) wondering just what the deal with large > shared_buffers setting degrading performance was. For "b)", below are links where performance decreased due to large shared buffers. http://www.postgresql.org/message-id/attachment/27489/Results.htm http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C38285442C5@szxeml509-mbx As per my observation, it occurs when I/O starts. The dip could be due to fluctuation, OS scheduling, or eviction of dirty pages sooner than would otherwise happen. I think further investigation would be more meaningful if the results could be reproduced by someone other than me. One way to proceed along this line could be to start with this patch and then, based on the results, do further experiments to make it more useful. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Drastic performance loss in assert-enabled build in HEAD
2013/4/3 Tom Lane : > Kevin Grittner writes: > >> To be honest, I don't think I've personally seen a single use case >> for matviews where they could be used if you couldn't count on an >> error if attempting to use them without the contents reflecting a >> materialization of the associated query at *some* point in time. > > Well, if we remove the WITH NO DATA clause from CREATE MATERIALIZED > VIEW, that minimum requirement is satisfied no? An argument against that is that computing the contents may be very expensive. > Granting that throwing an error is actually of some use to some people, > I would not think that people would want to turn it on via a command > that throws away the existing view contents altogether, nor turn it off > with a full-throated REFRESH. There are going to need to be ways to > incrementally update matviews, and ways to disable/enable access that > are not tied to a complete rebuild, not to mention being based on > user-determined rather than hard-wired criteria for what's too stale. > So I don't think this is a useful base to build on. Am I correct when I think that you are saying here, that the “zero pages == unscannable” logic is not very future-proof? In that case I concur, and I also think that this knowledge leaks in way too many other places (the VACUUM bug mentioned by Kevin is a good example). > If you feel that scannability disable is an absolute must for version 0, > let's invent a matview reloption or some such to implement it and let > users turn it on and off as they wish. That seems a lot more likely > to still be useful two years from now. (In the context of making an unlogged matview unscannable after a crash:) Is it imaginable that such a reloption could (in a future implementation) be changed during or right after crash recovery? For example, by storing the set of “truncated by crash recovery” relations in a shared catalog table, which is then inspected when connecting to a database to continue the truncation (in the case of a matview by making it unscannable)? > And if you're absolutely convinced that unlogged matviews mustn't work as I > suggest, we can lose those from 9.3, too. +1. Having unlogged matviews without having incremental updates yet, isn’t super useful anyway. > What I'd actually rather see us spending time on right now is making > some provision for incremental updates, which I will boldly propose > could be supported by user-written triggers on the underlying tables > if we only diked out the prohibitions against INSERT/UPDATE/DELETE on > matviews, and allowed them to operate on a matview's contents just like > it was a table. Now admittedly that would foreclose allowing matviews > to be updatable in the updatable-view sense, but that's a feature I > would readily give up if it meant users could build incremental update > mechanisms this year and not two years down the road. Please make the syntax for updating the “extent” (physical representation) of a matview different from updating the view’s logical contents. Examples: (1) Require to use a special function to update the extent: SELECT pg_mv_maintain('INSERT INTO example_matview ...'); While parsing the INSERT, the parser would know that it must interpret “example_matview” as the matview’s extent; As currently the extent and the view are the same, nothing must be done except for only allowing the INSERT when it is parsed in the context of pg_mv_maintain, and otherwise saying that matviews aren’t updatable yet (“NOTICE: did you mean to update the extent? 
in that case use pg_mv_maintain”). (2) Use a different schema (cf. TOAST) for the extent, e.g., view “public.example_matview” vs. extent “pg_mv_extent.example_matview”. I imagine future implementations to possibly require multiple extents anyway, e.g., for storing the “not yet applied changesets” or other intermediate things. > Why exactly do you think that what I'm suggesting would cause a dump and > reload to not regenerate the data? Expensiveness: Matviews are used in cases where the data is expensive to compute. > We *need* to get rid of that aspect of things. If you must have > scannability state in version 0, okay, but it has to be a catalog property > not this. +1 Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE EXTENSION BLOCKS
Hi, I thought we were more specific about the extension object itself not living in a schema in our documentation, but I agree we still have room for progress here. "David E. Wheeler" writes: > +Note that only the extension objects will be placed into the named > +schema; the extension itself is a database-global object. I think you're patching the right place, but I'm not sure about the term "database-global object", which I can't find by grepping in sgml/ref. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] corrupt pages detected by enabling checksums
On 4 April 2013 02:39, Andres Freund wrote: > Ok, I think I see the bug. And I think its been introduced in the > checkpoints patch. Well spotted. (I think you mean checksums patch). > If by now the first backend has proceeded to PageSetLSN() we are writing > different data to disk than the one we computed the checksum of > before. Boom. Right, so nothing else we were doing was wrong, that's why we couldn't spot a bug. The problem is that we aren't replaying enough WAL because the checksum on the WAL record is broken. > I think the whole locking interactions in MarkBufferDirtyHint() need to > be thought over pretty carefully. When we write out a buffer with checksums enabled, we take a copy of the buffer so that the checksum is consistent, even while other backends may be writing hints to the same buffer. I missed out on doing that with XLOG_HINT records, so the WAL CRC can be incorrect because the data is scanned twice; normally that would be OK because we have an exclusive lock on the block, but with hints we only have share lock. So what we need to do is take a copy of the buffer before we do XLogInsert(). Simple patch to do this attached for discussion. (Not tested). We might also do this by modifying the WAL record to take the whole block and bypass the BkpBlock mechanism entirely. But that's more work and doesn't seem like it would be any cleaner. I figure let's solve the problem first, then discuss which approach is best. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services copy_before_XLOG_HINT.v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page replacement algorithm in buffer cache
On Wed, Apr 3, 2013 at 9:49 PM, Greg Smith wrote: > On 4/2/13 11:54 AM, Robert Haas wrote: >> But, having said that, I still think the best idea is what Andres >> proposed, which pretty much matches my own thoughts: the bgwriter >> needs to populate the free list, so that buffer allocations don't have >> to wait for linear scans of the buffer array. > > I was hoping this one would make it to a full six years of being on the TODO > list before it came up again, missed it by a few weeks. The funniest part > is that Amit even submitted a patch on this theme a few months ago without > much feedback: > http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs > That stalled where a few things have, on a) needing more regression test > workloads, and b) wondering just what the deal with large shared_buffers > setting degrading performance was. Those are impressive results. I think we should seriously consider doing something like that for 9.4. TBH, although more workloads to test is always better, I don't think this problem is so difficult that we can't have some confidence in a theoretical analysis. If I read the original thread correctly (and I haven't looked at the patch itself), the proposed patch would actually invalidate buffers before putting them on the freelist. That effectively amounts to reducing shared_buffers, so workloads that are just on the edge of what can fit in shared_buffers will be harmed, and those that benefit incrementally from increased shared_buffers will be as well. What I think we should do instead is collect the buffers that we think are evictable and stuff them onto the freelist without invalidating them. When a backend allocates from the freelist, it can double-check that the buffer still has usage_count 0. The odds should be pretty good. But even if we sometimes notice that the buffer has been touched again after being put on the freelist, we haven't expended all that much extra effort, and that effort happened mostly in the background. Consider a scenario where only 10% of the buffers have usage count 0 (which is not unrealistic). We scan 5000 buffers and put 500 on the freelist. Now suppose that, due to some accident of the workload, 75% of those buffers get touched again before they're allocated off the freelist (which I believe to be a pessimistic estimate for most workloads). Now, that means that only 125 of those 500 buffers will succeed in satisfying an allocation request. That's still a huge win, because it means that each backend only has to examine an average of 4 buffers before it finds one to allocate. If it had needed to do the freelist scan itself, it would have had to touch 40 buffers before finding one to allocate. In real life, I think the gains are apt to be, if anything, larger. IME, it's common for most or all of the buffer pool to be pinned at usage count 5. So you could easily have a situation where the arena scan has to visit millions of buffers to find one to allocate. If that's happening in the background instead of the foreground, it's a huge win. Also, note that there's nothing to prevent the arena scan from happening in parallel with allocations off of the freelist - so while foreground processes are emptying the freelist, the background process can be looking for more things to add to it. ...
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
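To make the two-step scheme described above concrete, here is a rough sketch (not from any posted patch): the background writer pushes apparently-evictable buffers onto the freelist without invalidating them, and backends double-check usage_count when allocating. The names BgWriterFillFreeList, GetVictimBuffer, PushFreeList, PopFreeList and next_victim are invented for illustration, and locking/partitioning details of the real freelist.c are glossed over.

static int	next_victim = 0;

/*
 * Sketch only: the background writer scans a quota of buffer headers and
 * queues the ones that look evictable, WITHOUT invalidating them.
 */
static void
BgWriterFillFreeList(int scan_quota)
{
	while (scan_quota-- > 0)
	{
		volatile BufferDesc *buf = &BufferDescriptors[next_victim++ % NBuffers];

		LockBufHdr(buf);
		if (buf->refcount == 0 && buf->usage_count == 0)
			PushFreeList(buf);		/* remember it; do NOT invalidate it */
		UnlockBufHdr(buf);
	}
}

/*
 * Sketch only: a backend takes a candidate off the freelist and re-checks
 * that it is still evictable before reusing it.
 */
static volatile BufferDesc *
GetVictimBuffer(void)
{
	volatile BufferDesc *buf;

	while ((buf = PopFreeList()) != NULL)
	{
		LockBufHdr(buf);
		if (buf->refcount == 0 && buf->usage_count == 0)
			return buf;				/* caller evicts it, still holding the header lock */
		UnlockBufHdr(buf);			/* touched since it was queued; skip it */
	}
	return NULL;					/* freelist empty: fall back to the clock sweep */
}

In the 10%/75% scenario above, the re-check in GetVictimBuffer is what keeps a stale freelist entry cheap: a failed candidate costs one header lock in the backend, while the expensive scanning stays in the background process.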
[HACKERS] Number of spinlocks
With --disable-spinlocks, we need to know the number of spinlocks in the system at startup, so that we can reserve enough semaphores to mimic the spinlocks. It's calculated in SpinlockSemas():

/*
 * Report number of semaphores needed to support spinlocks.
 */
int
SpinlockSemas(void)
{
	/*
	 * It would be cleaner to distribute this logic into the affected modules,
	 * similar to the way shmem space estimation is handled.
	 *
	 * For now, though, we just need a few spinlocks (10 should be plenty)
	 * plus one for each LWLock and one for each buffer header. Plus one
	 * for each XLog insertion slot in xlog.c.
	 */
	return NumLWLocks() + NBuffers + 10 + NUM_XLOGINSERT_SLOTS;
}

Ten spinlocks might've been plenty in 2001 when that comment was written, but we have reached that number. I grepped the sources for SpinLockInit, and found that we use:

1 spinlock for each LWLock
1 spinlock for each buffer
1 spinlock for each wal sender process slot (max_wal_senders in total)
1 spinlock for each partitioned hash table (2 in predicate.c, 2 in lock.c, 1 in buf_table.c)
1 spinlock in XLogCtl->info_lck
1 spinlock for WAL receiver (WalRcv->mutex)
1 spinlock for hot standby xid tracking (procArray->known_assigned_xids_lck)
1 spinlock for shared memory allocator (ShmemLock)
1 spinlock for shared inval messaging (shmInvalBuffer->msgnumLock)
1 spinlock for the proc array freelist (ProcStructLock)
1 spinlock for fast-path lock mechanism (FastPathStrongRelationLocks->mutex)
1 spinlock for the checkpointer (CheckpointerShmem->ckpt_lck)

That's a fixed number of 13 spinlocks, plus 1 for each LWLock, buffer, and wal sender. I'll go and adjust SpinlockSemas() to take the walsenders into account, and bump the fixed number from 10 to 30. That should be enough headroom for the next ten years.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
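For reference, the adjustment described above would amount to something like this; it is just the stated plan restated as code, not the committed change:

int
SpinlockSemas(void)
{
	/*
	 * A few dozen spinlocks for the various fixed users (about 13 today;
	 * 30 leaves headroom), plus one for each LWLock, buffer, XLog
	 * insertion slot, and walsender slot.
	 */
	return NumLWLocks() + NBuffers + NUM_XLOGINSERT_SLOTS + max_wal_senders + 30;
}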
Re: [HACKERS] corrupt pages detected by enabling checksums
On 2013-04-04 13:30:40 +0100, Simon Riggs wrote: > On 4 April 2013 02:39, Andres Freund wrote: > > > Ok, I think I see the bug. And I think its been introduced in the > > checkpoints patch. > > Well spotted. (I think you mean checksums patch). Heh, yes. I was slightly tired at that point ;) > > If by now the first backend has proceeded to PageSetLSN() we are writing > > different data to disk than the one we computed the checksum of > > before. Boom. > > Right, so nothing else we were doing was wrong, that's why we couldn't > spot a bug. The problem is that we aren't replaying enough WAL because > the checksum on the WAL record is broken. Well, there might be other bugs we can't see yet... But let's hope not. > > I think the whole locking interactions in MarkBufferDirtyHint() need to > > be thought over pretty carefully. > > When we write out a buffer with checksums enabled, we take a copy of > the buffer so that the checksum is consistent, even while other > backends may be writing hints to the same buffer. > > I missed out on doing that with XLOG_HINT records, so the WAL CRC can > be incorrect because the data is scanned twice; normally that would be > OK because we have an exclusive lock on the block, but with hints we > only have share lock. So what we need to do is take a copy of the > buffer before we do XLogInsert(). > > Simple patch to do this attached for discussion. (Not tested). > > We might also do this by modifying the WAL record to take the whole > block and bypass the BkpBlock mechanism entirely. But that's more work > and doesn't seem like it would be any cleaner. I figure let's solve the > problem first, then discuss which approach is best. Unfortunately I find that approach unacceptably ugly. I don't think it's ok to make an already *very* complicated function (XLogInsert()) in one of the most complicated files (xlog.c) even more complicated. It literally took me years to understand a large percentage of it. I even think the XLOG_HINT escape hatch should be taken out and be replaced by something outside of XLogInsert(). Don't think that approach would work with as few lines anyway, since you need to properly pass it into XLogCheckBuffer() et al which checks the buffer tags and such - which it needs to properly compute the struct BkpBlock. Also we iterate over rdata->page for the CRC computation. I think the route you quickly sketched is more realistic. That would remove all knowledge about XLOG_HINT from generic code which is a very good thing, I spent like 15 minutes yesterday wondering whether the early return in there might be the cause of the bug... Something like:

XLogRecPtr
XLogSaveBufferForHint(Buffer buffer)
{
	XLogRecPtr	recptr = InvalidXLogRecPtr;
	XLogRecPtr	lsn;
	XLogRecData rdata[2];
	BkpBlock	bkp;

	/*
	 * make sure no checkpoint can happen while we decide about logging
	 * something or not, since we don't hold any lock preventing that
	 * otherwise.
	 */
	Assert(MyPgXact->delayChkpt);

	/* update RedoRecPtr */
	GetRedoRecPtr();

	/* setup phony rdata element */
	rdata[0].buffer = buffer;
	rdata[0].buffer_std = true;	/* is that actually ok? */

	/* force full pages on, otherwise checksums won't work? */
	if (XLogCheckBuffer(rdata, true, &lsn, &bkp))
	{
		char		copied_buffer[BLCKSZ];

		/*
		 * copy buffer so we don't have to worry about concurrent hint bit
		 * or lsn updates. We assume pd_lower/upper cannot be changed
		 * without an exclusive lock, so the contents of bkp are not racy.
		 */
		memcpy(copied_buffer, BufferGetBlock(buffer), BLCKSZ);

		rdata[0].data = (char *) &bkp;
		rdata[0].len = sizeof(BkpBlock);
		rdata[0].buffer = InvalidBuffer;
		rdata[0].buffer_std = false;
		rdata[0].next = &(rdata[1]);

		rdata[1].data = copied_buffer;
		rdata[1].len = BLCKSZ;
		rdata[1].buffer = InvalidBuffer;
		rdata[1].buffer_std = false;
		rdata[1].next = NULL;

		recptr = XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
	}

	return recptr;
}

That should work? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Why there is a PG_GETARG_UINT32 and PG_RETURN_UINT32?
Hi guys. I am wondering when I can use the PG_GETARG_UINT32 and PG_RETURN_UINT32. If postgres has no unsigned int type, what is the use of these macros?
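For context: those macros are defined in fmgr.h and simply reinterpret the 32-bit value carried in the Datum as unsigned. The typical reason to use them is a C-level function whose argument is, internally, an unsigned 32-bit quantity even though SQL itself has no unsigned integer type. A minimal, purely hypothetical V1 function using them might look like this:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(add_one_u32);

Datum
add_one_u32(PG_FUNCTION_ARGS)
{
	/* treat the 32-bit argument as unsigned; wraparound is well defined */
	uint32		arg = PG_GETARG_UINT32(0);

	PG_RETURN_UINT32(arg + 1);
}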
Re: [PATCH] Exorcise "zero-dimensional" arrays (Was: Re: [HACKERS] Should array_length() Return NULL)
On Wed, Apr 3, 2013 at 11:11 PM, Tom Lane wrote: > In any case, the whole exercise is pointless if we don't change the > visible behavior of array_dims et al. So I think the idea that this > would be without visible consequence is silly. What's up for argument > is just how much incompatibility is acceptable. The only reasonable answer for this (a provably used, non-security, non-standards violating, non-gross functionality breakage case) is *zero*. Our historically cavalier attitude towards compatibility breakage has been an immense disservice to our users and encourages very bad upgrade habits and is, IMNSHO, embarrassing. Changing the way array_dims works for a minor functionality enhancement is gratuitous and should be done, if at all, via a loudly advertised deprecation/replacement cycle with a guarding GUC (yes, I hate them too, but not nearly as much as the expense of qualifying vast code bases against random compatibility breakages every release). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi-pass planner
On Wed, Apr 3, 2013 at 9:40 PM, Greg Stark wrote: > I used to advocate a similar idea. But when questioned on list I tried to > work out the details and ran into some problem coming up with a concrete > plan. > > How do you compare a plan that you think has a 99% chance of running in 1ms > but a 1% chance of taking 1s against a plan that has a 90% chance of 1ms and > a 10% chance of taking 100ms? Which one is actually riskier? They might even > both have the same 95% percentile run-time. > > And additionally there are different types of unknowns. Do you want to treat > plans where we have a statistical sample that gives us a probabilistic > answer the same as plans where we think our model just has a 10% chance of > being wrong? The model is going to either be consistently right or > consistently wrong for a given query but the sample will vary from run to > run. (Or vice versa depending on the situation). One idea that someone threw up against a wall recently during an EnterpriseDB development meeting - I think it was Kevin Grittner but it might have been Noah Misch, or some combination - was to have a GUC for estimate_worstcase_fraction. So, when computing the cost of a path, we'd compute our current expected-case estimate, and also a worst-case estimate, and then compute the final cost as:

estimated_cost = estimate_worstcase_fraction * worst_case_estimate
               + (1 - estimate_worstcase_fraction) * expected_case_estimate

I think Kevin and I both have the intuition that even a rather small value for estimate_worstcase_fraction - like 0.01 or 0.001 or even smaller - would prevent a pretty significant fraction of the problems people encounter in this area. But users could change it, in general or for particular queries, if it ended up being wrong in their environment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
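As a trivial illustration, the blending could look like this in the costing code; a sketch only, assuming a GUC-backed global named estimate_worstcase_fraction existed:

/* Sketch only: blend the two estimates into the cost actually compared. */
static Cost
blended_cost(Cost expected_case_estimate, Cost worst_case_estimate)
{
	return estimate_worstcase_fraction * worst_case_estimate +
		   (1.0 - estimate_worstcase_fraction) * expected_case_estimate;
}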
Re: [HACKERS] CREATE EXTENSION BLOCKS
On Apr 4, 2013, at 5:16 AM, Dimitri Fontaine wrote: > "David E. Wheeler" writes: >> +Note that only the extension objects will be placed into the named >> +schema; the extension itself is a database-global object. > > I think you're patching the right place, but I'm not sure about the term > "database-global object", that I can't find by grepping in sgml/ref. Yeah, I wasn't sure, either, but figured someone here would know what to call those sorts of things. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] corrupt pages detected by enabling checksums
On 4 April 2013 15:53, Andres Freund wrote: > Unfortunately I find that approach unacceptably ugly. Yeh. If we can confirm it's a fix, we can discuss a cleaner patch, and that is much better. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql crash fix
On Tue, Apr 2, 2013 at 08:48:53PM -0400, Bruce Momjian wrote: > I found that psql will crash if given a PSQLRC value containing a tilde: > > $ PSQLRC="~/x" psql test > *** glibc detected *** psql: free(): invalid pointer: > 0x7fffb7c933ec *** > > This is on Debian Squeeze 6.0.7. The fix is to pstrdup() the value > returned by getenv(), so it can be free()'ed later --- you can't free > getenv()-returned values: > >As typically implemented, getenv() returns a pointer to a string >within the environment list. The caller must take care not to >modify this string, since that would change the environment of >the process. > > This bug exists in 9.2 and git head. I also removed the return value > from expand_tilde() as no caller was using it. Applied. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
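The pattern described above, sketched out; the helper names around the copy are illustrative rather than the exact committed code:

/*
 * Sketch of the fix described above: take our own copy of the environment
 * value, since the pointer returned by getenv() must never be modified or
 * free()'d, and expand_tilde() may do both to the string it is given.
 */
const char *envval = getenv("PSQLRC");

if (envval != NULL)
{
	char	   *psqlrc = pstrdup(envval);	/* private copy: safe to modify and free */

	expand_tilde(&psqlrc);
	process_psqlrc_file(psqlrc);			/* illustrative helper name */
	free(psqlrc);
}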
Re: [HACKERS] Regex with > 32k different chars causes a backend crash
On 04.04.2013 03:32, Noah Misch wrote: On Wed, Apr 03, 2013 at 08:09:15PM +0300, Heikki Linnakangas wrote:

--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -148,6 +148,7 @@
 typedef short color;		/* colors of characters */
 typedef int pcolor;		/* what color promotes to */
+#define MAX_COLOR	32767	/* max value that fits in 'color' datatype */

This should use SHRT_MAX, no? (Not that any supported platform differs here.)

I considered that, but I got all confused on whether limits.h needs to be included and where, if we use that. So I just used a constant 32767 in the end. Committed that way. I opened a ticket in TCL bug tracker for this: https://sourceforge.net/tracker/?func=detail&aid=3610026&group_id=10894&atid=110894. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] puzzling JSON bug
David Wheeler has presented me with a nasty bug case. If I do this: select '{"members": { "add": [3, 4]}}'::json #> '{members,add}'; then I get a crash. If I comment out the pfree() at json.c:parse_object_field() lines 378-9 then I get back the right result but instead get a warning like this: WARNING: problem in alloc set ExprContext: bogus aset link in block 0x1efaa80, chunk 0x1efb1f0 I'm not quite sure where I should go looking for what I've done wrong here. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Join cost estimates
Tom, all, * Tom Lane (t...@sss.pgh.pa.us) wrote: > So that's at least going in the right direction. I agree that this is going in the right direction; it certainly would make the plan that I *expect* to be chosen more likely, however.. I've been fiddling with this on the very much larger overall database where this test case came from and have found that hashing the large table can actually be *faster* and appears to cause a more consistent and constant amount of disk i/o (which is good). The test case exhibits a bit of why this is the case- the per-tuple hash lookup is way closer to the per-tuple cost of building the hash table than I'd expect. per-tuple cost to build the hash table (41M tuples): 0.33us per-tuple cost to scan/do hash lookups (41M tuples): 0.29us (with a small hash table of only 41K entries) The total difference being: 1233.854 vs. 1428.424, or only 194.570ms in favor of scanning the big table instead of hashing it. These numbers are just from those posted with my original email: http://explain.depesz.com/s/FEq http://explain.depesz.com/s/FOU I've seen much worse though- I have one case where hash-the-big-table took 5s and hash-the-small-table took almost 10s (total times). I'm trying to see if I can pull that out and isolate how it's different (and see if it was just due to other load on the box). What I'm trying to get at in this overall email is: why in the world is it so expensive to do hash lookups? I would have expected the cost of the hash table to be *much* more than the cost to do a hash lookup, and that doing hash lookups against a small hash table would be fast enough to put serious pressure on the i/o subsystem. Instead, the building of the hash table actually puts more pressure and can end up being more efficient overall. We have a process that basically does this a whole bunch and the "hash-the-big-table" operation takes about 4.7 hrs, while the "hash-the-small-table" approach went well past 5 hours and was only about 70% complete. Thoughts? Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Drastic performance loss in assert-enabled build in HEAD
Early versions of the matview patch had a relisvalid flag in pg_class to determine whether the relation was scannable. The name was chosen based on a similarity to the purpose of indisvalid, although the proliferation of new bools for related issues left me wondering if a "char" would be a better idea. Based on on-list reviews I removed that in favor of basing the state on a zero-length heap file, in an attempt to work better with unlogged matviews. I can easily look back through my development branch to find the commits which made this change and revert them if this approach is preferred. I realize this would need to be Real Soon Now, but before reverting to the earlier code I want to allow a *little* more time for opinions. Responses to particular points below. Tom Lane wrote: > Granting that throwing an error is actually of some use to some people, > I would not think that people would want to turn it on via a command > that throws away the existing view contents altogether, nor turn it off > with a full-throated REFRESH. There is no doubt that in future versions we will want to be able to disallow scans based on other criteria than just whether the data is valid as of *some* point in time. I can't imagine a circumstance under which we would want to allow scans if it doesn't. So, at least for this release and as a default for future releases, I think it makes sense that a matview last CREATEd or REFRESHed WITH NO DATA should not be scannable. Additional knobs for the users to control this can and should be added in future releases. > There are going to need to be ways to incrementally update > matviews, and ways to disable/enable access that are not tied to > a complete rebuild, not to mention being based on user-determined > rather than hard-wired criteria for what's too stale. Absolutely. So far as I know, nobody has ever suggested or expected otherwise. This paper provides a useful summary of techniques for incremental updates, with references to more detailed papers: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.2254&rep=rep1&type=pdf I expect to be working on implementing the most obvious special cases and a "catch-all" general solution for 9.4. Efficient incremental update seems to depend on the ability to have at least one new "hidden" system column, so once we get to a good point for discussing 9.4 issues, I'll be on about that. > If you feel that scannability disable is an absolute must for version 0, > let's invent a matview reloption or some such to implement it and let > users turn it on and off as they wish. That seems a lot more likely > to still be useful two years from now. What about the idea of a relisvalid bool or some "char" column in pg_class? > And if you're absolutely convinced that unlogged matviews mustn't > work as I suggest, we can lose those from 9.3, too. I was loath to do that, but as Nicolas points out, they really aren't that interesting without incremental update. Perhaps it is better to drop those until next release and have a more sophisticated way to deal with invalidation of those -- or as you suggested, just punt them to be empty. (I would hate to do that, but it might be the lesser of evils.) 
> What I'd actually rather see us spending time on right now is making > some provision for incremental updates, which I will boldly propose > could be supported by user-written triggers on the underlying tables > if we only diked out the prohibitions against INSERT/UPDATE/DELETE on > matviews, and allowed them to operate on a matview's contents just like > it was a table. Now admittedly that would foreclose allowing matviews > to be updatable in the updatable-view sense, but that's a feature I > would readily give up if it meant users could build incremental update > mechanisms this year and not two years down the road. I think that should be the last resort, after we have a more automated declarative way of maintaining the majority of cases. Since I don't see too much there where using that technique with matviews would give you more than you could do right now with tables and triggers, hand-coding the incremental maintenance is very low on my list of priorities. In any event, I don't want to rush into any such thing this close to release; that seems to me to be clearly 9.4 material. >> Individual judges have a "dashboard" showing such things as the >> number of court cases which have gone beyond certain thresholds >> without action > If you need 100% accuracy, which is what this scenario appears to be > demanding, matviews are not for you (yet). The existing implementation > certainly is nowhere near satisfying this scenario. No, we're talking about timelines measured in weeks or months. A nightly update is acceptable, although there are occasional gripes by a judge that they would like to see the results of cleaning up such cases sooner than the next morning. An hour latency would probably make them happy. If async increm
Re: [HACKERS] pg_dump selectively ignores extension configuration tables
On Mar 25, 2013, at 12:01 PM, Vibhor Kumar wrote: > > On Mar 25, 2013, at 10:48 AM, Joe Conway wrote: > >> On 03/25/2013 08:12 AM, Vibhor Kumar wrote: >>> Since, nobody has picked this one. >>> >>> If there is no objection,then I can test this patch against 9.1 & 9.2. >> >> Here are diffs for 9.1 and 9.2. The previous email was against 9.3 dev. > > Thanks Joe! > > will test both for 9.1 and 9.2 I did some testing on this patch with 9.1 and 9.2 source code. Testing included the following: 1. Configured PostGIS with 9.1 and 9.2. 2. Verified all switches of pg_dump with the regression db. 3. Checked other extensions, to verify whether this impacts those. Everything is working as expected and I haven't found any issues during my tests with this patch. While testing this patch, some thoughts came to my mind, and I thought to share them on this thread. Would it be possible to have two switches for extensions in pg_dump: 1. Extension dump including user data in extension tables. 2. User data-only dump from extensions. Thanks & Regards, Vibhor Kumar EnterpriseDB Corporation The Enterprise PostgreSQL Company Blog: http://vibhork.blogspot.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Join cost estimates
Stephen Frost writes: > I've been fiddling with this on the very much larger overall database > where this test case came from and have found that hashing the large > table can actually be *faster* and appears to cause a more consistent > and constant amount of disk i/o (which is good). Interesting. > What I'm trying to get at in this overall email is: why in the world is > it so expensive to do hash lookups? perf or oprofile reveal anything? Also, I assume that the cases you are looking at are large enough that even the "small" table doesn't fit in a single hash batch? It could well be that the answer has to do with some bogus or at least unintuitive behavior of the batching process, and it isn't really at all a matter of individual hash lookups being slow. (You never did mention what work_mem setting you're testing, anyway.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Join cost estimates
* Tom Lane (t...@sss.pgh.pa.us) wrote: > > What I'm trying to get at in this overall email is: why in the world is > > it so expensive to do hash lookups? > > perf or oprofile reveal anything? Working on a test case actually- I've got one now: http://snowman.net/~sfrost/test_case2.sql In this example, hashing the large table is actually 2 seconds *faster* than hashing the small table (again, all on my laptop). > Also, I assume that the cases you are looking at are large enough that > even the "small" table doesn't fit in a single hash batch? No, quite the opposite, sorry for not mentioning that before. Either side fits completely into memory w/ a single batch. The explain analyze's that I posted before show that, either way, there's only one batch involved. > (You never did mention what work_mem setting you're testing, anyway.) With the test case above (where I got a 2s faster run time by hashing the big table) used a work_mem of 1GB. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Multi-pass planner
Robert Haas writes: > for estimate_worstcase_fraction. So, when computing the cost of a > path, we'd compute our current expected-case estimate, and also a > worst-case estimate, and then compute the final cost as: There also was the idea for the executor to be able to handle alternate plans and some heuristic to determine that the actual cost of running a plan is much higher than what's been estimated, so much so as to switch to starting from scratch with the other plan instead. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Join cost estimates
* Tom Lane (t...@sss.pgh.pa.us) wrote: > perf or oprofile reveal anything? Here's what we get from oprofile (perhaps not too surprising): Hash the small table / scan the big table: samples cum. samples %cum. % linenr info image name symbol name 167374 16737447.9491 47.9491nodeHash.c:915 postgres ExecScanHashBucket 8504125241524.3624 72.3115mcount.c:60 libc-2.15.so __mcount_internal 28370280785 8.1274 80.4389_mcount.S:33 libc-2.15.so mcount 15856296641 4.5424 84.9814(no location information) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff) 6291 302932 1.8022 86.7836xact.c:682 postgres TransactionIdIsCurrentTransactionId 4555 307487 1.3049 88.0885instrument.c:70 postgres InstrStopNode 3849 311336 1.1027 89.1912heapam.c:711postgres heapgettup_pagemode 3567 314903 1.0219 90.2130nodeHashjoin.c:63 postgres ExecHashJoin Hash the big table / scan the small table: samples cum. samples %cum. % linenr info image name symbol name 112060 11206039.2123 39.2123mcount.c:60 libc-2.15.so __mcount_internal 3654714860712.7886 52.0009nodeHash.c:709 postgres ExecHashTableInsert 3357018217711.7469 63.7477_mcount.S:33 libc-2.15.so mcount 16383198560 5.7328 69.4805(no location information) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff) 13200211760 4.6190 74.0995(no location information) no-vmlinux /no-vmlinux 6345 218105 2.2203 76.3197xact.c:682 postgres TransactionIdIsCurrentTransactionId 5250 223355 1.8371 78.1568nodeHash.c:915 postgres ExecScanHashBucket 4797 228152 1.6786 79.8354heapam.c:711postgres heapgettup_pagemode 4661 232813 1.6310 81.4664aset.c:563 postgres AllocSetAlloc 4588 237401 1.6054 83.0718instrument.c:70 postgres InstrStopNode 3550 240951 1.2422 84.3140memcpy-ssse3-back.S:60 libc-2.15.so __memcpy_ssse3_back 3013 243964 1.0543 85.3684aset.c:1109 postgres AllocSetCheck Looking at the 'Hash the small table / scan the big table', opannotate claims that this is bar far the worst offender: 147588 42.2808 :hashTuple = hashTuple->next; While most of the time in the 'Hash the big table / scan the small table' is in: 34572 12.0975 :hashTuple->next = hashtable->buckets[bucketno]; Neither of those strike me as terribly informative though. To be honest, I've not really played w/ oprofile all that much. Now that I've got things set up to support this, I'd be happy to provide more info if anyone has suggestions on how to get something more useful. It does look like reducing bucket depth, as I outlined before through the use of a 2-level hashing system, might help speed up ExecScanHashBucket, as it would hopefully have very few (eg: 1-2) entries to consider instead of more. Along those same lines, I really wonder if we're being too generous wrt the bucket-depth goal of '10' instead of, say, '1', especially when we've got plenty of work_mem available. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Hash Join cost estimates
* Stephen Frost (sfr...@snowman.net) wrote: > 8504125241524.3624 72.3115mcount.c:60 > libc-2.15.so __mcount_internal > 28370280785 8.1274 80.4389_mcount.S:33 > libc-2.15.so mcount [...] And as a side-note, I'm rebuilding w/o profiling, asserts, etc and will run it again, though I don't really expect the top-hitters to change. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] puzzling JSON bug
Andrew Dunstan writes: > David Wheeler has presented me with a nasty bug case. > If I do this: > select '{"members": { "add": [3, 4]}}'::json #> '{members,add}'; > then I get a crash. > If I comment out the pfree() at json.c:parse_object_field() lines 378-9 > then I get back the right result but instead get a warning like this: > WARNING: problem in alloc set ExprContext: bogus aset link in block > 0x1efaa80, chunk 0x1efb1f0 > I'm not quite sure where I should go looking for what I've done wrong here. Routine array-overrun memory stomp. The chunk header data for "fname"'s alloc chunk is being overwritten here:

Watchpoint 2: *(int *) 1075253088
Old value = 1074925616
New value = -1
0x50fe14 in get_array_start (state=0x40170e88) at jsonfuncs.c:688
688		_state->array_level_index[lex_level] = -1;

It appears that lex_level is 2 but only enough room for 2 entries has been allocated in array_level_index[]. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
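Not the actual fix, but a sketch of the sort of change the diagnosis points at: make sure the per-level bookkeeping is large enough before writing at lex_level. The array_level_alloc field is invented here for illustration.

/*
 * Hypothetical sketch: enlarge the per-level index array whenever the lexer
 * descends deeper than what was allocated, instead of writing past the end
 * of array_level_index[].
 */
if (lex_level >= _state->array_level_alloc)
{
	_state->array_level_alloc *= 2;
	_state->array_level_index = (int *)
		repalloc(_state->array_level_index,
				 _state->array_level_alloc * sizeof(int));
}
_state->array_level_index[lex_level] = -1;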
Re: [HACKERS] puzzling JSON bug
On 04/04/2013 03:39 PM, Tom Lane wrote: Andrew Dunstan writes: David Wheeler has presented me with a nasty bug case. If I do this: select '{"members": { "add": [3, 4]}}'::json #> '{members,add}'; then I get a crash. If I comment out the pfree() at json.c:parse_object_field() lines 378-9 then I get back the right result but instead get a warning like this: WARNING: problem in alloc set ExprContext: bogus aset link in block 0x1efaa80, chunk 0x1efb1f0 I'm not quite sure where I should go looking for what I've done wrong here. Routine array-overrun memory stomp. The chunk header data for "fname"'s alloc chunk is being overwritten here: Watchpoint 2: *(int *) 1075253088 Old value = 1074925616 New value = -1 0x50fe14 in get_array_start (state=0x40170e88) at jsonfuncs.c:688 688 _state->array_level_index[lex_level] = -1; It appears that lex_level is 2 but only enough room for 2 entries has been allocated in array_level_index[]. OK, many thanks, will fix. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] corrupt pages detected by enabling checksums
Andres, Thank you for diagnosing this problem! On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote: > I think the route you quickly sketched is more realistic. That would > remove all knowledge obout XLOG_HINT from generic code hich is a very > good thing, I spent like 15minutes yesterday wondering whether the early > return in there might be the cause of the bug... I like this approach. It may have some performance impact though, because there are a couple extra spinlocks taken, and an extra memcpy. The code looks good to me except that we should be consistent about the page hole -- XLogCheckBuffer is calculating it, but then we copy the entire page. I don't think anything can change the size of the page hole while we have a shared lock on the buffer, so it seems OK to skip the page hole during the copy. Another possible approach is to drop the lock on the buffer and re-acquire it in exclusive mode after we find out we'll need to do XLogInsert. It means that MarkBufferDirtyHint may end up blocking for longer, but would avoid the memcpy. I haven't really thought through the details. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
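For what it's worth, skipping the hole during the copy would amount to something like this (a sketch; whether avoiding the extra bytes is worth the complication is discussed downthread):

/*
 * Sketch only: copy just the used parts of the page, leaving the hole
 * between pd_lower and pd_upper uninitialized in the local copy.
 */
Page		page = BufferGetPage(buffer);
char		copied_buffer[BLCKSZ];
uint16		lower = ((PageHeader) page)->pd_lower;
uint16		upper = ((PageHeader) page)->pd_upper;

memcpy(copied_buffer, page, lower);			/* page header + line pointers */
memcpy(copied_buffer + upper, (char *) page + upper,	/* tuple data */
	   BLCKSZ - upper);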
Re: [HACKERS] Hash Join cost estimates
* Stephen Frost (sfr...@snowman.net) wrote: > It does look like reducing bucket depth, as I outlined before through > the use of a 2-level hashing system, might help speed up > ExecScanHashBucket, as it would hopefully have very few (eg: 1-2) > entries to consider instead of more. Along those same lines, I really > wonder if we're being too generous wrt the bucket-depth goal of '10' > instead of, say, '1', especially when we've got plenty of work_mem > available. Rerunning using a minimally configured build (only --enable-openssl and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1' results in a couple of interesting things- First, the planner actually picks the plan to hash the small table and seqscan the big one. That also, finally, turns out to be *faster* for this test case. explain analyze results here: Hash small table / seqscan big table: http://explain.depesz.com/s/nP1 Hash big table / seqscan small table: http://explain.depesz.com/s/AUv Here's the oprofile reports: Hash small table / seqscan big table: samples cum. samples %cum. % linenr info image name symbol name 3902339023 52.8574 52.8574nodeHash.c:915 postgres ExecScanHashBucket 3743 42766 5.0700 57.9273xact.c:682 postgres TransactionIdIsCurrentTransactionId 3110 45876 4.2126 62.1399nodeHashjoin.c:63 postgres ExecHashJoin 2561 48437 3.4689 65.6088heapam.c:711postgres heapgettup_pagemode 2427 50864 3.2874 68.8962heapam.c:300postgres heapgetpage 2395 53259 3.2441 72.1403heaptuple.c:1028postgres slot_deform_tuple 2395 55654 3.2441 75.3843heaptuple.c:1135postgres slot_getattr 2383 58037 3.2278 78.6122nodeHash.c:786 postgres ExecHashGetHashValue 1811 59848 2.4530 81.0652tqual.c:1044postgres HeapTupleSatisfiesMVCC 1796 61644 2.4327 83.4979execScan.c:111 postgres ExecScan 1298 62942 1.7582 85.2561hashfunc.c:517 postgres hash_uint32 1274 64216 1.7257 86.9817execProcnode.c:356 postgres ExecProcNode 1011 65227 1.3694 88.3511heapam.c:1453 postgres heap_getnext 905 66132 1.2258 89.5770execTuples.c:333postgres ExecStoreTuple 858 66990 1.1622 90.7392fmgr.c:1291 postgres FunctionCall1Coll 835 67825 1.1310 91.8702execQual.c:668 postgres ExecEvalScalarVarFast 834 68659 1.1297 92.mcxt.c:126 postgres MemoryContextReset 818 69477 1.1080 94.1078nodeSeqscan.c:48postgres SeqNext Hash big table / seqscan small table: samples cum. samples %cum. % linenr info image name symbol name 3861238612 41.2901 41.2901nodeHash.c:709 postgres ExecHashTableInsert 7435 46047 7.9507 49.2408(no location information) no-vmlinux /no-vmlinux 4900 50947 5.2399 54.4806aset.c:563 postgres AllocSetAlloc 3803 54750 4.0668 58.5474xact.c:682 postgres TransactionIdIsCurrentTransactionId 3335 58085 3.5663 62.1137heapam.c:711postgres heapgettup_pagemode 2532 60617 2.7076 64.8213nodeHash.c:786 postgres ExecHashGetHashValue 2523 63140 2.6980 67.5193memcpy-ssse3-back.S:60 libc-2.15.so __memcpy_ssse3_back 2518 65658 2.6926 70.2119heaptuple.c:1028postgres slot_deform_tuple 2378 68036 2.5429 72.7549heapam.c:300postgres heapgetpage 2374 70410 2.5387 75.2935heaptuple.c:1135postgres slot_getattr 1852 72262 1.9805 77.2740nodeHash.c:915 postgres ExecScanHashBucket 1831 74093 1.9580 79.2320tqual.c:1044postgres HeapTupleSatisfiesMVCC 1732 75825 1.8521 81.0841heapam.c:1453 postgres heap_getnext 1320 77145 1.4116 82.4957nodeHash.c:76 postgres MultiExecHash 1219 78364 1.3
Re: [HACKERS] Multi-pass planner
On Thu, Apr 4, 2013 at 2:53 PM, Dimitri Fontaine wrote: > Robert Haas writes: >> for estimate_worstcase_fraction. So, when computing the cost of a >> path, we'd compute our current expected-case estimate, and also a >> worst-case estimate, and then compute the final cost as: > > There also was the idea for the executor to be able to handle alternate > plans and some heuristic to determine that the actual cost of running a > plan is much higher than what's been estimated, so much so as to switch > to starting from scratch with the other plan instead. Yeah. The thing is, if the plan has any side effects, that's not really an option. And even if it doesn't, it may throw away a lot of work. One thing we could do is switch from an unparameterized nested loop to a hash join if the outer side turns out to be much larger than expected, but that's only going to benefit a pretty narrow set of use cases. Which is why I think a planner-based approach is probably more promising. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Join cost estimates
Stephen Frost writes: > Looking with opannotate, there's two main hotspots in > ExecScanHashBucket: > 12846 17.4001 :hashTuple = > hashtable->buckets[hjstate->hj_CurBucketNo]; > and > 22100 29.9348 :hashTuple = hashTuple->next; Those are, of course, pretty trivial statements; so the way I read this is that the fundamental slowdown comes from the hash table being large compared to the CPU's cache, so that you're incurring lots of cache misses at these particular fetches. (You might be able to confirm that if you can set oprofile to count cache misses rather than wall clock time.) > I'm certainly curious about those, but I'm also very interested in the > possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable > depending on the work_mem setting. Not sure about that. That would make the hash-bucket-header array larger without changing the size of the rest of the hash table, thus probably making the CPU cache situation worse not better (which would manifest as more time at the first of these two statements relative to the second). Can you add some instrumentation code to get us information about the average/max length of the bucket chains? And maybe try to figure out how many distinct hash values per bucket, which would give us a clue whether your two-level-list idea is worth anything. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
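A rough idea of the instrumentation being asked for above, run once after the hash table is built; this is a throwaway sketch against nodeHash.c's HashJoinTable, not a proposed patch:

/*
 * Throwaway sketch: walk the bucket array and report chain-length stats.
 */
int			i;
int			max_chain = 0;
long		total_tuples = 0;

for (i = 0; i < hashtable->nbuckets; i++)
{
	HashJoinTuple tuple = hashtable->buckets[i];
	int			len = 0;

	while (tuple != NULL)
	{
		len++;
		tuple = tuple->next;
	}
	total_tuples += len;
	if (len > max_chain)
		max_chain = len;
}

elog(LOG, "hashjoin: %d buckets, %ld tuples, avg chain %.2f, max chain %d",
	 hashtable->nbuckets, total_tuples,
	 (double) total_tuples / hashtable->nbuckets, max_chain);

Counting distinct hashvalue entries per chain, as also suggested, would need a little more bookkeeping but follows the same walk over each bucket.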
Re: [PATCH] Exorcise "zero-dimensional" arrays (Was: Re: [HACKERS] Should array_length() Return NULL)
On Thu, Apr 4, 2013 at 11:10 AM, Merlin Moncure wrote: > The only reasonable answer for this (a provably used, non-security, > non-standards violating, non-gross functionality breakage case) is > *zero*. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clang compiler warning on 9.3 HEAD
Will Leinweber wrote: > On ref 8507907 when compiling with clang on os x, I got this warning which > seems like a possible bug. > > I thought to report this because I imagine clang isn't frequently used > day-to-day by most. Ugh. My fault. Yes, this is a bug. I don't see any nice way to convert ObjectType to ObjectClass or the other way around; it seems to me the only simple way to fix this is to add a new function EventTriggerSupportsObjectClass(), which takes ObjectClass instead of ObjectType. Patch attached. Now, it annoys me that we now have three places that know about object types supported by event triggers: there's a large struct of command tag substrings (event_trigger_support), then there's these two functions. It might be better to add ObjectType and ObjectClass entries to the struct, so that only the struct needs to know about that. The problem is that these two functions would have to walk the struct every time, instead of being a simple switch. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services *** a/src/backend/catalog/dependency.c --- b/src/backend/catalog/dependency.c *** *** 210,216 deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel, ObjectAddress *thisobj = targetObjects->refs + i; if ((!(flags & PERFORM_DELETION_INTERNAL)) && ! EventTriggerSupportsObjectType(getObjectClass(thisobj))) { EventTriggerSQLDropAddObject(thisobj); } --- 210,216 ObjectAddress *thisobj = targetObjects->refs + i; if ((!(flags & PERFORM_DELETION_INTERNAL)) && ! EventTriggerSupportsObjectClass(getObjectClass(thisobj))) { EventTriggerSQLDropAddObject(thisobj); } *** a/src/backend/commands/event_trigger.c --- b/src/backend/commands/event_trigger.c *** *** 912,917 EventTriggerSupportsObjectType(ObjectType obtype) --- 912,939 } /* + * Do event triggers support this object class? + */ + bool + EventTriggerSupportsObjectClass(ObjectClass objclass) + { + switch (objclass) + { + case OCLASS_DATABASE: + case OCLASS_TBLSPACE: + case OCLASS_ROLE: + /* no support for global objects */ + return false; + case OCLASS_EVENT_TRIGGER: + /* no support for event triggers on event triggers */ + return false; + default: + break; + } + return true; + } + + /* * Prepare event trigger state for a new complete query to run, if necessary; * returns whether this was done. If it was, EventTriggerEndCompleteQuery must * be called when the query is done, regardless of whether it succeeds or fails *** a/src/include/commands/event_trigger.h --- b/src/include/commands/event_trigger.h *** *** 13,18 --- 13,19 #ifndef EVENT_TRIGGER_H #define EVENT_TRIGGER_H + #include "catalog/dependency.h" #include "catalog/objectaddress.h" #include "catalog/pg_event_trigger.h" #include "nodes/parsenodes.h" *** *** 41,46 extern Oid AlterEventTriggerOwner(const char *name, Oid newOwnerId); --- 42,48 extern void AlterEventTriggerOwner_oid(Oid, Oid newOwnerId); extern bool EventTriggerSupportsObjectType(ObjectType obtype); + extern bool EventTriggerSupportsObjectClass(ObjectClass objclass); extern void EventTriggerDDLCommandStart(Node *parsetree); extern void EventTriggerDDLCommandEnd(Node *parsetree); extern void EventTriggerSQLDrop(Node *parsetree); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] corrupt pages detected by enabling checksums
On 2013-04-04 12:59:36 -0700, Jeff Davis wrote: > Andres, > > Thank you for diagnosing this problem! > > On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote: > > I think the route you quickly sketched is more realistic. That would > > remove all knowledge about XLOG_HINT from generic code which is a very > > good thing, I spent like 15 minutes yesterday wondering whether the early > > return in there might be the cause of the bug... > > I like this approach. It may have some performance impact though, > because there are a couple extra spinlocks taken, and an extra memcpy. I don't think it's really slower. Earlier the code took WALInsertLock every time, even if we ended up not logging anything. That's far more expensive than a single spinlock. And the copy should also only be taken in the case we need to log. So I think we end up ahead of the current state. > The code looks good to me except that we should be consistent about the > page hole -- XLogCheckBuffer is calculating it, but then we copy the > entire page. I don't think anything can change the size of the page hole > while we have a shared lock on the buffer, so it seems OK to skip the > page hole during the copy. I don't think it can change either, but I doubt that there's a performance advantage by not copying the hole. I'd guess the simpler code ends up faster. > Another possible approach is to drop the lock on the buffer and > re-acquire it in exclusive mode after we find out we'll need to do > XLogInsert. It means that MarkBufferDirtyHint may end up blocking for > longer, but would avoid the memcpy. I haven't really thought through the > details. That sounds like it would be prone to deadlocks. So I'd rather not go there. Will write up a patch tomorrow. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
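A rough sketch of the approach under discussion: copy the page while holding only the share lock, then let XLogInsert() work from the stable copy so a concurrent PageSetLSN() cannot change the bytes after the CRC has been computed. This is illustrative only (9.3-era XLogRecData/XLOG_HINT assumed), not the actual patch:

/* Illustrative sketch: WAL-log a full copy of a hinted page, with the copy
 * taken under the share lock so later PageSetLSN() activity cannot
 * invalidate the record CRC. */
static XLogRecPtr
log_hint_page_copy(Buffer buffer)
{
    char        copied_buffer[BLCKSZ];
    XLogRecData rdata;

    memcpy(copied_buffer, BufferGetPage(buffer), BLCKSZ);

    rdata.data = copied_buffer;
    rdata.len = BLCKSZ;
    rdata.buffer = InvalidBuffer;   /* bypass the backup-block machinery */
    rdata.buffer_std = false;
    rdata.next = NULL;

    return XLogInsert(RM_XLOG_ID, XLOG_HINT, &rdata);
}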
Re: [PATCH] Exorcise "zero-dimensional" arrays (Was: Re: [HACKERS] Should array_length() Return NULL)
Robert Haas writes: > On Thu, Apr 4, 2013 at 11:10 AM, Merlin Moncure wrote: >> The only reasonable answer for this (a provably used, non-security, >> non-standards violating, non-gross functionality breakage case) is >> *zero*. > +1. Well, if we're going to take that hard a line on it, then we can't change anything about array data storage or the existing functions' behavior; which leaves us with either doing nothing at all, or inventing new functions that have saner behavior while leaving the old ones in place. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Join cost estimates
On Thu, Apr 04, 2013 at 04:16:12PM -0400, Stephen Frost wrote: > * Stephen Frost (sfr...@snowman.net) wrote: > > It does look like reducing bucket depth, as I outlined before through > > the use of a 2-level hashing system, might help speed up > > ExecScanHashBucket, as it would hopefully have very few (eg: 1-2) > > entries to consider instead of more. Along those same lines, I really > > wonder if we're being too generous wrt the bucket-depth goal of '10' > > instead of, say, '1', especially when we've got plenty of work_mem > > available. > > Rerunning using a minimally configured build (only --enable-openssl > and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1' > results in a couple of interesting things- > > First, the planner actually picks the plan to hash the small table and > seqscan the big one. That also, finally, turns out to be *faster* for > this test case. > > ... > > I'm certainly curious about those, but I'm also very interested in the > possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable > depending on the work_mem setting. It's only used in > ExecChooseHashTableSize, so while making it variable or depending on > work_mem could slow planning down a bit, it's not a per-tuple cost item. > +1 for adjusting this based on work_mem value. Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
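A minimal sketch of what a work_mem-aware setting could look like; choose_ntup_per_bucket() is a hypothetical helper and the thresholds and overhead estimate are purely illustrative (the real NTUP_PER_BUCKET is a fixed #define consulted by ExecChooseHashTableSize()):

/* Hypothetical: pick a bucket-depth target from the estimated hash table
 * size relative to work_mem, instead of a fixed NTUP_PER_BUCKET of 10. */
static int
choose_ntup_per_bucket(double ntuples, int tupwidth, long work_mem_kb)
{
    double  hash_bytes = ntuples * (tupwidth + 48);     /* crude per-tuple overhead guess */
    double  mem_bytes  = (double) work_mem_kb * 1024.0;

    if (hash_bytes <= mem_bytes / 4)
        return 1;       /* plenty of memory: shallow buckets, cheap probes */
    if (hash_bytes <= mem_bytes)
        return 4;
    return 10;          /* tight on memory: keep the historical behaviour */
}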
Re: [HACKERS] CREATE EXTENSION BLOCKS
"David E. Wheeler" writes: > On Apr 3, 2013, at 11:41 AM, Alvaro Herrera wrote: >> Right -- an extension is not considered to live within a schema, they >> are database-global. The objects might live in a particular schema (if >> it is "relocatable"), and there's support to move those to a different >> schema, but this doesn't affect the extension itself. > Thanks. I humbly submit this patch to help prevent silly questions like this > in the future. I think this should be addressed in extend.sgml not only on the CREATE EXTENSION reference page. After thinking awhile I came up with the attached wording. Further wordsmithing anyone? regards, tom lane diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml index 672d0df..bc1cd59 100644 *** a/doc/src/sgml/extend.sgml --- b/doc/src/sgml/extend.sgml *** *** 359,364 --- 359,370 extension.) Also notice that while a table can be a member of an extension, its subsidiary objects such as indexes are not directly considered members of the extension. + Another important point is that schemas can belong to extensions, but not + vice versa: an extension as such has an unqualified name and does not + exist within any schema. The extension's member objects, + however, will belong to schemas whenever appropriate for their object + types. It may or may not be appropriate for an extension to own the + schema(s) its member objects are within. diff --git a/doc/src/sgml/ref/create_extension.sgml b/doc/src/sgml/ref/create_extension.sgml index 4f3b9a5..9c9bf6f 100644 *** a/doc/src/sgml/ref/create_extension.sgml --- b/doc/src/sgml/ref/create_extension.sgml *** CREATE EXTENSION [ IF NOT EXISTS ] + + Remember that the extension itself is not considered to be within any + schema: extensions have unqualified names that must be unique + database-wide. But objects belonging to the extension can be within + schemas. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] corrupt pages detected by enabling checksums
On Thu, Apr 4, 2013 at 5:30 AM, Simon Riggs wrote: > On 4 April 2013 02:39, Andres Freund wrote: > > > If by now the first backend has proceeded to PageSetLSN() we are writing > > different data to disk than the one we computed the checksum of > > before. Boom. > > Right, so nothing else we were doing was wrong, that's why we couldn't > spot a bug. The problem is that we aren't replaying enough WAL because > the checksum on the WAL record is broken. > This brings up a pretty frightening possibility to me, unrelated to data checksums. If a bit gets twiddled in the WAL file due to a hardware issue or a "cosmic ray", and then a crash happens, automatic recovery will stop early at the failed WAL checksum with an innocuous-looking message. The system will start up but will be invisibly inconsistent, and will proceed to overwrite that portion of the WAL file which contains the old data (real data that would have been necessary to reconstruct, once the corruption is finally realized) with an end-of-recovery checkpoint record and continue to chew up real data from there. I don't know a solution here, though, other than trusting your hardware. Changing timelines upon ordinary crash recovery, not just media recovery, seems excessive but also seems to be exactly what timelines were invented for, right? Back to the main topic here, Jeff Davis mentioned earlier "You'd still think this would cause incorrect results, but...". I didn't realize the significance of that until now. It does produce incorrect query results. I was just bailing out before detecting them. Once I specify ignore_checksum_failure=1, my test harness complains bitterly about the data not being consistent with what the Perl program knows it is supposed to be. > I missed out on doing that with XLOG_HINT records, so the WAL CRC can > be incorrect because the data is scanned twice; normally that would be > OK because we have an exclusive lock on the block, but with hints we > only have share lock. So what we need to do is take a copy of the > buffer before we do XLogInsert(). > > Simple patch to do this attached for discussion. (Not tested). > We might also do this by modifying the WAL record to take the whole > block and bypass the BkpBlock mechanism entirely. But that's more work > and doesn't seem like it would be any cleaner. I figure let's solve the > problem first then discuss which approach is best. > I've tested your patch and it seems to do the job. It has survived far longer than unpatched ever did, with neither checksum warnings nor complaints of inconsistent data. (I haven't analyzed the code much, just the results, and leave the discussion of the best approach to others) Thanks, Jeff
Re: [HACKERS] Clang compiler warning on 9.3 HEAD
Alvaro Herrera writes: > Now, it annoys me that we now have three places that know about object > types supported by event triggers: there's a large struct of command tag > substrings (event_trigger_support), then there's these two functions. > It might be better to add ObjectType and ObjectClass entries to the > struct, so that only the struct needs to know about that. The problem > is that these two functions would have to walk the struct every time, > instead of being a simple switch. Yeah. For the moment I think efficiency trumps having the information in just one place, ie I agree with doing it like this. If we find we're constantly forgetting to fix the functions when we need to, it'd be time enough to revisit the decision. One thing you could do that might make it less likely we'd miss things is to have the switches explicitly cover all the possible enum members, instead of defaulting the majority of them as you do here. I know when I think about adding a new object type, I tend to choose the most similar existing type and grep for references to it. Likely you could even draft the code so that most compilers would warn about an enum value not covered in the switch. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
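For illustration, the pattern being suggested is an exhaustive switch with no default, so that gcc/clang (-Wswitch, enabled by -Wall) complains as soon as a new enum member is added but not handled; a generic sketch, not the actual event-trigger code:

typedef enum { OCLASS_FOO, OCLASS_BAR, OCLASS_BAZ } ExampleClass;

static bool
supports_class(ExampleClass objclass)
{
    switch (objclass)
    {
        case OCLASS_FOO:
        case OCLASS_BAR:
            return true;
        case OCLASS_BAZ:
            return false;
            /* no default: adding a member to ExampleClass now produces a
             * compiler warning here until this switch is updated */
    }
    return false;       /* unreachable, but keeps the compiler quiet */
}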
[HACKERS] matview scannability rehash (was Re: Drastic performance loss in assert-enabled build in HEAD)
Subject updated to account for the wider topics now appearing. On Wed, Apr 03, 2013 at 05:49:18PM -0400, Tom Lane wrote: > What I'd actually rather see us spending time on right now is making > some provision for incremental updates, which I will boldly propose > could be supported by user-written triggers on the underlying tables > if we only diked out the prohibitions against INSERT/UPDATE/DELETE on > matviews, and allowed them to operate on a matview's contents just like > it was a table. Now admittedly that would foreclose allowing matviews > to be updatable in the updatable-view sense, but that's a feature I > would readily give up if it meant users could build incremental update > mechanisms this year and not two years down the road. I can't see taking MVs in the direction of allowing arbitrary DML; that's what tables are for. Users wishing to do that should keep using current methods: CREATE VIEW mv_impl AS SELECT ...; CREATE TABLE mv AS SELECT * FROM mv_impl; -- REFRESH analogue: fancier approaches exist BEGIN; TRUNCATE mv; INSERT INTO mv SELECT * FROM mv_impl; COMMIT; If anything, I'd help these users by introducing mechanisms for obtaining a TRUNCATE;INSERT with the bells and whistles of REFRESH MATERIALIZED VIEW. Namely, bulk index rebuilds, skipping WAL, and frozen tuples. > > ... Making sure that > > the heap has at least one page if data has been generated seems > > like a not-entirely-unreasonable way to do that, although there > > remains at least one vacuum bug to fix if we keep it, in addition > > to Tom's concerns. > > No. This is an absolute disaster. It's taking something we have always > considered to be an irrelevant implementation detail and making it into > user-visible DDL state, despite the fact that it doesn't begin to satisfy > basic transactional behaviors. We *need* to get rid of that aspect of > things. If you must have scannability state in version 0, okay, but > it has to be a catalog property not this. In large part, this ended up outside the catalogs due to key limitations of the startup process. This isn't the first time we've arranged an unusual dance for this reason, and it probably won't be the last. Sure, we could take out unlogged MVs to evade the problem, but re-adding them will mean either restoring relfilenode-based bookkeeping or attacking the startup process limitation directly. There exist fundamental challenges to a direct attack, like the potential inconsistency of system catalogs themselves. We could teach pg_relation_is_scannable() that unlogged MVs are always non-scannable during recovery, then somehow update system catalogs in all databases at the end of recovery. Not a project for 9.3, and I wouldn't be surprised to see it mushroom in complexity. The currently-committed approach is a good one given the applicable constraints. A slight variation on the committed approach would be to add a "_scannable" relation fork. The fork would either be absent (not scannable if an MV) or empty (possibly-scannable). Create it in CREATE MATERIALIZED VIEW sans WITH NO DATA and in REFRESH MATERIALIZED VIEW. Remove it in TRUNCATE. When the startup process reinitializes an unlogged relation, it would also remove any _scannable fork. This has a few advantages over the current approach: VACUUM won't need a special case, and pg_upgrade will be in a better position to blow away all traces if we introduce a better approach. The disadvantage is an extra inode per healthy MV. 
(Though it does not lead to a 9.3 solution, I'll note that an always-present relation metapage would help here.) Note that I said "possibly-scannable" -- the relfilenode-based indicator (whether the committed approach or something else) doesn't need to remain the only input to the question of scannability. If 9.5 introduces the concept of age-based scannability expiration, the applicable timestamp could go in pg_class, and pg_relation_is_scannable() could check both that and the relfilenode-based indicator. Concerning the original $SUBJECT, I would look at fixing the performance problem by having pg_relation_is_scannable() use an algorithm like this: 1. Grab the pg_class entry from syscache. If it's not found, return NULL. 2. If it's not a matview, return false. 3. Lock the matview and try to open a relcache entry. Return NULL on failure. 4. Return the scannability as reported by the relcache. This boils down to the CASE statement you noted upthread, except putting the fast-exit logic in the function rather than in its caller(s). nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
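A sketch of those four steps as C, assuming a relcache-level RelationIsScannable() test exists; the names here are illustrative rather than a claim about the committed code, and most headers are omitted for brevity:

#include "postgres.h"
#include "fmgr.h"
/* catalog, syscache and relcache headers omitted for brevity */

Datum
pg_relation_is_scannable(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    HeapTuple   tuple;
    Form_pg_class classform;
    Relation    rel;
    bool        result;

    /* 1. syscache lookup; relation gone => NULL */
    tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(tuple))
        PG_RETURN_NULL();
    classform = (Form_pg_class) GETSTRUCT(tuple);

    /* 2. anything that is not a matview => false */
    if (classform->relkind != RELKIND_MATVIEW)
    {
        ReleaseSysCache(tuple);
        PG_RETURN_BOOL(false);
    }
    ReleaseSysCache(tuple);

    /* 3. lock and open the relation; failure => NULL */
    rel = try_relation_open(relid, AccessShareLock);
    if (rel == NULL)
        PG_RETURN_NULL();

    /* 4. report whatever the relcache says */
    result = RelationIsScannable(rel);      /* assumed relcache-level test */
    relation_close(rel, AccessShareLock);

    PG_RETURN_BOOL(result);
}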
Re: [HACKERS] Drastic performance loss in assert-enabled build in HEAD
On Thu, Apr 04, 2013 at 12:28:01PM +0200, Nicolas Barbier wrote: > 2013/4/3 Tom Lane : > > And if you're absolutely convinced that unlogged matviews mustn't work as I > > suggest, we can lose those from 9.3, too. > > +1. Having unlogged matviews without having incremental updates yet, > isn't super useful anyway. I would have surmised the opposite: since an unlogged MV requires a full refresh at unpredictable moments, logged MVs will be preferred where a refresh is prohibitively expensive. Why might unlogged-MV applications desire incremental updates more acutely than logged-MV applications? -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] matview scannability rehash (was Re: Drastic performance loss in assert-enabled build in HEAD)
Noah Misch writes: > On Wed, Apr 03, 2013 at 05:49:18PM -0400, Tom Lane wrote: >> No. This is an absolute disaster. It's taking something we have always >> considered to be an irrelevant implementation detail and making it into >> user-visible DDL state, despite the fact that it doesn't begin to satisfy >> basic transactional behaviors. We *need* to get rid of that aspect of >> things. If you must have scannability state in version 0, okay, but >> it has to be a catalog property not this. > In large part, this ended up outside the catalogs due to key limitations of > the startup process. Yeah, I realize that there's no other (easy) way to make unlogged matviews reset to an invalid state on crash, but that doesn't make this design choice less of a disaster. It boxes us into something that's entirely unable to support transitions between scannable and unscannable states by any means short of a complete rewrite of the matview contents; which seems fundamentally incompatible with any sort of incremental update scenario. And I remain of the opinion that it's going to box us into not being able to fix the problems because of pg_upgrade on-disk-compatibility issues. We will be far better off to drop unlogged matviews until we can come up with a better design. If that's so far off that no one can see it happening, well, that's tough. Leaving the door open for incremental maintenance is more important. > A slight variation on the committed approach would be to add a "_scannable" > relation fork. Not very transaction-safe, I think (consider crash midway through a transaction that adds or removes the fork), and in any case orders of magnitude more expensive than looking at a pg_class field. This really needs to be catalog state, not filesystem state. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] matview scannability rehash (was Re: Drastic performance loss in assert-enabled build in HEAD)
Tom Lane wrote: > Noah Misch writes: > > A slight variation on the committed approach would be to add a "_scannable" > > relation fork. > > Not very transaction-safe, I think (consider crash midway through a > transaction that adds or removes the fork), and in any case orders of > magnitude more expensive than looking at a pg_class field. This really > needs to be catalog state, not filesystem state. We could revive the pg_class_nt patch proposed a decade ago, perhaps ... -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch to add \watch to psql
Will Leinweber writes: > Here is an updated patch that addresses several of the points brought up so > far, such as the sleep, internationalization banner, and zero wait check, > and it removes the premature input check. I whacked this around some more, added basic docs, and committed it. > Unfortunately rl_clear_screen() is not included at all in libedit, causing > compilation to fail, and I was completely unable to find a way to > distinguish libedit from readline on OS X. It tries extraordinarily hard to > pretend that it's readline. Instead falling back to simple control > characters to clear the screen worked very well on both linux and OS X. I took that out; "works on the two cases I tried" does not mean "portable". It's possible we could do something involving having configure check for rl_clear_screen() etc, but that seems like more work than is justified, not to mention that the results would then be platform-dependent *by design*. Frankly I kinda prefer the behavior without a screen clear anyway; though this may just prove that I'm not accustomed to using the original "watch". Anyway, that's open to a followup patch if anybody is sufficiently set on doing it differently. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
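For reference, the "simple control characters" fallback Will describes presumably amounts to emitting the standard VT100/ANSI clear-and-home sequence; a minimal sketch (not part of the committed patch):

#include <stdio.h>

static void
clear_screen_ansi(void)
{
    /* ESC[2J clears the screen, ESC[H homes the cursor (VT100/ANSI) */
    fputs("\033[2J\033[H", stdout);
    fflush(stdout);
}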
Re: [HACKERS] patch to add \watch to psql
On Thu, Apr 4, 2013 at 5:04 PM, Tom Lane wrote: > > I whacked this around some more, added basic docs, and committed it. > > Thanks! > > Unfortunately rl_clear_screen() is not included at all in libedit, > causing > > compilation to fail, and I was completely unable to find a way to > > distinguish libedit from readline on OS X. It tries extraordinarily hard > to > > pretend that it's readline. Instead falling back to simple control > > characters to clear the screen worked very well on both linux and OS X. > > I took that out; "works on the two cases I tried" does not mean "portable". > Completely understandable. I'm very excited to have helped contribute something, however small this was, to the project. Thanks again, Will
Re: [HACKERS] corrupt pages detected by enabling checksums
On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote: > This brings up a pretty frightening possibility to me, unrelated to > data checksums. If a bit gets twiddled in the WAL file due to a > hardware issue or a "cosmic ray", and then a crash happens, automatic > recovery will stop early with the failed WAL checksum with > an innocuous looking message. The system will start up but will be > invisibly inconsistent, and will proceed to overwrite that portion of > the WAL file which contains the old data (real data that would have > been necessary to reconstruct, once the corruption is finally realized > ) with an end-of-recovery checkpoint record and continue to chew up > real data from there. I've been worried about that for a while, and I may have even seen something like this happen before. We could perhaps do some checks, but in general it seems hard to solve without flushing some data to two different places. For example, you could flush WAL, and then update an LSN stored somewhere else indicating how far the WAL has been written. Recovery could complain if it gets an error in the WAL before that point. But obviously, that makes WAL flushes expensive (in many cases, about twice as expensive). Maybe it's not out of the question to offer that as an option if nobody has a better idea. Performance-conscious users could place the extra LSN on an SSD or NVRAM or something; or maybe use commit_delay or async commits. It would only need to store a few bytes. Streaming replication mitigates the problem somewhat, by being a second place to write WAL. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
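A minimal sketch of the extra flush Jeff describes, written with plain POSIX calls rather than real xlog.c code; the side-file name, layout and error handling are assumptions for illustration only:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* After WAL up to 'flushed_to' has been successfully fsync'd, persist that
 * position in a tiny side file.  During crash recovery, a WAL CRC failure
 * at an LSN below the stored value would then be reported as corruption
 * rather than treated as the normal end of WAL. */
static int
record_flushed_lsn(const char *path, uint64_t flushed_to)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (pwrite(fd, &flushed_to, sizeof(flushed_to), 0) != (ssize_t) sizeof(flushed_to) ||
        fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}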
Re: [HACKERS] corrupt pages detected by enabling checksums
On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote: > I don't think its really slower. Earlier the code took WalInsertLock > everytime, even if we ended up not logging anything. Thats far more > epensive than a single spinlock. And the copy should also only be taken > in the case we need to log. So I think we end up ahead of the current > state. Good point. > > The code looks good to me except that we should be consistent about the > > page hole -- XLogCheckBuffer is calculating it, but then we copy the > > entire page. I don't think anything can change the size of the page hole > > while we have a shared lock on the buffer, so it seems OK to skip the > > page hole during the copy. > > I don't think it can change either, but I doubt that there's a > performance advantage by not copying the hole. I'd guess the simpler > code ends up faster. I was thinking more about the WAL size, but I don't have a strong opinion. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] corrupt pages detected by enabling checksums
Jeff Davis writes: > On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote: >> This brings up a pretty frightening possibility to me, unrelated to >> data checksums. If a bit gets twiddled in the WAL file due to a >> hardware issue or a "cosmic ray", and then a crash happens, automatic >> recovery will stop early with the failed WAL checksum with >> an innocuous looking message. The system will start up but will be >> invisibly inconsistent, and will proceed to overwrite that portion of >> the WAL file which contains the old data (real data that would have >> been necessary to reconstruct, once the corruption is finally realized >> ) with an end-of-recovery checkpoint record and continue to chew up >> real data from there. > I've been worried about that for a while, and I may have even seen > something like this happen before. We could perhaps do some checks, but > in general it seems hard to solve without writing flushing some data to > two different places. For example, you could flush WAL, and then update > an LSN stored somewhere else indicating how far the WAL has been > written. Recovery could complain if it gets an error in the WAL before > that point. > But obviously, that makes WAL flushes expensive (in many cases, about > twice as expensive). > Maybe it's not out of the question to offer that as an option if nobody > has a better idea. Performance-conscious users could place the extra LSN > on an SSD or NVRAM or something; or maybe use commit_delay or async > commits. It would only need to store a few bytes. At least on traditional rotating media, this is only going to perform well if you dedicate two drives to the purpose. At which point you might as well just say "let's write two copies of WAL". Or maybe three copies, so that you can take a vote when they disagree. While this is not so unreasonable on its face for ultra-high-reliability requirements, I can't escape the feeling that we'd just be reinventing software RAID. There's no reason to think that we can deal with this class of problems better than the storage system can. > Streaming replication mitigates the problem somewhat, by being a second > place to write WAL. Yeah, if you're going to do this at all it makes more sense for the redundant copies to be on other machines. So the questions that that leads to are how smart is our SR code about dealing with a master that tries to re-send WAL regions it already sent, and what a slave ought to do in such a situation if the new data doesn't match. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCH] Exorcise "zero-dimensional" arrays (Was: Re: [HACKERS] Should array_length() Return NULL)
On 5 April 2013 07:43, Tom Lane wrote: > Well, if we're going to take that hard a line on it, then we can't > change anything about array data storage or the existing functions' > behavior; which leaves us with either doing nothing at all, or > inventing new functions that have saner behavior while leaving the > old ones in place. And then we are in the awkward position of trying to explain the differences in behaviour between the old and new functions ... presumably with a dash of "for historical reasons" and a sprinkling of "to preserve compatibility" in every other paragraph. The other suggestion that had been tossed around elsewhere upthread was inventing a new type that serves the demand for a straightforward mutable list, which has exactly one dimension, and which may be sensibly empty. Those few who are interested in dimensions >= 2 could keep on using "arrays", with all their backwards-compatible silliness intact, and everybody else could migrate to "lists" at their leisure. I don't hate the latter idea from a user perspective, but from a developer perspective I suspect there are valid objections to be made. Cheers, BJ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] corrupt pages detected by enabling checksums
On Thu, 2013-04-04 at 21:06 -0400, Tom Lane wrote: > I can't escape the feeling that we'd just be reinventing software RAID. > There's no reason to think that we can deal with this class of problems > better than the storage system can. The goal would be to reliably detect a situation where WAL that has been flushed successfully was later corrupted; not necessarily to recover when we find it. Unless it's something like ZFS, I don't think most software RAID will reliably detect corruption. A side benefit is that it would also help catch bugs like this in the future. I'm not advocating for this particular solution, but I do concur with Jeff Janes that it's a little bit scary and that we can entertain some ideas about how to mitigate it. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCH] Exorcise "zero-dimensional" arrays (Was: Re: [HACKERS] Should array_length() Return NULL)
Brendan Jurd writes: > The other suggestion that had been tossed around elsewhere upthread > was inventing a new type that serves the demand for a straightforward > mutable list, which has exactly one dimension, and which may be > sensibly empty. Those few who are interested in dimensions >= 2 could > keep on using "arrays", with all their backwards-compatible silliness > intact, and everybody else could migrate to "lists" at their leisure. > I don't hate the latter idea from a user perspective, but from a > developer perspective I suspect there are valid objections to be made. The real problem with that is that the existing arrays have glommed onto the syntax that is both most natural and SQL-spec-required. I don't think there is a lot of room to shove in a different kind of critter there. (There's been a remarkable lack of attention to the question of spec compliance in this thread, btw. Surely the standard has something to say on the matter of zero-length arrays?) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCH] Exorcise "zero-dimensional" arrays (Was: Re: [HACKERS] Should array_length() Return NULL)
On 5 April 2013 13:04, Tom Lane wrote: > (There's been a remarkable lack of attention to the question > of spec compliance in this thread, btw. Surely the standard has > something to say on the matter of zero-length arrays?) From 4.10 in my draft copy of "Foundation", arrays are one of two "collection" types (the other being multisets), and: "A collection is a composite value comprising zero or more elements, each a value of some data type DT" "The number of elements in C is the cardinality of C" "An array is a collection A in which each element is associated with exactly one ordinal position in A. If n is the cardinality of A, then the ordinal position p of an element is an integer in the range 1 (one) ≤ p ≤ n." The language specifically allows for zero elements, and does not contemplate multiple dimensions. The specification for the array constructor syntax (6.36) and array element reference by subscript (6.23) also make it fairly clear that only 1-D arrays were being considered. I'd say we've already gone way off-menu by having multidims. A more compliant approach would have been to implement arrays as 1-D only, and then maybe have a separate thing ("matrices"?) for multidims. While I was in there I noticed CARDINALITY, which would be pretty easy to add and would at least provide a more productive way to get the "real" length of an array without disrupting existing functionality: "<cardinality expression> ::= CARDINALITY <left paren> <collection value expression> <right paren>" Cheers, BJ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCH] Exorcise "zero-dimensional" arrays (Was: Re: [HACKERS] Should array_length() Return NULL)
Brendan Jurd writes: > On 5 April 2013 13:04, Tom Lane wrote: >> (There's been a remarkable lack of attention to the question >> of spec compliance in this thread, btw. Surely the standard has >> something to say on the matter of zero-length arrays?) > The language specifically allows for zero elements, and does not > contemplate multiple dimensions. The specification for the array > constructor syntax (6.36) and array element reference by subscript > (6.23) also make it fairly clear that only 1-D arrays were being > considered. > I'd say we've already gone way off-menu by having multidims. Yeah, we knew that. I don't have a problem with seeing multidim arrays as an extension to the standard though. The point is that the behavior for the 1-D, default-lower-bound case ought to match the standard. > While I was in there I noticed CARDINALITY, which would be pretty easy > to add and would at least provide a more productive way to get the > "real" length of an array without disrupting existing functionality: Yeah, that would at least fix the null-result-for-empty-array problem for that particular functionality. Still, this is ammunition for the position that null results for empty arrays are just broken. BTW ... if you check the archives you will find that we had cardinality() for a short while, and removed it before 8.4 release, because we couldn't agree on what it ought to return when given a multi-dimensional array. I'm afraid that issue is still unresolved. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why there is a PG_GETARG_UINT32 and PG_RETURN_UINT32?
On Thursday, April 04, 2013 8:30 PM Rodrigo Barboza wrote: > Hi guys. > I am wondering when I can use the PG_GETARG_UINT32 and PG_RETURN_UINT32. > If postgres has no unsigned int type, what is the use of these macros? They are mainly used for contrib module functionality or some built-in functions which are not exposed. For example, bt_page_items() receives a relation name (text) and a block number (int), but internally the block number is uint32, as the maximum block number can be 0xFFFFFFFE. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
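As a sketch of how the macros get used, here is a hypothetical contrib-style function (not an existing one); at the SQL level the argument would be declared int4, while the C code reads it through the unsigned macro:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(demo_block_number);

/* Hypothetical example only: treat an int4 argument as an unsigned
 * block number inside the C function. */
Datum
demo_block_number(PG_FUNCTION_ARGS)
{
    uint32      blkno = PG_GETARG_UINT32(0);

    /* ... inspect the block here ... */

    PG_RETURN_UINT32(blkno);
}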
Re: [HACKERS] Why there is a PG_GETARG_UINT32 and PG_RETURN_UINT32?
I am creating my own uint32 type and came across these functions. But my args are always of type my_int and one of the signed int types, as Postgres doesn't have unsigned. Could I use those functions in operations between those types? I can't see a place for this; I don't know if I am missing something. On Fri, Apr 5, 2013 at 1:12 AM, Amit Kapila wrote: > On Thursday, April 04, 2013 8:30 PM Rodrigo Barboza wrote: > > > Hi guys. > > I am wondering when I can use the PG_GETARG_UINT32 and PG_RETURN_UINT32. > > If postgres has no unsigned int type, what is the use of these macros? > > They are mainly used for contrib module functionality or some built-in > functions which are not exposed. > For example, bt_page_items() receives relation name (text) and block number > (int), but internally the block number > is uint32 as max blocks can be 0xFFFFFFFE. > > With Regards, > Amit Kapila. > >
Re: [HACKERS] Multi-pass planner
On Friday, April 05, 2013 1:59 AM Robert Haas wrote: > On Thu, Apr 4, 2013 at 2:53 PM, Dimitri Fontaine > wrote: > > Robert Haas writes: > >> for estimate_worstcase_fraction. So, when computing the cost of a > >> path, we'd compute our current expected-case estimate, and also a > >> worst-case estimate, and then compute the final cost as: > > > > There also was the idea for the executor to be able to handle > alternate > > plans and some heuristic to determine that the actual cost of running > a > > plan is much higher than what's been estimated, so much so as to > switch > > to starting from scratch with the other plan instead. > > Yeah. The thing is, if the plan has any side effects, that's not > really an option. And even if it doesn't, it may throw away a lot of > work. Why throw away all the work? It could try to repair the plan, or, if plan repair is not possible, it could keep multiple plans and next time try to choose the best among the available ones. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCH] Exorcise "zero-dimensional" arrays (Was: Re: [HACKERS] Should array_length() Return NULL)
On 5 April 2013 15:05, Tom Lane wrote: > Brendan Jurd writes: >> While I was in there I noticed CARDINALITY, which would be pretty easy >> to add and would at least provide a more productive way to get the >> "real" length of an array without disrupting existing functionality: > > Yeah, that would at least fix the null-result-for-empty-array problem > for that particular functionality. Still, this is ammunition for the > position that null results for empty arrays are just broken. > > BTW ... if you check the archives you will find that we had > cardinality() for a short while, and removed it before 8.4 release, > because we couldn't agree on what it ought to return when given a > multi-dimensional array. I'm afraid that issue is still unresolved. Well for what it's worth I would expect cardinality() to return the total number of elements in the array (per ArrayGetNItems). It's consistent with the spec's identification of an array as a "collection". You can chunk the elements into dimensions however you want, but it's still a collection of elements, and the cardinality is still the number of elements. The "nesting" interpretation doesn't accord with our internal representation, nor with our requirement that multidim arrays be regular, nor with the fact that we can't put an array of texts inside an array of ints. Our array input syntaxes for multidim arrays look nest-ish but what they produce is not nested. Cheers, BJ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
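Under that reading, a cardinality() built on ArrayGetNItems would behave as in the sketch below; the function did not exist at the time of writing, so the results shown are only the expected ones:

SELECT cardinality(ARRAY[1,2,3]);           -- 3
SELECT cardinality(ARRAY[[1,2],[3,4]]);     -- 4: every element counts, dimensions don't
SELECT cardinality('{}'::int[]);            -- 0 for an empty array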
Re: [HACKERS] Page replacement algorithm in buffer cache
On Thursday, April 04, 2013 6:12 PM Robert Haas wrote: > On Wed, Apr 3, 2013 at 9:49 PM, Greg Smith > wrote: > > On 4/2/13 11:54 AM, Robert Haas wrote: > >> But, having said that, I still think the best idea is what Andres > >> proposed, which pretty much matches my own thoughts: the bgwriter > >> needs to populate the free list, so that buffer allocations don't > have > >> to wait for linear scans of the buffer array. > > > > I was hoping this one would make it to a full six years of being on > the TODO > > list before it came up again, missed it by a few weeks. The funniest > part > > is that Amit even submitted a patch on this theme a few months ago > without > > much feedback: > > http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs > > That stalled where a few things have, on a) needing more regression > test > > workloads, and b) wondering just what the deal with large > shared_buffers > > setting degrading performance was. > > Those are impressive results. I think we should seriously consider > doing something like that for 9.4. TBH, although more workloads to > test is always better, I don't think this problem is so difficult that > we can't have some confidence in a theoretical analysis. If I read > the original thread correctly (and I haven't looked at the patch > itself), the proposed patch would actually invalidate buffers before > putting them on the freelist. That effectively amounts to reducing > shared_buffers, so workloads that are just on the edge of what can fit > in shared_buffers will be harmed, and those that benefit incrementally > from increased shared_buffers will be as well. > > What I think we should do instead is collect the buffers that we think > are evictable and stuff them onto the freelist without invalidating > them. When a backend allocates from the freelist, it can double-check > that the buffer still has usage_count 0. The odds should be pretty > good. But even if we sometimes notice that the buffer has been > touched again after being put on the freelist, we haven't expended all > that much extra effort, and that effort happened mostly in the > background. If we just put the buffer on the freelist, then the next time it gets allocated directly from the buffer hash table, who will remove it from the freelist? Or do you think that, in BufferAlloc, when a buffer is found via the hash table, it should also check whether it is on the freelist and remove it if so? With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
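A rough sketch of the double-check Robert describes above, with illustrative names (pop_from_freelist() is a hypothetical helper; the header-lock and usage_count handling mirrors what the existing clock-sweep allocation already does):

/* Try to satisfy an allocation from the freelist; a NULL return means the
 * caller should fall back to the clock sweep. */
static volatile BufferDesc *
get_buffer_from_freelist(void)
{
    for (;;)
    {
        volatile BufferDesc *buf = pop_from_freelist();     /* hypothetical helper */

        if (buf == NULL)
            return NULL;            /* freelist empty */

        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;             /* still evictable; caller keeps the header lock */

        /* touched again since it was added to the freelist: skip it */
        UnlockBufHdr(buf);
    }
}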
Re: [HACKERS] Why there is a PG_GETARG_UINT32 and PG_RETURN_UINT32?
On Friday, April 05, 2013 10:00 AM Rodrigo Barboza wrote: > I am creating my own uint32 type and faced this function. > But my args are always of type my_int and one of the signed int types as postgres doesn't hava unsigned. > Could I use those functions in operations between those types? It should not be a problem if your signed type doesn't have any negative value. > I can't see a place for this, I don't know if I am missing something >On Fri, Apr 5, 2013 at 1:12 AM, Amit Kapila wrote: >>On Thursday, April 04, 2013 8:30 PM Rodrigo Barboza wrote: >> Hi guys. >> I am wondering when I can use the PG_GETARG_UINT32 and PG_RETURN_UINT32. >> If postgres has no unsigned int type, what is the use of these macros? > They are mainly used for contrib module functionality or some built-in > functions which are not exposed. > For example, bt_page_items() receives relation name (text) and block number > (int), but internally the block number > is unit32 as max blocks can be oxFFFE. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers