Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Sat, Mar 23, 2013 at 10:20 PM, Andres Freund wrote:
> On 2013-03-22 07:38:36 +0900, Michael Paquier wrote:
> > Is someone planning to provide additional feedback about this patch
> > at some point?
>
> Yes, now that I have returned from my holidays - or well, am returning
> from them, I do plan to. But it should probably get some implementation
> level review from somebody but Fujii and me...

Yeah, it would be good to have an extra pair of fresh eyes looking at
those patches.

Thanks,
--
Michael
[HACKERS] Re: [GENERAL] timeofday() and clock_timestamp() produce different results when casting to timestamptz
On 03/23/2013 08:16 AM, Tom Lane wrote:
> Adrian Klaver writes:
> > Seems the time zone info still thinks Moscow time is being setting
> > forward an hour for DST when in fact the time remains constant
> > through the year.
>
> I think the discrepancy is between this in timezone/data/europe:
>
> Zone Europe/Moscow  2:30:20 -       LMT     1880
>                     2:30    -       MMT     1916 Jul  3 # Moscow Mean Time
>                     2:30:48 Russia  %s      1919 Jul  1  2:00
>                     3:00    Russia  MSK/MSD 1922 Oct
>                     2:00    -       EET     1930 Jun 21
>                     3:00    Russia  MSK/MSD 1991 Mar 31  2:00s
>                     2:00    Russia  EE%sT   1992 Jan 19  2:00s
>                     3:00    Russia  MSK/MSD 2011 Mar 27  2:00s
>                     4:00    -       MSK
>
> and this in timezone/tznames/Default:
>
> MSD     14400 D    # Moscow Daylight Time (Europe/Moscow)
> MSK     10800      # Moscow Time (Europe/Moscow)
>
> We really need to figure out a way to update the tznames data
> automatically, or at least notice when it's become inconsistent with
> the underlying Olson database.
>
> regards, tom lane

So a temporary fix would be to go into share/timezonesets/Default and change:

MSK     10800      # Moscow Time

to

MSK     14400      # Moscow Time

and then you get:

test=> set time zone 'Europe/Moscow';
SET
test=> select timeofday(), clock_timestamp();
              timeofday              |        clock_timestamp
-------------------------------------+-------------------------------
 Sun Mar 24 03:50:45.066537 2013 MSK | 2013-03-24 03:50:45.066582+04
(1 row)

test=> select timeofday()::timestamptz, clock_timestamp()::timestamptz;
           timeofday           |        clock_timestamp
-------------------------------+-------------------------------
 2013-03-24 03:50:52.485092+04 | 2013-03-24 03:50:52.485188+04
(1 row)

--
Adrian Klaver
adrian.kla...@gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] avoid buffer underflow in errfinish()
A side question: at src/backend/storage/lmgr/proc.c:1150, is there a null
pointer dereference for `autovac'? There is a null pointer check
`autovac != NULL', but the pointer is already dereferenced earlier when
initializing `autovac_pgxact'. Is this null pointer check redundant, or
should we move the dereference `autovac->pgprocno' after the check?

Thanks.

On Sat, Mar 23, 2013 at 6:38 PM, Xi Wang wrote:
> CHECK_STACK_DEPTH checks if errordata_stack_depth is negative.
> Move the dereference of &errordata[errordata_stack_depth] after
> the check to avoid out-of-bounds read.
> ---
>  src/backend/utils/error/elog.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
> index 3a211bf..47a0a8b 100644
> --- a/src/backend/utils/error/elog.c
> +++ b/src/backend/utils/error/elog.c
> @@ -393,13 +393,15 @@ void
>  errfinish(int dummy,...)
>  {
>      ErrorData  *edata = &errordata[errordata_stack_depth];
> -    int         elevel = edata->elevel;
> +    int         elevel;
>      MemoryContext oldcontext;
>      ErrorContextCallback *econtext;
>
>      recursion_depth++;
>      CHECK_STACK_DEPTH();
>
> +    elevel = edata->elevel;
> +
>      /*
>       * Do processing in ErrorContext, which we hope has enough reserved
>       * space to report an error.
> --
> 1.7.10.4
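The proc.c question above reduces to a check-after-use ordering problem. A
minimal standalone sketch of the safe shape (the struct, function name, and
the -1 "no process" return value are illustrative, not the actual
PostgreSQL code):

```c
#include <stddef.h>

/* Illustrative reduction of the pattern questioned above: if a pointer
 * may be NULL, the null check must come before any dereference,
 * otherwise the check is dead code (or the crash happens first). */
typedef struct
{
    int pgprocno;
} PGPROC;

int
autovac_pgprocno(PGPROC *autovac)
{
    if (autovac != NULL)            /* check first ... */
        return autovac->pgprocno;   /* ... then dereference */
    return -1;                      /* hypothetical "no autovacuum" value */
}
```

If the pointer genuinely cannot be NULL at that point, the right fix is
instead to drop the redundant check so the invariant is stated clearly.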
[HACKERS] [PATCH] avoid buffer underflow in errfinish()
CHECK_STACK_DEPTH checks if errordata_stack_depth is negative.
Move the dereference of &errordata[errordata_stack_depth] after
the check to avoid an out-of-bounds read.
---
 src/backend/utils/error/elog.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 3a211bf..47a0a8b 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -393,13 +393,15 @@ void
 errfinish(int dummy,...)
 {
     ErrorData  *edata = &errordata[errordata_stack_depth];
-    int         elevel = edata->elevel;
+    int         elevel;
     MemoryContext oldcontext;
     ErrorContextCallback *econtext;

     recursion_depth++;
     CHECK_STACK_DEPTH();

+    elevel = edata->elevel;
+
     /*
      * Do processing in ErrorContext, which we hope has enough reserved space
      * to report an error.
--
1.7.10.4
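The patch's point can be shown with a reduced standalone version of the
stack-depth pattern. Names mirror elog.c, but the array size and the
"nothing pending" return value are made up for the sketch:

```c
#include <stddef.h>

/* With an empty error stack, errordata_stack_depth is -1, so taking
 * errordata[errordata_stack_depth].elevel before validating the index
 * reads one element before the array. Safe shape: validate first,
 * dereference second. */
#define ERRORDATA_STACK_SIZE 5

typedef struct
{
    int elevel;
} ErrorData;

ErrorData errordata[ERRORDATA_STACK_SIZE];
int errordata_stack_depth = -1;     /* -1 means "no error pending" */

int
current_elevel(void)
{
    if (errordata_stack_depth < 0 ||
        errordata_stack_depth >= ERRORDATA_STACK_SIZE)
        return 0;                   /* hypothetical "nothing pending" value */
    return errordata[errordata_stack_depth].elevel;
}
```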
Re: [HACKERS] Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
Thanks - I've added it here:
https://commitfest.postgresql.org/action/patch_view?id=1096

I've also attached a revised version that makes IGNORE and RESPECT
UNRESERVED keywords (following the pattern of NULLS_FIRST and NULLS_LAST).

Nick

On 23 March 2013 14:34, Tom Lane wrote:
> Nicholas White writes:
> > The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for
> > lead, lag, [...]. This is not implemented in PostgreSQL
> > (http://www.postgresql.org/docs/devel/static/functions-window.html).
> > I've had a go at implementing this, and I've attached the resulting
> > patch. It's not finished yet, but I was hoping to find out if my
> > solution is along the right lines.
>
> Since we're trying to get 9.3 to closure, this patch probably isn't
> going to get much attention until the 9.4 development cycle starts
> (in a couple of months, likely). In the meantime, please add it to
> the next commitfest list so we remember to come back to it:
> https://commitfest.postgresql.org/action/commitfest_view?id=18
>
> One comment just from a quick eyeball look is that we really hate
> adding new keywords that aren't UNRESERVED, because that risks
> breaking existing applications. Please see if you can refactor the
> grammar to make those new entries unreserved.
>
> regards, tom lane

Attachment: lead-lag-ignore-nulls.patch
Re: [HACKERS] Page replacement algorithm in buffer cache
On Fri, Mar 22, 2013 at 4:06 AM, Atri Sharma wrote:
> Not yet, I figured this might be a problem and am designing test cases
> for the same. I would be glad for some help there please.

Perhaps this isn't the help you were looking for, but I spent a long time
looking into this a few years ago. Then I stopped and decided to work on
other things. I would recommend you do so too.

If I have to struggle to come up with an artificial test case that shows
that there is a problem, then why should I believe that there actually is
a problem? If you take a well-known problem (like, say, bad performance
at shared_buffers > 8GB, or even lower on Windows) and create an
artificial test case to exercise and investigate it, that is one thing.
But why invent pathological test cases with no known correspondence to
reality? There are plenty of real problems to work on, and some of them
are just as intellectually interesting as the artificial ones.

My conclusions were:

1) If everything fits in shared_buffers, then the replacement policy
doesn't matter.

2) If shared_buffers is much smaller than RAM (the most common case, I
believe), then what mostly matters is your OS's replacement policy, not
pgsql's. Not much a pgsql hacker can do about this, other than turn into
a kernel hacker.

3) If little of the highly-used data fits in RAM, then any non-absurd
replacement policy is about as good as any other non-absurd one.

4) If most, but not quite all, of the highly-used data fits in
shared_buffers, and shared_buffers takes most of RAM (or at least most of
the RAM not already needed for other things like work_mem and
executables), then the replacement policy matters a lot. But different
policies suit different workloads, and there is little reason to think we
can come up with a way to choose between them. (Also, under these
conditions, performance is very chaotic. You can run the same algorithm
for a long time, and it can suddenly switch from good to bad or the other
way around, and then stay in that new mode for a long time.) Also, even
if you come up with a good algorithm, if you make the data set 20%
smaller or 20% larger, it is no longer a good algorithm.

5) Having buffers enter with usage_count=0 rather than 1 would probably
be slightly better most of the time under the conditions described in 4,
but there is no way to get enough evidence of this over enough conditions
to justify making a change. And besides, how often do people run with
shared_buffers being most of RAM, and the hot data just barely not
fitting in it?

If you want some known problems that are in this general area, we have:

1) If all data fits in RAM but not shared_buffers, and you have a very
large number of CPUs and a read-only or read-mostly workload, then
BufFreelistLock can be a major bottleneck. (But on an Amazon high-CPU
instance, I did not see this very much. I suspect the degree of the
problem depends a lot on whether you have many sockets with a few CPUs
each, versus one chip with many CPUs.) This is very easy to come up with
model cases for; pgbench -S -c8 -j8, for example, can often show it.

2) A major reason that people run with shared_buffers much lower than RAM
is that performance seems to suffer with shared_buffers > 8GB under
write-heavy workloads, even with spread-out checkpoints. This is
frequently reported as a real-world problem, but as far as I know it has
never been reduced to a simple reproducible test case. (Although there
was a recent thread, maybe "High CPU usage / load average after upgrading
to Ubuntu 12.04", that I thought might be relevant to this. I haven't had
the time to seriously study the thread, or the hardware to investigate it
myself.)

Cheers,

Jeff
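For reference, the clocksweep mechanics debated in this thread boil down
to a loop like the following toy sketch (not PostgreSQL's freelist.c; the
buffer count is arbitrary, and pinning is ignored):

```c
#include <stddef.h>

/* Toy clock sweep: as the hand passes each buffer, evict it if its
 * usage_count is already zero, otherwise age it by one and move on.
 * Whether new buffers should enter with usage_count 0 or 1 - the point
 * under discussion - just changes how soon the hand can reach zero. */
#define NBUFFERS 8

typedef struct
{
    int usage_count;
} Buffer;

Buffer buffers[NBUFFERS];
int clock_hand = 0;

int
clocksweep_victim(void)
{
    for (;;)
    {
        Buffer *buf = &buffers[clock_hand];
        int     victim = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (buf->usage_count == 0)
            return victim;          /* evict: count already zero */
        buf->usage_count--;         /* otherwise age it and continue */
    }
}
```

Note the loop never terminates if every buffer keeps a positive count;
the real code bounds the sweep and handles pinned buffers.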
Re: [HACKERS] Page replacement algorithm in buffer cache
Jeff Janes writes:
> I'm more convinced in the other direction, new pages should enter with 0
> rather than with 1. I think that the argument that a new buffer needs to
> be given more of an opportunity to get used again is mostly bogus.

IIRC, the argument for starting at 1 not 0 is that otherwise a new page
might have an infinitesimally small lifespan, if the clock sweep should
reach it just after it gets entered into the buffers. By starting at 1,
the uncertainty in a new page's lifespan runs from 1 to 2 sweep times,
not 0 to 1 sweep time.

I think though that this argument only holds water if the buffer didn't
get found via the clock sweep to start with --- otherwise, it ought to
have just about one clock sweep of time before the sweep comes back to
it. It does apply to buffers coming off the freelist, though. Thus, if
we were to get rid of the freelist then maybe we could change the
starting usage_count ... but whether that's a good idea in itself is
pretty uncertain.

regards, tom lane
Re: [HACKERS] Page replacement algorithm in buffer cache
On Fri, Mar 22, 2013 at 7:27 PM, Ants Aasma wrote:
> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure wrote:
>> well if you do a non-locking test first you could at least avoid some
>> cases (and, if you get the answer wrong, so what?) by jumping to the
>> next buffer immediately. if the non locking test comes good, only
>> then do you do a hardware TAS.
>>
>> you could in fact go further and dispense with all locking in front of
>> usage_count, on the premise that it's only advisory and not a real
>> refcount. so you only then lock if/when it's time to select a
>> candidate buffer, and only then when you did a non locking test first.
>> this would of course require some amusing adjustments to various
>> logical checks (usage_count <= 0, heh).
>
> Moreover, if the buffer happens to miss a decrement due to a data
> race, there's a good chance that the buffer is heavily used and
> wouldn't need to be evicted soon anyway. (if you arrange it to be a
> read-test-inc/dec-store operation then you will never go out of
> bounds)

yeah. There's something to be said for having an upper bound on the
length of time to get a page out (except in the special case when most of
them are pinned). Right now, any contention on a buffer header for any
reason can shut down buffer allocation, and that's just not good. It's
obviously not very likely to happen, but I think it can and does happen.
The more I think about it, the more I think it's a bad idea to spin
during buffer allocation for any reason, ever.

> However, clocksweep and usage_count maintenance is not what is
> causing contention because that workload is distributed. The issue is
> pinning and unpinning. There we need an accurate count and there are
> some pages like index roots that get hit very heavily. Things to do
> there would be in my opinion convert to a futex based spinlock so when
> there is contention it doesn't completely kill performance and then
> try to get rid of the contention. Converting to lock-free pinning
> won't help much here as what is killing us here is the cacheline
> bouncing.

Yup -- futexes are another way to go. They are linux specific though.

> One way to get rid of contention is the buffer nailing idea that
> Robert came up with. If some buffer gets so hot that maintaining
> refcount on the buffer header leads to contention, promote that buffer
> to a nailed status, let everyone keep their pin counts locally and
> sometime later revisit the nailing decision and if necessary convert
> pins back to the buffer header.

Yeah, this is a more general (albeit more complicated) solution and would
likely be fantastic. Is it safe to assume that refcounting is the only
likely cause of contention?

> One other interesting idea I have seen is closeable scalable nonzero
> indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
> use a tree structure to dynamically stripe access to the shared lock
> counter when contention is detected. Downside is that considerable
> amount of shared memory is needed so there needs to be some way to
> limit the resource usage. This is actually somewhat isomorphic to the
> nailing idea.
>
> The issue with the current buffer management algorithm is that it
> seems to scale badly with increasing shared_buffers. I think the
> improvements should concentrate on finding out what is the problem
> there and figuring out how to fix it. A simple idea to test would be
> to just partition shared buffers along with the whole clock sweep
> machinery into smaller ones, like the buffer mapping hash tables
> already are. This should at the very least reduce contention for the
> clock sweep even if it doesn't reduce work done per page miss.
>
> [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

I'll have to take a look. Removing *all spinning* from page allocation
though feels like it might be worthwhile to test (got to give some bonus
points for it being a very local change and simple to implement). I
wonder if with more shared buffers you tend to sweep more buffers per
allocation. (IIRC Jeff J was skeptical of that.)

merlin
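Merlin's non-locking-test-first idea could be sketched with C11 atomics
roughly as follows. This is illustrative only: PostgreSQL's buffer
headers don't use stdatomic, and the struct and function names are
invented for the sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_int usage_count;     /* advisory, not a real refcount */
} BufferDesc;

/* Peek at usage_count with no lock; only attempt an atomic
 * read-modify-write when the cheap test says the buffer might matter.
 * A lost or spurious update is harmless because the count is advisory,
 * and a single CAS (rather than fetch_sub) keeps it in bounds. */
bool
maybe_evictable(BufferDesc *buf)
{
    int count = atomic_load_explicit(&buf->usage_count,
                                     memory_order_relaxed);

    if (count > 0)
    {
        /* try to age it once; if someone raced us, just move on */
        atomic_compare_exchange_strong_explicit(&buf->usage_count,
                                                &count, count - 1,
                                                memory_order_relaxed,
                                                memory_order_relaxed);
        return false;           /* skip to the next buffer */
    }
    return true;                /* candidate: now take the real lock */
}
```

The key property is that the hot path (heavily used buffer) never takes a
lock at all; only potential victims pay for synchronization.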
Re: [HACKERS] Page replacement algorithm in buffer cache
On Thu, Mar 21, 2013 at 9:51 PM, Atri Sharma wrote:
> Hello all,
>
> Sorry if this is a naive question.
>
> I was going through Greg Smith's slides on the buffer cache
> (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
> When going through the page replacement algorithm that we use, i.e. the
> clocksweep algorithm, I felt a potential problem in our current system.
>
> Specifically, when a new entry is allocated in the buffer, its
> USAGE_COUNT is set to 1. On each sweep of the algorithm, the
> USAGE_COUNT is decremented and an entry whose USAGE_COUNT becomes
> zero is replaced.

It is replaced when the usage_count is already found to be zero, not when
it is made zero.

> I feel that this could lead to a bias towards replacement of
> relatively younger pages in the cache over older pages. An entry
> which has just entered the cache with USAGE_COUNT=1 could be replaced
> soon, but it may be needed frequently in the near future,

Well, it may be needed. But then again, it may not be needed. And that
old page also may be needed frequently in the future (which is far more
likely than for a new page--after all, an old page likely got old for a
reason). The best evidence that it will be needed again is that it
actually has been needed again.

I'm more convinced in the other direction: new pages should enter with 0
rather than with 1. I think that the argument that a new buffer needs to
be given more of an opportunity to get used again is mostly bogus. You
cannot bully shared_buffers into being larger than it is. If all the
incoming buffers get "more opportunity", that just means the buffer-clock
ticks twice as fast, and really none of them has more opportunity when
you measure that opportunity against an outside standard (wall time, or
workload accomplished). All of our children cannot be above average.

Cheers,

Jeff
Re: [HACKERS] Page replacement algorithm in buffer cache
> Partitioned clock sweep strikes me as a bad idea... you could certainly
> get unlucky and end up with a lot of hot stuff in one partition.
>
> Another idea that's been brought up in the past is to have something in
> the background keep a minimum number of buffers on the free list. That's
> how the OS VM systems I'm familiar with work, so there's precedent for
> it.
>
> I recall there were at least some theoretical concerns about this, but I
> don't remember if anyone actually tested the idea.

One way to handle this could be to have dynamic membership of pages in
the partitions. Based on activity for a page, it could be moved to
another partition. In this manner, we *could* distribute the hot and not
so hot buffer pages, and hence it could help.

Regards,

Atri
l'apprenant
Re: [HACKERS] Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
Nicholas White writes:
> The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for
> lead, lag, [...]. This is not implemented in PostgreSQL
> (http://www.postgresql.org/docs/devel/static/functions-window.html).
> I've had a go at implementing this, and I've attached the resulting
> patch. It's not finished yet, but I was hoping to find out if my
> solution is along the right lines.

Since we're trying to get 9.3 to closure, this patch probably isn't
going to get much attention until the 9.4 development cycle starts
(in a couple of months, likely). In the meantime, please add it to
the next commitfest list so we remember to come back to it:
https://commitfest.postgresql.org/action/commitfest_view?id=18

One comment just from a quick eyeball look is that we really hate
adding new keywords that aren't UNRESERVED, because that risks
breaking existing applications. Please see if you can refactor the
grammar to make those new entries unreserved.

regards, tom lane
Re: [HACKERS] SDP query optimizer
On 23-03-2013 10:15, Andres Freund wrote:
> I just want to mention that unless you skew the statistics for the
> individual tables from their empty/default state this mostly measures a
> pretty degenerate case where optima are very rare and not very
> differentiated. That's a useful thing to test, but not to have as the
> target to optimize for.
>
> So it might be interesting to run that thing with some table
> stats/contents stats set up.

Yes, the search space obtained from this experiment may be much simpler
than a real case. Beyond this experiment, I can construct the classical
queries used to evaluate this kind of algorithm, such as stars, cliques,
chains and cycles. Beyond these queries I have no idea how to test it
further.

Regards,
Adriano Lange
[HACKERS] Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
> The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for
> lead, lag, [...]. This is not implemented in PostgreSQL
> (http://www.postgresql.org/docs/devel/static/functions-window.html)

I've had a go at implementing this, and I've attached the resulting
patch. It's not finished yet, but I was hoping to find out if my solution
is along the right lines. In particular, I'm storing the ignore-nulls
flag in the frameOptions of a window function definition, and am adding a
function to windowapi.h to get at these options. I'm keeping the last
non-null value in WinGetPartitionLocalMemory (which I hope is the right
place), but I'm not using any of the *GetDatum macros to access it.

An example of my change's behaviour:

nwhite=# select *, lag(num,0) ignore nulls over (order by generate_series) from
nwhite-# (select generate_series from generate_series(0,10)) s
nwhite-# left outer join
nwhite-# numbers n
nwhite-# on (s.generate_series = n.num);
 generate_series | num | lag
-----------------+-----+-----
               0 |     |
               1 |   1 |   1
               2 |     |   1
               3 |     |   1
               4 |   4 |   4
               5 |   5 |   5
               6 |     |   5
               7 |     |   5
               8 |     |   5
               9 |   9 |   9
              10 |     |   9
(11 rows)

I'd find this feature really useful, so I hope you can help me get my
patch to a contributable state. Thanks -

Nick

Attachment: lead-lag-ignore-nulls.patch
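For the lag(num, 0) IGNORE NULLS case shown above, the semantics amount
to carrying the most recent non-null value forward through the partition.
A standalone C sketch of that behaviour (illustrative only -- not the
patch's code; nulls are modelled with a separate flag array):

```c
#include <stddef.h>

/* Each output row gets the most recent non-null input value at or
 * before it; out_null marks rows before the first non-null value.
 * This mirrors the "keep the last non-null value in partition-local
 * memory" approach described in the mail above. */
void
lag0_ignore_nulls(const int *vals, const int *is_null,
                  int *out, int *out_null, size_t n)
{
    int have_last = 0;
    int last = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!is_null[i])
        {
            last = vals[i];
            have_last = 1;
        }
        out[i] = last;
        out_null[i] = !have_last;   /* null until first non-null seen */
    }
}
```

Running it over the num column from the example reproduces the lag column
of the psql output: null, 1, 1, 1, 4, 5, 5, 5, 5, 9, 9.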
Re: [HACKERS] [GENERAL] timeofday() and clock_timestamp() produce different results when casting to timestamptz
Adrian Klaver writes:
> Seems the time zone info still thinks Moscow time is being setting
> forward an hour for DST when in fact the time remains constant through
> the year.

I think the discrepancy is between this in timezone/data/europe:

Zone Europe/Moscow  2:30:20 -       LMT     1880
                    2:30    -       MMT     1916 Jul  3 # Moscow Mean Time
                    2:30:48 Russia  %s      1919 Jul  1  2:00
                    3:00    Russia  MSK/MSD 1922 Oct
                    2:00    -       EET     1930 Jun 21
                    3:00    Russia  MSK/MSD 1991 Mar 31  2:00s
                    2:00    Russia  EE%sT   1992 Jan 19  2:00s
                    3:00    Russia  MSK/MSD 2011 Mar 27  2:00s
                    4:00    -       MSK

and this in timezone/tznames/Default:

MSD     14400 D    # Moscow Daylight Time (Europe/Moscow)
MSK     10800      # Moscow Time (Europe/Moscow)

We really need to figure out a way to update the tznames data
automatically, or at least notice when it's become inconsistent with the
underlying Olson database.

regards, tom lane
Re: [HACKERS] Enabling Checksums
On 2013-03-23 15:36:03 +0200, Ants Aasma wrote:
> On Sat, Mar 23, 2013 at 3:10 PM, Andres Freund wrote:
> >> Andres showed that switching out the existing CRC for zlib's would
> >> result in 8-30% increase in INSERT-SELECT speed
> >> (http://www.postgresql.org/message-id/201005202227.49990.and...@anarazel.de)
> >> with the speeded up CRC still showing up as 10% of the profile. So I
> >> guess another 5% speedup by doing the CRC 8 bytes at a time instead of
> >> the used 4. And another couple % by using Fletcher or SIMD.
> >
> > I am not sure the considerations for WAL are the same as for page
> > checksums - the current WAL code only computes the CRCs in rather small
> > chunks, so very pipelineable algorithms/implementations don't
> > necessarily show the same benefit for WAL as they do for page
> > checksums...
>
> Sure, but I think that WAL checksums are not a big overhead in that case
> anyway.

I have seen profiles that indicate rather the contrary... Even in the
optimal case of no FPWs, a single heap_insert() results in the CRC being
computed in 5 steps or so, 4 of them over potentially noncontiguous
pointer-addressed memory. If you add an index or two, where the situation
is the same, the slowdown is not all that surprising.

Greetings,

Andres Freund
Re: [HACKERS] pg_dump/restore syntax checking bug?
On 03/22/2013 10:13 PM, Josh Kupershmidt wrote:
> On Fri, Mar 22, 2013 at 9:35 PM, Joshua D. Drake wrote:
>> postgres@jd-laptop:~$ pg_restore -d test -P 'by(),hello()' foo.sqlc
>
> Note, the pg_restore doc makes no mention of trying to squeeze multiple
> function prototypes into a single argument as you've done here, or of
> using multiple -P flags.

It appears we need better syntax checking.

> Can't really argue with this. But if you think these pg_restore examples
> are bad, try this gem:
>
> reindexdb --table='foo; ALTER ROLE limited WITH superuser'

That is HORRIBLE! Looks like our base utilities need some attention.

jD
Re: [HACKERS] Enabling Checksums
On Sat, Mar 23, 2013 at 3:10 PM, Andres Freund wrote:
>> Andres showed that switching out the existing CRC for zlib's would
>> result in 8-30% increase in INSERT-SELECT speed
>> (http://www.postgresql.org/message-id/201005202227.49990.and...@anarazel.de)
>> with the speeded up CRC still showing up as 10% of the profile. So I
>> guess another 5% speedup by doing the CRC 8 bytes at a time instead of
>> the used 4. And another couple % by using Fletcher or SIMD.
>
> I am not sure the considerations for WAL are the same as for page
> checksums - the current WAL code only computes the CRCs in rather small
> chunks, so very pipelineable algorithms/implementations don't
> necessarily show the same benefit for WAL as they do for page
> checksums...

Sure, but I think that WAL checksums are not a big overhead in that case
anyway. I should point out that getting the SIMD algorithm to not be a
loss for small variable-sized workloads will take a considerable amount
of effort and code, whereas it's quite easy for pipelined CRC32 and
Fletcher (or should I say Adler, as we want to use the mod 65521
variant).

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
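For reference, the "Fletcher with mod 65521" variant Ants mentions is
essentially Adler-32 (65521 being the largest prime below 2^16). A plain
byte-at-a-time sketch -- a real implementation would defer the expensive
modulo across many bytes and vectorize, and this is not the algorithm
PostgreSQL ultimately adopted:

```c
#include <stdint.h>
#include <stddef.h>

/* Adler-style Fletcher checksum: two running sums reduced mod 65521,
 * folded into one 32-bit value (s2 in the high half, s1 in the low). */
uint32_t
adler_fletcher(const uint8_t *data, size_t len)
{
    uint32_t s1 = 1;
    uint32_t s2 = 0;

    for (size_t i = 0; i < len; i++)
    {
        s1 = (s1 + data[i]) % 65521;
        s2 = (s2 + s1) % 65521;
    }
    return (s2 << 16) | s1;
}
```

Because s2 accumulates s1 at every position, the checksum is sensitive to
byte order as well as byte values, unlike a simple sum.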
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-03-22 07:38:36 +0900, Michael Paquier wrote:
> Is someone planning to provide additional feedback about this patch at
> some point?

Yes, now that I have returned from my holidays - or well, am returning
from them, I do plan to. But it should probably get some implementation
level review from somebody but Fujii and me...

Greetings,

Andres Freund
Re: [HACKERS] SDP query optimizer
On 2013-03-22 20:35:43 -0300, Adriano Lange wrote:
> Hi all,
>
> I have developed a new query optimizer for PostgreSQL and I would like
> to share it with the community. The optimizer's name is Sampling and
> Dynamic Programming (SDP). I put it into a plugin developed some years
> ago, named LJQO:
>
> https://github.com/alange0001/ljqo.git
>
> This plugin was configured to compile only against PostgreSQL 9.2.
> However, I guess it may be easily adjusted for other versions of
> PostgreSQL.
>
> I would be glad for any feedback about SDP or even about LJQO.
>
> I have some numbers about SDP in comparison with GEQO. If interested,
> see a diff between the two ".out2" files attached. The schema and query
> are from a previous email posted by Andres Freund to this list.

I just want to mention that unless you skew the statistics for the
individual tables from their empty/default state this mostly measures a
pretty degenerate case where optima are very rare and not very
differentiated. That's a useful thing to test, but not to have as the
target to optimize for.

So it might be interesting to run that thing with some table
stats/contents stats set up.

Greetings,

Andres Freund
Re: [HACKERS] Enabling Checksums
>>> Results for pgbench scale 100:
>>> No checksums: tps = 56623.819783
>>> Fletcher checksums: tps = 55282.222687 (1.024x slowdown)
>>> CRC checksums: tps = 50571.324795 (1.120x slowdown)
>>> SIMD checksums: tps = 56608.888985 (1.000x slowdown)
>>>
>>> So to conclude, the 3 approaches:
>>
>> Great analysis. Still a tough choice. +1
>>
>> One thing that might be interesting is to look at doing SIMD for both
>> data and WAL. I wonder if that would be a noticeable speedup for WAL
>> full-page writes? That would give greater justification for the extra
>> work it will take (intrinsics/ASM), and it would be a nice win for
>> non-checksum users.
>
> Andres showed that switching out the existing CRC for zlib's would
> result in 8-30% increase in INSERT-SELECT speed
> (http://www.postgresql.org/message-id/201005202227.49990.and...@anarazel.de)
> with the speeded up CRC still showing up as 10% of the profile. So I
> guess another 5% speedup by doing the CRC 8 bytes at a time instead of
> the used 4. And another couple % by using Fletcher or SIMD.

I am not sure the considerations for WAL are the same as for page
checksums - the current WAL code only computes the CRCs in rather small
chunks, so very pipelineable algorithms/implementations don't necessarily
show the same benefit for WAL as they do for page checksums...

And even if the checksumming were changed to compute the CRC in larger
chunks - a very sensible thing imo - it would still be over relatively
small sizes in many workloads.

Greetings,

Andres Freund
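For context, computing the WAL CRC in chunks relies on the CRC being
incrementally updatable. A minimal bytewise CRC-32 (zlib polynomial,
reflected) showing that chunked and one-shot computation agree -- purely
illustrative, not PostgreSQL's CRC implementation, and far slower than a
table-driven or slice-by-8 version:

```c
#include <stdint.h>
#include <stddef.h>

/* Update a running CRC-32 with another chunk of bytes. Chaining works
 * because the pre/post inversion cancels between calls, exactly as in
 * zlib's crc32(): crc32_update(crc32_update(0, a), b) equals
 * crc32_update(0, a || b). */
uint32_t
crc32_update(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    crc = ~crc;
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}
```

The pipelining point upthread is that the inner loop here carries a tight
data dependency; slice-by-4/8 variants break that dependency over larger
inputs, which is why they help page checksums more than small WAL chunks.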
Re: [HACKERS] Enabling Checksums
On Sat, Mar 23, 2013 at 5:14 AM, Craig Ringer wrote:
> Making zero a "not checksummed" magic value would significantly detract
> from the utility of checksums IMO.

FWIW, using a 65521 modulus to compress larger checksums into 16 bits
will leave 14 non-zero values unused.

Regards,
Ants Aasma
Re: [HACKERS] Page replacement algorithm in buffer cache
On Sat, Mar 23, 2013 at 6:04 AM, Jim Nasby wrote:
> Partitioned clock sweep strikes me as a bad idea... you could certainly
> get unlucky and end up with a lot of hot stuff in one partition.

Surely that is not worse than having everything in a single partition.
Given a decent partitioning function, it's very unlikely that more than a
few of the hottest buffers would end up in a single partition.

> Another idea that's been brought up in the past is to have something in
> the background keep a minimum number of buffers on the free list. That's
> how the OS VM systems I'm familiar with work, so there's precedent for
> it.
>
> I recall there were at least some theoretical concerns about this, but I
> don't remember if anyone actually tested the idea.

Yes, having bgwriter do the actual cleaning up seems like a good idea.
The whole bgwriter infrastructure will need some serious tuning. There
are many things that could be shifted to the background if we knew it
could keep up, like hint bit setting on dirty buffers being flushed out.
But again, we have the issue of needing good tests to see where the
changes hurt.

Regards,
Ants Aasma
Re: [HACKERS] Page replacement algorithm in buffer cache
On Sat, Mar 23, 2013 at 6:29 AM, Atri Sharma wrote: > One way to distribute memory contention in case of spinlocks could be > to utilize the fundamentals of NUMA architecture. Specifically, we can > let the contending backends spin on local flags instead of on the buffer > header flags directly. As access to local cache lines is much cheaper > and faster than memory locations which are far away in NUMA, we could > potentially reduce the memory overhead for a specific line and reduce > the overall overheads as well. This is not specific to NUMA architectures (which by now means practically all multiprocessor machines); even multicore machines pay a cost for bouncing cache lines between cores. The locks don't even have to be local; it's good enough to raise the probability that each contending backend hits a different lock, provided we take care that the locks don't share cache lines. IMO that's the whole point of striping locks: the critical section is usually cheaper than the cost of getting the cache line into an exclusive state. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
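The layout point being made — that striped locks must not share cache lines — can be sketched like this. The struct, stripe count, and hash are illustrative assumptions, not anything from the PostgreSQL source: each stripe's lock is padded out to its own cache line, so backends spinning on different stripes never bounce the same line between cores.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64
#define N_STRIPES  16

/* Each lock occupies a full cache line; the spinlock itself is just an
 * int stand-in here — the point is the padding and the stripe selection. */
typedef struct {
    volatile int lock;
    char pad[CACHE_LINE - sizeof(int)];
} PaddedLock;

static PaddedLock stripes[N_STRIPES];

/* Map a buffer id to a stripe.  A cheap multiplicative hash spreads
 * consecutive ids across stripes better than a plain modulo would. */
static PaddedLock *stripe_for(uint32_t buf_id)
{
    return &stripes[(buf_id * 2654435761u >> 16) % N_STRIPES];
}
```

Note that padding alone only guarantees one lock per 64-byte slot; aligning the array base to a cache-line boundary (e.g. with C11 `_Alignas`) would complete the picture, which is omitted here for brevity.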
Re: [HACKERS] SDP query optimizer
Hi, On 22-03-2013 21:22, Josh Berkus wrote: Woah! Way cool. As a warning, we're in the closing throes of version 9.3 right now, so if your code/ideas don't get the attention they deserve, that's why. Ok. No problem. :-) There is an incomplete project from a few years back to make the non-exhaustive query planner pluggable so that we could use different algorithms. Unfortunately, it was never finished and merged with the core code. Your planner is yet another reason it would be great to complete this. Yes. I looked at the project by Julius and Tomas in pgFoundry [1] some years ago, but it was inactive. Therefore, I decided to start a new one. [1] - http://pgfoundry.org/projects/optimizer/ Anyway, good work for all of you. -- Adriano Lange
Re: [HACKERS] Single-argument variant for array_length and friends?
On 22 March 2013 09:12, Merlin Moncure wrote: > On Thu, Mar 21, 2013 at 2:00 AM, Pavel Stehule > wrote: >> lot of postgresql functions calculate with all items in array without >> respect to dimensions - like unnest. >> >> so concept "use outermost dim" is not in pg now, and should not be >> introduced if it is possible. More it goes against a verbosity concept >> introduced by ADA and reused in PL/SQL and PL/pgSQL. > > and pl/psm* Yeah, okay. That argument works for me. Let's go for option (a), only allow the user to omit the dimension argument if the array is 1-D. We still have the issue that Tom isn't convinced that the feature is worth pursuing -- Tom, would you please elaborate a little on what you dislike about it? I don't see much of a downside (just 3 extra pg_procs). Cheers, BJ
Re: [HACKERS] Page replacement algorithm in buffer cache
On Saturday, March 23, 2013 9:34 AM Jim Nasby wrote: > On 3/22/13 7:27 PM, Ants Aasma wrote: > > On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure > wrote: > > > One other interesting idea I have seen is closeable scalable nonzero > > indication (C-SNZI) from scalable rw-locks [1]. The idea there is to > > use a tree structure to dynamically stripe access to the shared lock > > counter when contention is detected. Downside is that a considerable > > amount of shared memory is needed so there needs to be some way to > > limit the resource usage. This is actually somewhat isomorphic to the > > nailing idea. > > > > The issue with the current buffer management algorithm is that it > > seems to scale badly with increasing shared_buffers. I think the > > improvements should concentrate on finding out what the problem is > > there and figuring out how to fix it. A simple idea to test would be > > to just partition shared buffers along with the whole clock sweep > > machinery into smaller ones, like the buffer mapping hash tables > > already are. This should at the very least reduce contention for the > > clock sweep even if it doesn't reduce work done per page miss. > > > > [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf > > Partitioned clock sweep strikes me as a bad idea... you could certainly > get unlucky and end up with a lot of hot stuff in one partition. > > Another idea that's been brought up in the past is to have something in > the background keep a minimum number of buffers on the free list. > That's how OS VM systems I'm familiar with work, so there's precedent > for it. > > I recall there were at least some theoretical concerns about this, but > I don't remember if anyone actually tested the idea. I have tried one of the ideas: adding the buffers the background writer finds reusable to the freelist. 
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs This can reduce the clock sweep work, as backends can find buffers on the freelist. It shows a performance improvement for read loads when the data fits in shared buffers, but when the data grows larger and I/O is involved, it shows some dip as well. With Regards, Amit Kapila.