Re: [HACKERS] pgindent README correction
On Wednesday, February 8, 2012, Bruce Momjian wrote: On Mon, Jan 09, 2012 at 01:32:49PM -0500, Robert Haas wrote: The one other issue I ran into in following the latest pgindent instructions was that I had to add #include <stdlib.h> to the parse.c file (as included in the pg_bsd_indent-1.1.tar.gz file at ftp://ftp.postgresql.org/pub/dev). Without it I got this: parse.c: In function 'parse': parse.c:236:6: warning: implicit declaration of function 'exit' parse.c:236:6: warning: incompatible implicit declaration of built-in function 'exit' Can someone fix that and put up a 1.2 version? Sounds like a job for Mr. Momjian. What server do I log into to update the ftp pgindent tgz file? The ftp master server is fornax.postgresql.org. //Magnus -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
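The fix is exactly the missing declaration; a minimal sketch of the change to pg_bsd_indent's parse.c (the neighboring include is an assumption about the file, shown only for placement):

    #include <stdio.h>      /* assumed to be already present */
    #include <stdlib.h>     /* the missing include: declares exit(), which
                             * silences both implicit-declaration warnings */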
Re: [HACKERS] controlling the location of server-side SSL files
On Tuesday, February 7, 2012, Peter Eisentraut wrote: On Tue, 2012-01-24 at 22:05 +0200, Peter Eisentraut wrote: One thing that is perhaps worth thinking about: Currently, we just ignore missing root.crt and root.crl files. With this patch, we still do this, even if the user has given a specific nondefault location. That seems a bit odd, but I can't think of a simple way to do it better. There's a review in the CF app for this finding only minor issues, so I'm marking this patch therein as "Ready for Committer". OK, no one had any concerns about the missing file behavior I described above? If not, then I'll commit it soon. I'm still worried about this. If we ignore a missing root.crt, then the effect is that authentication and certificate verification might fail, which would be annoying, but you'd notice it soon enough. But if we ignore a missing root.crl, we are creating a security hole. Yes, ignoring a missing file in a security context is definitely not good. It should throw an error. We have a few bad defaults from the old days around SSL for this, but if it requires breaking backwards compatibility to get it right, I think we should still do it. My best idea at the moment is that we should set these parameters to empty by default, and make users point them to existing files if they want to use that functionality. Comments? +1. Anybody who actually cares about setting up security is likely not going to rely on defaults anyway - and is certainly going to review whatever they are. So there should be no big problem there. //Magnus -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
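A sketch of the stricter rule being converged on — if the user explicitly configured a CRL file, failing to load it is an error rather than something to silently ignore (the parameter name and the OpenSSL plumbing here are assumptions for illustration, not the committed code):

    /* during server SSL context initialization */
    if (ssl_crl_file[0] != '\0')        /* explicitly set by the user */
    {
        X509_STORE *cvstore = SSL_CTX_get_cert_store(SSL_context);

        if (cvstore == NULL ||
            X509_STORE_load_locations(cvstore, ssl_crl_file, NULL) != 1)
            ereport(FATAL,
                    (errmsg("could not load SSL certificate revocation list file \"%s\"",
                            ssl_crl_file)));

        X509_STORE_set_flags(cvstore, X509_V_FLAG_CRL_CHECK);
    }
    /* with the proposed empty-by-default setting, an empty value simply means
     * "no CRL checking" -- there is no silent fallback on a missing file */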
Re: [HACKERS] Memory usage during sorting
On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes jeff.ja...@gmail.com wrote: The attached patch allows it to reuse that memory. On my meager system it reduced the building of an index on an integer column in a skinny 200 million row totally randomly ordered table by about 3% from a baseline of 25 minutes. Just to give a standard review: this patch is a one-line change, applies cleanly, and builds OK. I'm not quite sure what exactly you're trying to accomplish, but it seems to me that it's avoiding the first dumptuples cycle by forcing availMem to 0 even if it's negative. I read your comments as saying that it avoids alternating reads and writes back and forth over scattered memory, which thrashes the memory cache during the merge phase; but actually it doesn't affect the merge phase, only the initial dump phase, does it? If so, I'm not convinced that this change alone produced the 3% gain in your benchmark. Correct me, as I'm probably wrong. Anyway, the comment just above the change should be updated to match the new behavior. Thanks, -- Hitoshi Harada
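A sketch of the one-line change as the review above reads it (the field name follows tuplesort.c conventions, but this is a guess at the patch's shape, not the patch itself):

    /* in tuplesort.c, after the memtuples array has grown: if the growth
     * overdrew the memory budget, forgive the accounting debt instead of
     * immediately triggering the first dumptuples() cycle */
    if (state->availMem < 0)
        state->availMem = 0;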
Re: [HACKERS] Memory usage during sorting
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes jeff.ja...@gmail.com wrote: The worst thing about the current memory usage is probably that big machines can't qsort more than 16,777,216 tuples no matter how much memory they have, because memtuples has to live entirely within a single allocation, which is currently limited to 1GB. It is probably too late to fix this problem for 9.2. I wish I had started there rather than here. This 16,777,216 tuple limitation will get even more unfortunate if the specializations that speed up qsort but not external sort get accepted. I think it's a fair request to extend our 1GB palloc limitation to the full 64-bit space. I see there are a lot of applications that want more memory in a single palloc call — in user-defined functions, aggregates, etc. As you may notice, however, the areas of postgres affected by such a change need to be investigated deeply. I don't know where it would be safe to allow it and where not. varlena types are unsafe, but it might be possible to extend the varlena header to 64 bits in a major release somehow. Attached is a completely uncommittable proof of concept/work in progress patch to get around the limitation. It shows a 2-fold improvement when indexing an integer column on a 50,000,000 row randomly ordered table. In any case, we need a bird's-eye sketch before attacking this, but I guess it's worth doing and at some future point we definitely must, though I don't know if that's the next release or the third release from now. Thanks, -- Hitoshi Harada
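The 1GB ceiling under discussion is palloc's allocation guard; condensed from src/include/utils/memutils.h (comment wording mine):

    /* the hard cap on a single palloc request */
    #define MaxAllocSize    ((Size) 0x3fffffff)     /* 1 gigabyte - 1 */

    #define AllocSizeIsValid(size)  ((Size) (size) <= MaxAllocSize)

    /* any larger request fails with
     *   ERROR:  invalid memory alloc request size ...
     * tuplesort's memtuples array must fit in one such allocation and grows
     * by doubling, so on big machines it tops out long before RAM runs out */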
Re: [HACKERS] random_page_cost vs seq_page_cost
On 07/02/12 19:58, Bruce Momjian wrote: On Tue, Feb 07, 2012 at 05:06:18PM -0500, Greg Smith wrote: On 02/07/2012 03:23 PM, Bruce Momjian wrote: Where did you see that there will be an improvement in the 9.2 documentation? I don't see an improvement. I commented that I'm hoping for an improvement in the documentation of how much timing overhead impacts attempts to measure this area better. That's from the "add timing of buffer I/O requests" feature submission. I'm not sure if Bene read too much into that or not; I didn't mean to imply that the docs around random_page_cost have gotten better. I guess I did. But I'm very glad that as a side effect Bruce and Greg have improved it ;-)
Re: [pgsql-www] [HACKERS] pgindent README correction
On Wed, Feb 8, 2012 at 8:13 AM, Magnus Hagander mag...@hagander.net wrote: On Wednesday, February 8, 2012, Bruce Momjian wrote: On Mon, Jan 09, 2012 at 01:32:49PM -0500, Robert Haas wrote: The one other issue I ran into in following the latest pgindent instructions was that I had to add #include <stdlib.h> to the parse.c file (as included in the pg_bsd_indent-1.1.tar.gz file at ftp://ftp.postgresql.org/pub/dev). Without it I got this: parse.c: In function 'parse': parse.c:236:6: warning: implicit declaration of function 'exit' parse.c:236:6: warning: incompatible implicit declaration of built-in function 'exit' Can someone fix that and put up a 1.2 version? Sounds like a job for Mr. Momjian. What server do I log into to update the ftp pgindent tgz file? The ftp master server is fornax.postgresql.org. Though, you should bookmark ftpmaster.postgresql.org which will always point to the master FTP server even if it moves to a different host. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
On Wed, Feb 8, 2012 at 06:21, Tom Lane t...@sss.pgh.pa.us wrote: Marti Raudsepp ma...@juffo.org writes: Case #1 uses the normal textcat(text, text) operator by automatically coercing 'x' as text. However, case #2 uses anytextcat(anynonarray, text), which is marked volatile and thus acts as an optimization barrier. Hmm ... since those operators were invented (in 8.3), we have adopted a policy that I/O functions are presumed to be no worse than stable: http://archives.postgresql.org/pgsql-committers/2010-07/msg00307.php ISTM that would justify relabeling anytextcat/textanycat as stable, which should fix this. Yes, we should definitely take advantage of that. I scanned through all of pg_proc; there are 4 functions like this that can be changed: textanycat, anytextcat, quote_literal and quote_nullable. All of these have SQL wrappers to cast their argument to ::text.

 quote_literal  | select pg_catalog.quote_literal($1::pg_catalog.text)
 quote_nullable | select pg_catalog.quote_nullable($1::pg_catalog.text)
 textanycat     | select $1 || $2::pg_catalog.text
 anytextcat     | select $1::pg_catalog.text || $2

Patch attached (in git am format). Passes all regression tests (except 'json', which fails on my machine even on git master). No documentation changes necessary AFAICT. Regards, Marti

From e1943868d21316ff9126283efec54146c14e00fc Mon Sep 17 00:00:00 2001
From: Marti Raudsepp ma...@juffo.org
Date: Wed, 8 Feb 2012 11:26:03 +0200
Subject: [PATCH] Mark textanycat/quote_literal/quote_nullable functions as stable

These are SQL functions that rely on immutable functions/operators, but were previously marked volatile, since they relied on unknown ::text casts. As of commit aab353a60b95aadc00f81da0c6d99bde696c4b75, all text I/O functions are guaranteed to be stable or immutable.
Author: Marti Raudsepp ma...@juffo.org
---
 src/include/catalog/catversion.h |    2 +-
 src/include/catalog/pg_proc.h    |    8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index ae4e5f5..1d92ee3 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */

 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201202072
+#define CATALOG_VERSION_NO	201202081

 #endif

diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f8d01fb..d4206f1 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2271,11 +2271,11 @@
 DATA(insert OID = 1282 ( quote_ident	PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 25 "25" _null_ _null_ _null_ _null_ quote_ident _null_ _null_ _null_ ));
 DESCR("quote an identifier for usage in a querystring");
 DATA(insert OID = 1283 ( quote_literal	PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 25 "25" _null_ _null_ _null_ _null_ quote_literal _null_ _null_ _null_ ));
 DESCR("quote a literal for usage in a querystring");
-DATA(insert OID = 1285 ( quote_literal	PGNSP PGUID 14 1 0 0 0 f f f t f v 1 0 25 "2283" _null_ _null_ _null_ _null_ "select pg_catalog.quote_literal($1::pg_catalog.text)" _null_ _null_ _null_ ));
+DATA(insert OID = 1285 ( quote_literal	PGNSP PGUID 14 1 0 0 0 f f f t f s 1 0 25 "2283" _null_ _null_ _null_ _null_ "select pg_catalog.quote_literal($1::pg_catalog.text)" _null_ _null_ _null_ ));
 DESCR("quote a data value for usage in a querystring");
 DATA(insert OID = 1289 ( quote_nullable	PGNSP PGUID 12 1 0 0 0 f f f f f i 1 0 25 "25" _null_ _null_ _null_ _null_ quote_nullable _null_ _null_ _null_ ));
 DESCR("quote a possibly-null literal for usage in a querystring");
-DATA(insert OID = 1290 ( quote_nullable	PGNSP PGUID 14 1 0 0 0 f f f f f v 1 0 25 "2283" _null_ _null_ _null_ _null_ "select pg_catalog.quote_nullable($1::pg_catalog.text)" _null_ _null_ _null_ ));
+DATA(insert OID = 1290 ( quote_nullable	PGNSP PGUID 14 1 0 0 0 f f f f f s 1 0 25 "2283" _null_ _null_ _null_ _null_ "select pg_catalog.quote_nullable($1::pg_catalog.text)" _null_ _null_ _null_ ));
 DESCR("quote a possibly-null data value for usage in a querystring");
 DATA(insert OID = 1798 ( oidin	PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 26 "2275" _null_ _null_ _null_ _null_ oidin _null_ _null_ _null_ ));
@@ -2747,8 +2747,8 @@
 DESCR("adjust time precision");
 DATA(insert OID = 1969 ( timetz	PGNSP PGUID 12 1 0 0 0 f f f t f i 2 0 1266 "1266 23" _null_ _null_ _null_ _null_ timetz_scale _null_ _null_ _null_ ));
 DESCR("adjust time with time zone precision");
-DATA(insert OID = 2003 ( textanycat	PGNSP PGUID 14 1 0 0 0 f f f t f v 2 0 25 "25 2776" _null_ _null_ _null_ _null_ "select $1 || $2::pg_catalog.text" _null_ _null_ _null_ ));
-DATA(insert OID = 2004 ( anytextcat	PGNSP PGUID 14 1 0 0 0 f f f t f v 2 0 25 "2776 25" _null_ _null_ _null_ _null_ "select $1::pg_catalog.text || $2" _null_ _null_ _null_ ));
+DATA(insert OID = 2003 ( textanycat	PGNSP PGUID 14 1 0 0 0 f f f t f s 2 0 25 "25 2776" _null_ _null_ _null_ _null_ "select $1 || $2::pg_catalog.text" _null_ _null_ _null_ ));
+DATA(insert OID = 2004 ( anytextcat	PGNSP PGUID 14 1 0 0 0 f f f t f s 2 0 25 "2776 25" _null_ _null_ _null_ _null_ "select $1::pg_catalog.text || $2" _null_ _null_ _null_ ));
Re: [HACKERS] 16-bit page checksums for 9.2
On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote: On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote: On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote: On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote: This patch uses FPIs to guard against torn hint writes, even when the checksums are disabled. One could not simply condition them on the page_checksums setting, though. Suppose page_checksums=off and we're hinting a page previously written with page_checksums=on. If that write tears, leaving the checksum intact, that block will now fail verification. A couple of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the old page had a checksum. (b) Add a checksumEnableLSN field to pg_control. Whenever the cluster starts with checksums disabled, set the field to InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum failure occurs in a page with LSN older than the stored one, emit either a softer warning or no message at all. We can only change page_checksums at restart (now) so the above seems moot. If we crash with FPWs enabled we repair any torn pages. There's no live bug, but that comes at a high cost: the patch has us emit full-page images for hint bit writes regardless of the page_checksums setting. Sorry, I don't understand what you mean. I don't see any failure cases that require that. page_checksums can only change at a shutdown checkpoint. The statement "If that write tears, leaving the checksum intact, that block will now fail verification" cannot happen, ISTM. If we write out a block we update the checksum if page_checksums is set, or we clear it. If we experience a torn page at crash, the FPI corrects that, so the checksum never does fail verification. We only need to write a FPI when we write with checksums. If that's wrong, please explain a failure case in detail. PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding is insufficient. Am serialising this by only writing PageLSN while holding buf hdr lock. That means also taking the buffer header spinlock in every PageGetLSN() caller holding only a shared buffer content lock. Do you think that will pay off, versus the settled pattern of trading here your shared buffer content lock for an exclusive one? Yes, I think it will pay off. This is the only code location where we do that, and we are already taking the buffer header lock, so there is effectively no overhead. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
On Wed, Feb 08, 2012 at 02:44:13PM +0900, Shigeru Hanada wrote: (2012/02/07 23:44), Marko Kreen wrote: On Tue, Feb 07, 2012 at 07:25:14PM +0900, Shigeru Hanada wrote: - What is the right (or recommended) way to prevent throwing an exception in a row-processor callback function? When should the author use a PG_TRY block to catch exceptions in the callback function? When it calls backend functions that can throw exceptions. As all handlers running in the backend will want to convert data to Datums, that means always wrapping handler code in PG_TRY? ISTM that telling a) what happens to PGresult and PGconn when the row processor exits without returning, and b) how to recover them, would be necessary. We can't assume much about the caller, because libpq is a client library. IMHO it's most important to tell authors of row processors clearly what should be done in the error case. Yes. In such a case, must we call PQfinish, or is calling PQgetResult until it returns NULL enough to reuse the connection? IIUC calling pqClearAsyncResult seems sufficient, but it's not exported for clients... Simply dropping ->result is not useful, as there are still rows coming in. Now you cannot process them anymore. The rule of "after an exception it's valid to close the connection with PQfinish() or continue processing with PQgetResult()/PQisBusy()/PQconsumeInput()" seems quite simple and clear. Perhaps only a clarification of what's valid on a sync connection and what's valid on an async connection would be useful. -- marko
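A minimal sketch of the recovery rule Marko describes, using only stock libpq calls (the row-processor API itself was a proposed extension and is not shown):

    /* after an error is thrown while rows are still streaming in,
     * either abandon the connection entirely ... */
    PQfinish(conn);

    /* ... or drain the remaining results so the connection can be reused */
    PGresult *res;
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);
    /* conn is now idle again and ready for the next query */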
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
(2012/02/02 18:24), Marko Kreen wrote: I think you want this instead: https://commitfest.postgresql.org/action/patch_view?id=769 With a modified version of pgsql_fdw which uses the row processor to retrieve result tuples, I found a significant performance gain on simple read-only pgbench, though the scale factor was very small (-s 3). The executed command was pgbench -S -n -c 5 -T 30. Average tps (excluding connections establishing) over 3 measurements:

pgsql_fdw with SQL-cursor    :  622
pgsql_fdw with Row Processor : 1219 -- 2.0x faster than SQL-cursor
w/o pgsql_fdw (direct access): 7719 -- 6.3x faster than Row Processor

The row processor was almost 2x faster than SQL-cursor! I'm looking forward to this feature. In addition to the performance gain, memory usage was of course kept at a very low level. :) Regards, -- Shigeru Hanada
Re: [HACKERS] xlog location arithmetic
On Wed, Feb 8, 2012 at 2:29 AM, Euler Taveira de Oliveira eu...@timbira.com wrote: On 26-01-2012 06:19, Fujii Masao wrote: Thanks for your review. Comments below. When I compiled the source with xlogdiff.patch, I got the following warnings. xlogfuncs.c:511:2: warning: format '%lX' expects argument of type 'long unsigned int *', but argument 3 has type 'uint64 *' [-Wformat] What is your compiler? I'm using gcc 4.6.2. I refactored the patch so I'm now using XLogRecPtr and %X.

gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)
$ uname -a
Linux hermes 3.0.0-15-generic #26-Ubuntu SMP Fri Jan 20 15:59:53 UTC 2012 i686 i686 i386 GNU/Linux

In the updated version of the patch, I got no warnings at compile time. But initdb failed because the OID which you assigned to pg_xlog_location_diff() has already been used for another function. So you need to update pg_proc.h.

postgres=# SELECT pg_xlog_location_diff('0/274', '0/274');
ERROR:  xrecoff 274 is out of valid range, 0..A4A534C

Ugh? I can't reproduce that. It seems to be related to the long int used by the prior version. Maybe. But another problem happened. When I changed pg_proc.h so that the unused OID was assigned to pg_xlog_location_diff(), and executed the above again, I encountered a segmentation fault:

LOG:  server process (PID 14384) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: SELECT pg_xlog_location_diff('0/274', '0/274');
LOG:  terminating any other active server processes

ISTM that the cause is that int8_numeric() is executed for a uint32 value. Should we use int4_numeric() instead? While the output was int8 I could use pg_size_pretty, but now I can't. I attached another patch that implements pg_size_pretty(numeric). I realized that it collides with pg_size_pretty(int8) if we don't specify a type. Hence, I decided to drop pg_size_pretty(int8) in favor of pg_size_pretty(numeric). It is slower than the former, but it is not a performance-critical function. I'm OK with this.

-DATA(insert OID = 2288 ( pg_size_pretty	PGNSP PGUID 12 1 0 0 0 f f f t f v 1 0 25 "20" _null_ _null_ _null_ _null_ pg_size_pretty _null_ _null_ _null_ ));
-DESCR("convert a long int to a human readable text using size units");
+DATA(insert OID = 3158 ( pg_size_pretty	PGNSP PGUID 12 1 0 0 0 f f f t f v 1 0 25 "1700" _null_ _null_ _null_ _null_ pg_size_pretty _null_ _null_ _null_ ));
+DESCR("convert a numeric to a human readable text using size units");

Why does the OID need to be reassigned? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
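One detail worth spelling out: int4_numeric() takes a signed int32, so feeding it an xrecoff above 2^31 would come out negative; the safe pattern is to widen to int64 and keep int8_numeric(). A sketch (the catalog functions named are real; the surrounding computation is illustrative only):

    /* turn an unsigned 32-bit byte offset into a numeric Datum safely */
    uint32  xrecoff_diff = compute_diff();   /* hypothetical helper */
    int64   widened = (int64) xrecoff_diff;  /* widen before Datum-izing */

    PG_RETURN_DATUM(DirectFunctionCall1(int8_numeric,
                                        Int64GetDatum(widened)));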
Re: [HACKERS] ecpglib use PQconnectdbParams
On Mon, 2012-02-06 at 21:11 +0100, Michael Meskes wrote: On Fri, Feb 03, 2012 at 01:15:30PM +0100, Christian Ullrich wrote: Shouldn't these be [7]? You can have up to 6 items, plus the terminator. I take it only keywords have to be [7], right? Committed, thanks for spotting this. There seems to be one more problem that I haven't had time to tackle yet. With this patch the connection regression test (make checktcp) doesn't run through anymore because one test connection cannot be made. It spits out the error message:

FATAL: invalid command-line arguments for server process
HINT: Try "postgres --help" for more information.

This connection used to work without the patch, and besides, the error message is not really telling in this context. You can get more information on that with the attached patch, which might be a useful addition in general. The problem comes from this:

exec sql connect to unix:postgresql://localhost/connectdb?connect_timeout=14 user connectuser;

and the updated error message is:

ECPGconnect: could not open database: FATAL: invalid command-line arguments for server process: connect_timeout=14

Because connect_timeout is a separate libpq connection parameter, but here it ends up stuffed into the options string. It might have worked more or less by accident before. It's not clear to me why this only appears on checktcp. And why we don't run those tests by default. That should be clarified, because there are also other failures in that test that show that it hasn't been run in a while.

diff --git i/src/backend/tcop/postgres.c w/src/backend/tcop/postgres.c
index 49a3969..d5bd73f 100644
--- i/src/backend/tcop/postgres.c
+++ w/src/backend/tcop/postgres.c
@@ -3375,7 +3375,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx)
 	if (IsUnderPostmaster)
 		ereport(FATAL,
 				(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("invalid command-line arguments for server process"),
+				 errmsg("invalid command-line arguments for server process: %s", argv[optind]),
 				 errhint("Try \"%s --help\" for more information.", progname)));
 	else
 		ereport(FATAL,
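The underlying distinction, sketched with plain libpq (the connection values are made up): connect_timeout is a first-class connection keyword, while the options keyword only carries command-line style switches for the backend — which is exactly why stuffing the timeout in there trips the FATAL error above.

    const char *keywords[] = {"host", "dbname", "user",
                              "connect_timeout",    /* its own parameter */
                              NULL};
    const char *values[]   = {"localhost", "connectdb", "connectuser",
                              "14",
                              NULL};

    PGconn *conn = PQconnectdbParams(keywords, values, 0);
    /* passing "options" = "connect_timeout=14" instead would hand the string
     * to the server process's command line, which rejects it */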
Re: [HACKERS] Can PQstatus() be used by Application to check connection to postgres periodically?
2012/2/8 Pavel Golub pa...@microolap.com: Hello, sujayr06. You wrote: Hello All, My application has to do a real time data upload to a PostgreSQL server. Every time I have to do a real time upload, I do not wish to open a new connection. I want to open a connection once [when my application comes up] and periodically check if the connection is active. Can PQstatus() be used by the application to check the status of the connection already established? If PQstatus() cannot be used, does PostgreSQL provide an alternate interface to check the status of the connection? Note: I do not wish to open a connection on every real time upload, as it's an overhead for the application. Appreciate any help! You may use PQtransactionStatus for this purpose (http://www.postgresql.org/docs/9.1/static/libpq-status.html) Only if you don't care about getting the right answer. Neither PQstatus() nor PQtransactionStatus() makes any attempt to determine whether the server is still there; they just return the cached status of the connection, without any system calls whatsoever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
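A sketch of what an actual liveness check has to look like — force a round trip to the server, then consult PQstatus(), which libpq updates once the I/O fails (the reconnect policy is up to the application):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* cheap keepalive: an empty query still does a full server round trip */
    PGresult *res = PQexec(conn, "");
    if (res != NULL)
        PQclear(res);

    if (PQstatus(conn) == CONNECTION_BAD)
    {
        PQreset(conn);                    /* try to re-establish the session */
        if (PQstatus(conn) == CONNECTION_BAD)
            fprintf(stderr, "server unreachable: %s", PQerrorMessage(conn));
    }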
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
On Wed, Feb 8, 2012 at 4:53 AM, Marti Raudsepp ma...@juffo.org wrote: Patch attached (in git am format). Passes all regression tests (except 'json' which fails on my machine even on git master). Can you provide the diffs from the json test on your machine? I don't see any build-farm failures off-hand... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Progress on fast path sorting, btree index creation time
It doesn't necessarily matter if we increase the size of the postgres binary by 10%, precisely because most of that is not going to be in play from one instant to the next. I'm thinking, in particular, of btree index specialisations, where it could make perfect sense to go crazy. You cannot have a reasonable discussion about such costs without considering that they will perhaps never be paid, given any plausible workload. That's why the percentage that the postgres binary size has been shown to increase by really isn't pertinent at all. At best, it's a weak proxy for such costs, assuming you don't have a demoscene-like preoccupation with reducing binary size, and I don't believe that we should. It would be difficult for me to measure such things objectively, but I'd speculate that the proprietary databases have much larger binaries than ours, while having far fewer features, precisely because they started applying tricks like this a long time ago. You could counter that their code bases probably look terrible, and you'd have a point, but so do I. On 8 February 2012 02:38, Robert Haas robertmh...@gmail.com wrote: I've been skeptical of this patch for a number of reasons, the major one of which is that most workloads spend only a very small amount of time doing quicksorts, and most quicksorts are of very small amounts of data and therefore fast anyway. It is easy to construct an artificial test case that does lots and lots of in-memory sorting, but in real life I think that's not a great part of what people use databases for. Fair enough, but if that's true, then it's also true that the cost due to cache marginalisation - the only cost that I think is worth considering at all - is correspondingly a small fraction of that very small amount of sorting. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] 16-bit page checksums for 9.2
On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote: On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote: On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote: On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote: This patch uses FPIs to guard against torn hint writes, even when the checksums are disabled. One could not simply condition them on the page_checksums setting, though. Suppose page_checksums=off and we're hinting a page previously written with page_checksums=on. If that write tears, leaving the checksum intact, that block will now fail verification. A couple of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the old page had a checksum. (b) Add a checksumEnableLSN field to pg_control. Whenever the cluster starts with checksums disabled, set the field to InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum failure occurs in a page with LSN older than the stored one, emit either a softer warning or no message at all. We can only change page_checksums at restart (now) so the above seems moot. If we crash with FPWs enabled we repair any torn pages. There's no live bug, but that comes at a high cost: the patch has us emit full-page images for hint bit writes regardless of the page_checksums setting. Sorry, I don't understand what you mean. I don't see any failure cases that require that. page_checksums can only change at a shutdown checkpoint. The statement "If that write tears, leaving the checksum intact, that block will now fail verification" cannot happen, ISTM. If we write out a block we update the checksum if page_checksums is set, or we clear it. If we experience a torn page at crash, the FPI corrects that, so the checksum never does fail verification. We only need to write a FPI when we write with checksums. If that's wrong, please explain a failure case in detail. In the following description, I will treat a heap page as having two facts. The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT. A torn page write lacking the protection of a full-page image can update one or both facts despite the transaction having logically updated both. (This is just the normal torn-page scenario.) Given that, consider the following sequence of events:

1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown
5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT. no WAL since page_checksums=off
7. begin write of P
8. crash. torn write of P only updates HINT. disk state now CHKSUM, HINT
9. startup with page_checksums = off
10. crash recovery. does not touch P, because step 6 generated no WAL
11. clean shutdown
12. startup with page_checksums = on
13. read P. checksum failure

Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch, because step 6 _does_ write WAL. That ought to change for the sake of page_checksums=off performance, at which point we must protect the above scenario by other means. PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding is insufficient. Am serialising this by only writing PageLSN while holding buf hdr lock. That means also taking the buffer header spinlock in every PageGetLSN() caller holding only a shared buffer content lock. Do you think that will pay off, versus the settled pattern of trading here your shared buffer content lock for an exclusive one? Yes, I think it will pay off. This is the only code location where we do that, and we are already taking the buffer header lock, so there is effectively no overhead. The sites I had in mind are the calls to PageGetLSN() in FlushBuffer() (via BufferGetLSN()) and XLogCheckBuffer(). We don't take the buffer header spinlock at either of those locations. If they remain safe under the new rules, we'll need comments explaining why. I think they may indeed be safe, but it's far from obvious. Thanks, nm
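A sketch of the serialization rule being debated — writing the page LSN only while holding the buffer-header spinlock, so readers holding just a shared content lock can see a consistent 8-byte value (names follow bufmgr.c of that era; this is the pattern under discussion, not committed code):

    /* writer side: holding at least a shared buffer content lock */
    volatile BufferDesc *bufHdr = &BufferDescriptors[buffer - 1];

    LockBufHdr(bufHdr);            /* PageSetLSN() is not atomic on its own */
    PageSetLSN(page, recptr);
    UnlockBufHdr(bufHdr);

    /* reader side: Noah's point is that every PageGetLSN() done under a mere
     * shared content lock -- e.g. via BufferGetLSN() in FlushBuffer() and in
     * XLogCheckBuffer() -- would need the same spinlock to be safe */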
Re: [HACKERS] Add protransform for numeric, varbit, and temporal types
On Tue, Feb 7, 2012 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote: I've committed the numeric and varbit patches and will look at the temporal one next. Committed, after changing the OIDs so they don't conflict. Here's one more case for you to ponder:

rhaas=# create table noah (i interval day);
CREATE TABLE
rhaas=# alter table noah alter column i set data type interval second(3);
DEBUG:  rewriting table noah
ALTER TABLE

Do we really need a rewrite in that case? The code acts like the interval range and precision are separate beasts, but is that really true? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
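For context, an interval typmod packs both pieces into a single value, which is what the transform function has to reason about (the macros below match src/include/utils/timestamp.h; the skip-rewrite test is a simplified guess at the logic, not the committed rule):

    /* field-range mask lives in the high bits, seconds precision in the low */
    #define INTERVAL_RANGE(t)      (((t) >> 16) & INTERVAL_RANGE_MASK)
    #define INTERVAL_PRECISION(t)  ((t) & INTERVAL_PRECISION_MASK)

    /* a rewrite is provably unnecessary only when no stored value can change:
     * same field range, and the precision does not decrease */
    bool rewrite_needed =
        !(INTERVAL_RANGE(new_typmod) == INTERVAL_RANGE(old_typmod) &&
          INTERVAL_PRECISION(new_typmod) >= INTERVAL_PRECISION(old_typmod));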
Re: [HACKERS] Vacuum rate limit in KBps
Excerpts from Bruce Momjian's message of Wed Feb 08 00:58:58 -0300 2012: As much as I hate to poo-poo a patch addition, I have to agree with Robert Haas on this one. Renaming settings really isn't moving us forward. It introduces a migration problem and really doesn't move us forward in solving the underlying problem. Additional monitoring, while helpful, also is only a stop-gap. I think that (part of) the underlying problem is that we have no clear way to specify "how much I/O do you want autovacuum to use". That's what this patch is all about, AFAIU; it has nothing to do with monitoring. Right now, as has been said, the only way to tweak this is to change vacuum_cost_delay; the problem with that setting is that making the calculation is not straightforward. (Now, I disagree that it's so complex that it cannot ever be explained to a class; or that it's so obscure that the only way to make it work is to leave it alone and never touch it. It's complex, okay, but it's not exactly rocket science either.) If the only real downside to this patch is that some people have already changed vacuum_cost_delay and they will want to migrate those settings forward, maybe we shouldn't be looking at _replacing_ that one with a new setting, but rather just add the new setting; and in the code for each, make sure that only one of them is set, and throw an error if the other one is. On Thu, Jan 19, 2012 at 09:42:52PM -0500, Robert Haas wrote: Another problem is that the vacuum algorithm itself could, I think, be made much smarter. We could teach HOT to prune pages that contain no HOT chains but do contain dead tuples. That would leave dead line pointers behind, but that's not nearly as bad as leaving the entire tuple behind. We could, as Simon and others have suggested, have one threshold for vacuuming the heap (i.e. reclaiming dead tuples) and another for vacuuming the indexes (i.e. reclaiming dead line pointers). That would open the door to partial vacuuming: just vacuum half a gigabyte or so of the heap, and then move on; the next vacuum can pick up where that one left off, at least up to the point where we decide we need to make an index pass; it would possibly also allow us to permit more than one vacuum on the same table at the same time, which is probably needed for very large tables. We could have backends that see dead tuples on a page throw them over the fence to the background writer for immediate pruning. I blather, but I guess my point is that I really hope we're going to do something deeper here at some point in the near future, whatever becomes of the proposals now on the table. This is all fine, but what does it have to do with the current patch? I mean, if we change vacuum to do some stuff differently, it's still going to have to read and dirty pages and thus account for I/O. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Progress on fast path sorting, btree index creation time
Peter Geoghegan pe...@2ndquadrant.com writes: It doesn't necessarily matter if we increase the size of the postgres binary by 10%, precisely because most of that is not going to be in play from one instant to the next. You've heard of swapping, no? Basically what I'm hearing from you is total denial that binary bloat costs anything, and that just does not square with reality. Even if the costs in performance terms are negligible in many common situations (something that you've asserted but without offering any proof), there are other costs; software distribution for instance is definitely sensitive to size. As a Red Hat person I've had to spend time fighting to keep Postgres included in the minimum "does it fit on a DVD" distribution of RHEL/Fedora. That case gets harder to make every year, and yet it's pretty critical to the success of the project --- if you don't get distributed, you lose users. IMO this patch is already well past the point of diminishing returns in value-per-byte-added. I'd like to see it trimmed back to provide a fast path for just single-column int4/int8/float4/float8 sorts. The other cases aren't going to offer enough of a win to justify the code space. regards, tom lane
Re: [HACKERS] Vacuum rate limit in KBps
On Wed, Feb 8, 2012 at 9:38 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: I think that (part of) the underlying problem is that we have no clear way to specify "how much I/O do you want autovacuum to use". That's what this patch is all about, AFAIU; it has nothing to do with monitoring. Right now, as has been said, the only way to tweak this is to change vacuum_cost_delay; the problem with that setting is that making the calculation is not straightforward. I agree. (Now, I disagree that it's so complex that it cannot ever be explained to a class; or that it's so obscure that the only way to make it work is to leave it alone and never touch it. It's complex, okay, but it's not exactly rocket science either.) I emphatically agree. If the only real downside to this patch is that some people have already changed vacuum_cost_delay and they will want to migrate those settings forward, maybe we shouldn't be looking at _replacing_ that one with a new setting, but rather just add the new setting; and in the code for each, make sure that only one of them is set, and throw an error if the other one is. I think that part of the confusion here is that the current settings aren't strictly trying to regulate vacuum's I/O rate so much as the total amount of work it can do per unit time. Even if autovac dirties no pages and has no misses, the pages it hits will eventually fill up the cost limit and it will sleep; but the work in that case is all CPU utilization, not I/O. Similarly, we don't have separate control knobs for read I/O (misses) and write I/O (dirty pages). The sleeps happen due to a blend of factors (CPU, read I/O, write I/O) which are combined to produce an estimate of the total impact of vacuum on the rest of the system. Now, an argument could be made that we ought not to care about that. We could say: trying to use this blended algorithm to figure out when vacuum is getting too expensive is a loser. We just want to limit the amount of dirty data that's being generated, because that's the only thing that matters to us. In that case, I could see adding a new setting, sitting alongside the existing settings: amount of dirty data that can be generated per second. But that's not what Greg is proposing. He is proposing, essentially, that we keep the blended algorithm, but then stack another calculation on top that backs into the blended cost limit based on the user's tolerance for dirtying data. So there will still be limits on the amount of read I/O and CPU usage, but they'll be derived from the allowable rate of data dirtying and the ratios between the different cost parameters. The current settings aren't exactly intuitive, but I really don't want to have to explain to a customer that setting the dirty data limit to 8MB/s will actually limit it to just 5.3MB/s if all the pages are not in shared buffers (because the page miss costs will account for a third of the budget) and to 7.6MB/s if all the pages are in shared buffers (because the page hit costs will account for a twenty-first of the budget) and somewhere in between if only some of the pages are resident; and that, further, by setting the dirty data rate to 8MB/s, they've implicitly set the max read rate from disk at 16MB/s, again because of the 2:1 dirty:miss cost ratio. Yikes! Honestly, I think the best place for this work is in something like pgtune. It's totally useful to have a calculator for this stuff (for many of the same reasons that it's useful to have explain.depesz.com) but the abstraction being proposed is leaky enough that I think it's bound to cause confusion. This is all fine, but what does it have to do with the current patch? I mean, if we change vacuum to do some stuff differently, it's still going to have to read and dirty pages and thus account for I/O. Yeah, I drifted off topic there a bit. I think the only relevant point in all that is that even if we all agreed that this is an improvement, I'd be reluctant to slap a band-aid on something that I think needs surgery. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
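A sketch of the back-calculation Robert is walking through, using the default cost factors (vacuum_cost_page_hit = 1, vacuum_cost_page_miss = 10, vacuum_cost_page_dirty = 20); the function itself is purely illustrative:

    #include <stdio.h>

    /* effective dirty-page write rate, given a target budget expressed as a
     * dirty rate and the fraction of vacuumed pages found in shared buffers */
    static double
    effective_dirty_mbps(double target_mbps, double hit_fraction)
    {
        const double hit = 1.0, miss = 10.0, dirty = 20.0;

        /* average cost charged per page, assuming every page gets dirtied */
        double per_page = hit_fraction * hit
                        + (1.0 - hit_fraction) * miss
                        + dirty;

        /* only the "dirty" share of the budget produces dirty-page writes */
        return target_mbps * dirty / per_page;
    }

    int main(void)
    {
        printf("all misses: %.1f MB/s\n", effective_dirty_mbps(8.0, 0.0)); /* ~5.3 */
        printf("all hits:   %.1f MB/s\n", effective_dirty_mbps(8.0, 1.0)); /* ~7.6 */
        return 0;
    }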
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Wed, Feb 8, 2012 at 8:33 AM, Peter Geoghegan pe...@2ndquadrant.com wrote: It doesn't necessarily matter if we increase the size of the postgres binary by 10%, precisely because most of that is not going to be in play from one instant to the next. As Tom says, that doesn't jibe with my experience. If you add on enough binary bloat, you will have more page faults. It's true (as I think you pointed out upthread) that packing all the copies of quicksort into the binary one after the other minimizes the effect on other things, but it doesn't eliminate it entirely. If you've got this:

    random other stuff
    ...
    a zillion copies of quicksort
    more other stuff

...then a branch from the "random other stuff" section of the binary to the "more other stuff" section of the binary may cost more. For example, suppose the OS does X amount of readahead. By stuffing all those copies of quicksort into the middle there, you increase the chances that the page you need was beyond the readahead window. Or, if it wasn't going to be in the readahead window either way, then you increase the distance that the disk head needs to move to find the required block. These costs are very small, no question about it. They are almost impossible to measure individually, in much the same way that the cost of pollution or greenhouse gas emissions is difficult to measure. But it's an error to assume that because the costs are individually small that they will never add up to anything material. As long as you keep insisting on that, it's hard to have a rational conversation. We can reasonably debate the magnitude of the costs, but to assert that they don't exist gets us nowhere. Suppose we could get a 10% speedup on sin(numeric) by adding 40GB to the binary size. Would you be in favor of that? Do you think it would hurt performance on any other workloads? Would our packagers complain at all? Surely your same argument would apply to that case in spades: anyone who is not using the gigantic hard-coded lookup table will not pay any portion of the cost of it. It would be difficult for me to measure such things objectively, but I'd speculate that the proprietary databases have much larger binaries than ours, while having far fewer features, precisely because they started applying tricks like this a long time ago. You could counter that their code bases probably look terrible, and you'd have a point, but so do I. That might be true; I have no idea. There are probably lots of reasons why their code bases look terrible, including a long history of backward compatibility with now-defunct versions, a variety of commercial pressures, and the fact that they don't have to take flak in the public square for what their code looks like. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote: IMO this patch is already well past the point of diminishing returns in value-per-byte-added. I'd like to see it trimmed back to provide a fast path for just single-column int4/int8/float4/float8 sorts. The other cases aren't going to offer enough of a win to justify the code space. I'm curious about how much we're gaining from the single-column specializations vs. the type-specific specializations. I think I'm going to go try to characterize that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ecpglib use PQconnectdbParams
Peter Eisentraut pete...@gmx.net writes:

 	if (IsUnderPostmaster)
 		ereport(FATAL,
 				(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("invalid command-line arguments for server process"),
+				 errmsg("invalid command-line arguments for server process: %s", argv[optind]),
 				 errhint("Try \"%s --help\" for more information.", progname)));
 	else
 		ereport(FATAL,

+1 for that change, but please s/arguments/argument/ in the text. Also, what about the other branch of the if that's just out of sight here? regards, tom lane
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Wed, Feb 08, 2012 at 01:33:30PM +0000, Peter Geoghegan wrote: It doesn't necessarily matter if we increase the size of the postgres binary by 10%, precisely because most of that is not going to be in play from one instant to the next. I'm thinking, in particular, of btree index specialisations, where it could make perfect sense to go crazy. You cannot have a reasonable discussion about such costs without considering that they will perhaps never be paid, given any plausible workload. That's why the percentage that the postgres binary size has been shown to increase by really isn't pertinent at all. At best, it's a weak proxy for such costs, assuming you don't have a demoscene-like preoccupation with reducing binary size, and I don't believe that we should. When you start a binary, your executable is mapped to a file system binary, and you page-fault in the pages you need. Now, if your optimization was alone in its own 4k (x86) virtual pages, and you never called the functions, odds are you would not pay a penalty, aside from the distribution penalty, and perhaps a small penalty if useful code was before and after your block. The sort code expansion, however, is done in-line, in the middle of the sort code; you clearly are filling 64-byte (x86) CPU cache lines with type-specific expansion code for every sort case, whether we use the code or not. Now, I don't think it is a big problem, and I think the speedup is worth it for common data types, but we can't say the cost is zero. Saying it another way, having a binary in your file system that you never call is not overhead except for storage, but in this case the sort code expansion is inside critical functions we are already calling. Frankly, it is this cost that has kept us away from using such optimizations for a long time. I recently posted that the zero-cost optimizations are mostly completed for sort and COPY, and we have to start considering non-zero-cost optimizations --- sad, but true. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
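A condensed illustration of the trade being discussed — the generic path pays an indirect call per comparison, while a specialization clones the whole sort loop with the comparator inlined, which is exactly the type-specific code that lands in the middle of hot cache lines (a schematic in the style of the patch, not its actual template; Datum and DatumGetInt32 come from postgres.h):

    /* generic path: one qsort body, indirect comparator call per compare */
    int (*cmp) (const void *a, const void *b);

    /* specialized path: a dedicated int4 comparator the compiler can inline,
     * giving one branch-predictable compare instead of a function call */
    static inline int
    cmp_int4(Datum a, Datum b)
    {
        int32 av = DatumGetInt32(a);
        int32 bv = DatumGetInt32(b);

        return (av < bv) ? -1 : (av > bv) ? 1 : 0;
    }
    /* the patch effectively stamps out qsort_int4(), qsort_int8(), ... --
     * one full sort-loop body per supported type, hence the binary growth */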
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote: On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote: IMO this patch is already well past the point of diminishing returns in value-per-byte-added. I'd like to see it trimmed back to provide a fast path for just single-column int4/int8/float4/float8 sorts. The other cases aren't going to offer enough of a win to justify the code space. I'm curious about how much we're gaining from the single-column specializations vs. the type-specific specializations. I think I'm going to go try to characterize that. Yes, please. That would be a big help. Is there no optimization for strings? I assume they are sorted a lot. We can always add more data types later. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] pgstat documentation tables
On Tue, Feb 07, 2012 at 11:29:12PM -0500, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: How do people feel about pulling text out of the SGML docs and loading it into the database as table and column comments? I'm not thrilled by that proposal. The length limits on comments are very much shorter than what is sensible to use in catalogs.sgml (at least if you want the comments to look passable in \d displays). And I don't want people trying to serve two different use-cases when they write those. I was thinking of just grabbing the first X characters from the SGML comments. Perhaps it'd work to pull column comments from the C header files --- any same-line comment in catalog/pg_*.h is probably short enough to serve this purpose usefully. Yes, that would be simpler. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
Robert Haas robertmh...@gmail.com writes: On Wed, Feb 8, 2012 at 4:53 AM, Marti Raudsepp ma...@juffo.org wrote: Patch attached (in git am format). Passes all regression tests (except 'json' which fails on my machine even on git master). Can you provide the diffs from the json test on your machine? I don't see any build-farm failures off-hand... I'm seeing diffs too after applying Marti's patch: instead of z, b, etc., the field labels in the json values look like f1, f2, etc. in the output of queries such as

SELECT row_to_json(q) FROM
  (SELECT $$a$$ || x AS b, y AS c,
          ARRAY[ROW(x.*,ARRAY[1,2,3]), ROW(y.*,ARRAY[4,5,6])] AS z
   FROM generate_series(1,2) x, generate_series(4,5) y) q;

I believe what is happening is that now that the planner can flatten the sub-select, the RowExprs are getting expanded differently, and that ties into the "when do we lose column names" business that Andrew has been on about. However, I was not seeing that before applying the patch, so maybe Marti has another issue too. I am going to go ahead and commit the patch with the altered json results, because IMO it is mere accident that these regression cases were coming out with nice field labels anyway. When and if Andrew gets the RowExpr cases fixed properly, that will show up as these cases going back to nicer-looking output. regards, tom lane
Re: [HACKERS] Vacuum rate limit in KBps
On Wed, Feb 08, 2012 at 09:56:17AM -0500, Robert Haas wrote: This is all fine, but what does it have to do with the current patch? I mean, if we change vacuum to do some stuff differently, it's still going to have to read and dirty pages and thus account for I/O. Yeah, I drifted off topic there a bit. I think the only relevant point in all that is that even if we all agreed that this is an improvement, I'd be reluctant to slap a band-aid on something that I think needs surgery. Agreed. I see this only as a short-term fix that will be changed in the next release if we really decide to tackle the problem. What percentage of our userbase is ever going to adjust these settings --- surely no more than 1%. What we have now just isn't cutting it for 99% of our users, and we need to address that if we are going to ever make any real headway here. Why can't vacuum handle things automatically like checkpoint smoothing? Why can't it detect when it is falling behind and speed up? Why can't it see a busy background writer and slow down? Unless we answer these questions, we are not solving the problem for 99% of our users. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Progress on fast path sorting, btree index creation time
Bruce Momjian br...@momjian.us writes: Yes, please. That would be a big help. Is there no optimization for strings? I assume they are sorted a lot. It seems unlikely that it'd be worth including strings, especially if your locale is not C. This whole thing only makes sense for datatypes that are comparable in approximately 1 machine instruction. regards, tom lane
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Wed, Feb 08, 2012 at 11:35:46AM -0500, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: Yes, please. That would be a big help. Is there no optimization for strings? I assume they are sorted a lot. It seems unlikely that it'd be worth including strings, especially if your locale is not C. This whole thing only makes sense for datatypes that are comparable in approximately 1 machine instruction. Ah, OK, interesting. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
On Wed, Feb 8, 2012 at 18:20, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Wed, Feb 8, 2012 at 4:53 AM, Marti Raudsepp ma...@juffo.org wrote: Patch attached (in git am format). Passes all regression tests (except 'json' which fails on my machine even on git master). Can you provide the diffs from the json test on your machine? I don't see any build-farm failures off-hand... I'm seeing diffs too after applying Marti's patch: instead of z, b, etc, the field labels in the json values look like f1, f2, etc in the output of queries such as Sorry, my bad. I guess I got the two versions (patched and unpatched) mixed up. Indeed this difference disappears when I remove my patch. I'll have to be more careful when checking regression diffs in the future. :) Regards, Marti
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On 8 February 2012 15:17, Robert Haas robertmh...@gmail.com wrote: On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote: IMO this patch is already well past the point of diminishing returns in value-per-byte-added. I'd like to see it trimmed back to provide a fast path for just single-column int4/int8/float4/float8 sorts. The other cases aren't going to offer enough of a win to justify the code space. I'm curious about how much we're gaining from the single-column specializations vs. the type-specific specializations. I think I'm going to go try to characterize that. I think it might make more sense to lose the type-specific specialisations for the multi-key case while adding a generic multi-key specialisation, than to lose all multi-key specialisations, though I have not considered that question at length, and would think that we'd still want to keep an int4 version in that case. Note that I *did not* include a generic multi-key specialisation, though only because I saw little point, having already covered by far the most common cases. While you're at it, I'd like to suggest that you perform a benchmark on a multi-key specialisation, so we can see just what we're throwing away before we do so. Better to have those numbers come from you. I continue to maintain that the most appropriate course of action is to provisionally commit all specialisations. If it's hard to know what effect this is going to have on real workloads, let's defer to beta testers, who presumably try the new release out with their applications. It's a question you could squarely put to them, without gradually rolling back from that initial position being much of a problem. The mysql-server package is 45 MB on Fedora 16. That "1% of the Postgres binary" figure is for my earlier patch with btree specialisations, right? I'm not asking you to look at that right now. I also don't think that "where do we eventually draw the line with specialisations like this in Postgres generally?" is a question that you should expect me to answer, though I will say that we should look at each case on its merits. I have not totally denied binary bloat costs. I have attempted to quantify them, while acknowledging that such a task is difficult, as was evident from the fact that Robert wasn't surprised that I could not demonstrate any regression. Granted, my definition of "no regression" is that there is very clearly no net loss in performance at some reasonable granularity, which is a very practical definition. You can quite easily contrive a case that HOT handles really badly. Some people did, I believe, but HOT won out because it was clearly very useful in the real world. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
On 02/08/2012 11:20 AM, Tom Lane wrote: I am going to go ahead and commit the patch with the altered json results, because IMO it is mere accident that these regression cases were coming out with nice field labels anyway. When and if Andrew gets the RowExpr cases fixed properly, that will show up as these cases going back to nicer-looking output. I take it this is your way of making me hurry up with that work :-) cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Enable min/max optimization for bool_and/bool_or/every
Marti Raudsepp ma...@juffo.org writes: On Thu, Dec 22, 2011 at 18:41, Robert Haas robertmh...@gmail.com wrote: On Mon, Dec 19, 2011 at 5:16 AM, Marti Raudsepp ma...@juffo.org wrote: PS: It seems that the min/max optimization isn't documented in the manual (apart from release notes), so I didn't include any doc changes in this patch. I don't see a patch attached to this email, so either you forgot to attach it, or the list ate it somehow. I forgot to attach it, sorry. Here it is. I applied this patch, since I was busy applying catalog changes from you anyway ;-). I did think of a possible reason to reject the patch: with this change, the planner will take longer to plan queries involving bool_and() et al, since planagg.c will spend time looking (usually fruitlessly) for an index-based plan. I tried this simple test case: create table t (f1 bool); \timing explain select bool_and(f1) from t; Best-case timings for the EXPLAIN were about 0.480 ms without the patch and 0.500 ms with, so about a 4% penalty. On more complicated queries I think the fractional cost would be less. This seemed acceptable to me, so I went ahead and applied the change, but if anyone wants to argue about it now's the time. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
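For context, the patch maps bool_and() to min() semantics and bool_or() to max() (false sorts before true), which is what lets planagg.c consider an index-based plan. A sketch of the effect, with a hypothetical index:

    CREATE TABLE t (f1 bool);
    CREATE INDEX t_f1_idx ON t (f1);    -- hypothetical

    EXPLAIN SELECT bool_and(f1) FROM t;
    -- With the patch, planagg.c may rewrite this the same way it rewrites
    -- min(f1), i.e. roughly:
    --   SELECT f1 FROM t WHERE f1 IS NOT NULL ORDER BY f1 ASC LIMIT 1;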
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
Andrew Dunstan and...@dunslane.net writes: On 02/08/2012 11:20 AM, Tom Lane wrote: I am going to go ahead and commit the patch with the altered json results, because IMO it is mere accident that these regression cases were coming out with nice field labels anyway. When and if Andrew gets the RowExpr cases fixed properly, that will show up as these cases going back to nicer-looking output. I take it this is your way of making me hurry up with that work :-) Well, right now the behavior is more consistent than it was before; we would surely have gotten questions about it as it was. Whether you find time to improve it or not isn't my concern at the moment. BTW, the order of the items in the json output doesn't match the column order of the sub-select, with or without the patch. This seems ... odd. Is it intentional? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Wed, Feb 8, 2012 at 10:59 AM, Bruce Momjian br...@momjian.us wrote: On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote: On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote: IMO this patch is already well past the point of diminishing returns in value-per-byte-added. I'd like to see it trimmed back to provide a fast path for just single-column int4/int8/float4/float8 sorts. The other cases aren't going to offer enough of a win to justify the code space. I'm curious about how much we're gaining from the single-column specializations vs. the type-specific specializations. I think I'm going to go try to characterize that. Yes, please. That would be a big help. Is there no optimization for strings? I assume they are sorted a lot. They are, but the value of the optimization is expected to diminish for more complex comparators. I did a quick test of this using the same setup I mentioned upthread: create table nodups (g) as select (g%10)*1000+g/10 from generate_series(1,1) g; This was the actual test query: select * from nodups order by g offset 10001; I did a ten-second warmup after starting each postmaster, followed by three one-minute test runs: master: tps = 298.824321 (including connections establishing) tps = 301.741619 (including connections establishing) tps = 302.238016 (including connections establishing) Peter's patch, but with the type-specific optimizations disabled, so using just the single-sortkey optimizations: tps = 357.553690 (including connections establishing) tps = 358.765875 (including connections establishing) tps = 358.825051 (including connections establishing) Peter's patch, intact: tps = 394.566994 (including connections establishing) tps = 394.228667 (including connections establishing) tps = 394.492677 (including connections establishing) So that's about a 19% speedup for just the single sort-key optimizations, and about a 30% speedup in total. Compared to a baseline that includes the single sort-key optimizations, the additional type-specific optimization is buying about 10% on this test. I tried changing the test query to return the data, like this: select * from nodups order by g; master: tps = 181.301129 (including connections establishing) tps = 180.348333 (excluding connections establishing) tps = 179.600825 (including connections establishing) Peter's patch, but with the type-specific optimizations disabled, so using just the single-sortkey optimizations: tps = 201.728269 (including connections establishing) tps = 198.022527 (including connections establishing) tps = 199.663082 (including connections establishing) Peter's patch, intact: tps = 214.037387 (including connections establishing) tps = 212.895145 (including connections establishing) tps = 211.596558 (including connections establishing) So, with the overhead of actually returning the sorted data, we get 10% from the generic single-sortkey optimizations and 18% from the two combined. Compared to a baseline that includes the single-sortkey optimizations, the incremental improvement from adding the type-specific optimizations is about 6.6%. It would be less on a wider table, presumably.
I also tested a two-column sort: create table twocol (a, b) as select g%101, g%17 from generate_series(1,1) g; select * from twocol order by a, b offset 1; master: tps = 270.218604 (including connections establishing) tps = 265.634964 (including connections establishing) tps = 268.520805 (including connections establishing) Peter's patch, intact: tps = 285.528626 (including connections establishing) tps = 285.140345 (including connections establishing) tps = 282.388457 (including connections establishing) So here the type-specific optimizations are buying us even less: only about 6%, and that's with no client overhead. And I tested float8: create table f8 (g) as select random() from generate_series(1,1); select * from f8 order by g offset 1; master: tps = 228.373239 (including connections establishing) tps = 225.117704 (including connections establishing) tps = 225.196197 (including connections establishing) Peter's patch, but with the type-specific optimizations disabled, so using just the single-sortkey optimizations: tps = 263.060820 (including connections establishing) tps = 262.788926 (including connections establishing) tps = 263.041186 (including connections establishing) Peter's patch, intact: tps = 283.274797 (including connections establishing) tps = 280.597965 (including connections establishing) tps = 280.151349 (including connections establishing) That's +17% for the single-sortkey stuff, +25% for everything, and the incremental improvement of the type-specific optimizations over the generic single-key optimizations is about 6.7%. It seems clear that the single sort-key optimizations are a much better value per byte of code than the type-specific optimizations. Ignoring client overhead, we get between half and two-thirds of the benefit, and it costs us just one extra copy of the quicksort logic for all data types.
Re: [HACKERS] Vacuum rate limit in KBps
On Wed, Feb 8, 2012 at 11:22 AM, Bruce Momjian br...@momjian.us wrote: What we have now just isn't cutting it for 99% of our users, and we need to address that if we are going to ever make any real headway here. Why can't vacuum handle things automatically like checkpoint smoothing? Why can't it detect when it is falling behind and speed up? Why can't it see a busy background writer and slow down? Unless we answer these questions, we are not solving the problem for 99% of our users. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
Robert Haas robertmh...@gmail.com writes: [ lots of numbers ] ... I just can't get excited about that. However, I find the single-key optimizations much more compelling, for the reasons stated above, and feel we ought to include those. This conclusion seems sound to me, for the reasons you stated and one more: optimizations for a single sort key are going to be applicable to a very wide variety of queries, whereas all the other cases are necessarily less widely applicable. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for parallel pg_dump
On Tue, Feb 7, 2012 at 10:21 PM, Joachim Wieland j...@mcknight.de wrote: On Tue, Feb 7, 2012 at 4:59 PM, Robert Haas robertmh...@gmail.com wrote: It turns out that (as you anticipated) there are some problems with my previous proposal. I assume you're talking to Tom, as my powers of anticipation are actually quite limited... :-) No, I was talking to you, actually. You earlier suggested that an ArchiveHandle was only meant to contain a single PGconn, and it seems you're right. We can refactor that assumption away, but not trivially. In my patch I dealt with exactly the same problem for the error handler by creating a separate function that has a static variable (a pointer to the ParallelState). The value is set and retrieved through the same function, so yes, it's kinda global but then again it can only be accessed from this function, which is only called from the error handler. How did you make this thread-safe? I'm starting to think it might make sense to press on with this refactoring just a bit further and eliminate the distinction between Archive and ArchiveHandle. How about doing more refactoring after applying the patch, you'd then see what is really needed and then we'd also have an actual use case for more than one connection (You might have already guessed that this proposal is heavily influenced by my self-interest of avoiding too much work to make my patch match your refactoring)... Well, I don't think your patch is going to be too heavily affected either way, because most of your changes were not in pg_dump.c, and the need for any changes at all in pg_dump.c should rapidly be going away. At any rate, the problem with stopping here is that g_conn is still floating around, and one way or the other we've got to get rid of it if you want to have more than one ArchiveHandle floating around at a time. So we at least need to press on far enough to get to that point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 31.01.2012 17:35, Fujii Masao wrote: On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 20.01.2012 15:32, Robert Haas wrote: On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Here's another version of the patch to make XLogInsert less of a bottleneck on multi-CPU systems. The basic idea is the same as before, but several bugs have been fixed, and lots of misc. clean up has been done. This seems to need a rebase. Here you go. The patch seems to need a rebase again. Here you go again. It conflicted with the group commit patch, and the patch to WAL-log and track changes to full_page_writes setting. After applying this patch and then forcing crashes, upon recovery the database is not correct. If I make a table with 10,000 rows and then after that intensively update it using a unique key: update foo set count=count+1 where foobar=? Then after the crash there are fewer than 10,000 visible rows: select count(*) from foo This is not a subtle thing, it happens every time. I get counts of between 1973 and 8827. Without this patch I always get exactly 10,000. I don't really know where to start on tracking this down. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
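A sketch of the reproduction Jeff describes, with assumed table definitions (the original message elides them) and 9.1-era pgbench syntax:

    -- setup (names and types assumed from the description):
    CREATE TABLE foo (foobar int PRIMARY KEY, count int NOT NULL DEFAULT 0);
    INSERT INTO foo SELECT g, 0 FROM generate_series(1, 10000) g;

    -- update.sql, run via e.g. "pgbench -n -f update.sql -c 8 -T 60";
    -- crash the server mid-run (e.g. kill -9 the postmaster), then restart:
    \setrandom key 1 10000
    UPDATE foo SET count = count + 1 WHERE foobar = :key;

    -- after crash recovery, all 10,000 rows should still be visible:
    SELECT count(*) FROM foo;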
pg_receivexlog and sync rep Re: [HACKERS] Updated version of pg_receivexlog
On Tue, Oct 25, 2011 at 7:37 PM, Magnus Hagander mag...@hagander.net wrote: On Mon, Oct 24, 2011 at 14:40, Magnus Hagander mag...@hagander.net wrote: On Mon, Oct 24, 2011 at 13:46, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: How does this interact with synchronous replication? If a base backup that streams WAL is in progress, and you have synchronous_standby_names set to '*', I believe the in-progress backup will count as a standby for that purpose. That might give a false sense of security. Ah yes. Did not think of that. Yes, it will have this problem. Actually, thinking more, per other mail, it won't. Because it will never report that the data is synced to disk, so it will not be considered a sync standby. Now, a new replication mode (synchronous_commit = write) is supported. In this mode, the in-progress backup will be considered a sync standby because its periodic status report includes a valid write position. We should change the report so that it includes only invalid positions. Patch attached. While I agree that the backup should not behave as a sync standby, ISTM that pg_receivexlog should, which is very useful. If pg_receivexlog does so, we can write WAL synchronously both locally and remotely, which would increase the durability of the system. Thus, to allow pg_receivexlog to behave as a sync standby, we should change it so that its status report includes the write and flush positions? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center invalid_write_pos_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
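The attached patch's idea, as described, amounts to something like the following sketch (field names are approximate, not the actual status-message layout):

    /* A base backup's periodic status report carries no valid positions,
     * so the walsender never sees it as having written or flushed WAL
     * and cannot choose it as a synchronous standby. */
    reply.write = InvalidXLogRecPtr;
    reply.flush = InvalidXLogRecPtr;
    reply.apply = InvalidXLogRecPtr;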
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On 8 February 2012 17:58, Robert Haas robertmh...@gmail.com wrote: It seems clear that the single sort-key optimizations are a much better value per byte of code than the type-specific optimizations. Ignoring client overhead, we get between half and two-thirds of the benefit, and it costs us just one extra copy of the quicksort logic for all data types. That was clear from an early stage, and is something that I acknowledged way back in September - it's pretty obvious that chasing diminishing returns is the nature of this sort of thing, which is fine, provided that they are not chased beyond some point that doesn't make sense. However, I do not see why we should not at least include full integer specialisations as well (for single keys at least), given that we get the additional, not inconsiderable improvements. I do not accept the premise that we need to find the optimal bytes-added-to-performance-increase ratio, because, as I've already stressed, adding bytes to the application binary does not have a linear cost. Here's what might be a good compromise: singlekey generic, singlekey int4, singlekey int8, multikey generic. That is a total of 4 contiguous copies of the qsort logic, because we've obsoleted the original qsort_arg (except for btree index sorting, which likely matters far less). We get the most benefit for 5 very common types - int4, int8, date, timestamp and timestamptz. A 30% improvement has to be better than the 19% speedup for just the single sort-key optimisations, considering that in practice these types are very commonly used. I think that there may be additional benefits from making the qsort_arg specialisation look less like a C stdlib one, like refining the swap logic to have compile-time knowledge of the type it is sorting. I'm thinking that we could usefully trim quite a bit from this:

    #define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
        (es) % sizeof(long) ? 2 : (es) == sizeof(long) ? 0 : 1;

    #define thistype_vecswap(a, b, n) \
        if ((n) > 0) inl_swapfunc((a), (b), (size_t)(n), swaptype)

    #define thistype_swap(a, b) \
        if (swaptype == 0) { \
            long t = *(long *)(void *)(a); \
            *(long *)(void *)(a) = *(long *)(void *)(b); \
            *(long *)(void *)(b) = t; \
        } else \
            inl_swapfunc(a, b, es, swaptype)

    inline static void
    inl_swapfunc(char *a, char *b, size_t n, int swaptype)
    {
        if (swaptype <= 1)
            swapcode(long, a, b, n);
        else
            swapcode(char, a, b, n);
    }

-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
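By way of illustration, a sketch (hypothetical macro, not part of the patch) of where that trimming could lead: once a specialisation knows at compile time that it is sorting fixed-size elements, such as tuplesort's SortTuple, the runtime swaptype dispatch above collapses to a plain assignment:

    /* Hypothetical sketch: swap for a compile-time-known element type.
     * No SWAPINIT, no swaptype dispatch, no byte-at-a-time fallback. */
    #define THISTYPE_SWAP(a, b) \
        do { \
            SortTuple _tmp = *(a); \
            *(a) = *(b); \
            *(b) = _tmp; \
        } while (0)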
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Jay Levitt jay.lev...@gmail.com writes: [Posted at Andres's request] TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in various builds. Not sure yet about most of these, but I know the reason for this one: 2. In both 9.1 and 9.2, there is a long delay before CREATE INDEX realizes it can't work on an unlogged table That error is thrown in gistbuildempty, which is not called until after we have finished building the main-fork index. This is a tad unfriendly when the table already contains lots of data. ISTM there are two ways we could fix this: 1. Introduce a duplicative test at the start of gistbuild(). 2. Rearrange the order of operations in index_build() so that the init fork is made first. Both of these are kinda ugly, but #2 puts the ugliness into someplace that shouldn't have to know about it, and furthermore someplace that's unlikely to get reverted if/when gist is fixed to not have this problem. So I think I favor #1. Other opinions anyone? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
On Wed, Feb 8, 2012 at 2:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jay Levitt jay.lev...@gmail.com writes: [Posted at Andres's request] TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in various builds. Not sure yet about most of these, but I know the reason for this one: 2. In both 9.1 and 9.2, there is a long delay before CREATE INDEX realizes it can't work on an unlogged table That error is thrown in gistbuildempty, which is not called until after we have finished building the main-fork index. This is a tad unfriendly when the table already contains lots of data. ISTM there are two ways we could fix this: 1. Introduce a duplicative test at the start of gistbuild(). 2. Rearrange the order of operations in index_build() so that the init fork is made first. Both of these are kinda ugly, but #2 puts the ugliness into someplace that shouldn't have to know about it, and furthermore someplace that's unlikely to get reverted if/when gist is fixed to not have this problem. So I think I favor #1. Other opinions anyone? I don't think I understand your objection to #2. It appears to be a trivial rearrangement? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Robert Haas robertmh...@gmail.com writes: On Wed, Feb 8, 2012 at 2:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: ISTM there are two ways we could fix this: 1. Introduce a duplicative test at the start of gistbuild(). 2. Rearrange the order of operations in index_build() so that the init fork is made first. Both of these are kinda ugly, but #2 puts the ugliness into someplace that shouldn't have to know about it, and furthermore someplace that's unlikely to get reverted if/when gist is fixed to not have this problem. So I think I favor #1. Other opinions anyone? I don't think I understand your object to #2. It appears to be a trivial rearrangement? Yeah, but then we are wiring into index_build the idea that ambuildempty is more important, or more likely to throw an error, or something, than ambuild is. It seems weird. And fragile, since somebody could decide to re-order those two steps again for reasons unrelated to gist. Basically, I think this problem is gist's to deal with and so that's where the fix should be. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On 8 February 2012 18:48, Peter Geoghegan pe...@2ndquadrant.com wrote: I think that there may be additional benefits from making the qsort_arg specialisation look less like a C stdlib one, like refining the swap logic to have compile-time knowledge of the type it is sorting. I'm thinking that we could usefully trim quite a bit from this: It seems that while we cannot get any better performance, no doubt because the code is already subject to all manner of compiler optimisations, including perhaps constant propagation, we can simplify this part of the code quite a bit. That seems fairly incidental, though. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
On Wed, Feb 8, 2012 at 2:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Wed, Feb 8, 2012 at 2:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: ISTM there are two ways we could fix this: 1. Introduce a duplicative test at the start of gistbuild(). 2. Rearrange the order of operations in index_build() so that the init fork is made first. Both of these are kinda ugly, but #2 puts the ugliness into someplace that shouldn't have to know about it, and furthermore someplace that's unlikely to get reverted if/when gist is fixed to not have this problem. So I think I favor #1. Other opinions anyone? I don't think I understand your objection to #2. It appears to be a trivial rearrangement? Yeah, but then we are wiring into index_build the idea that ambuildempty is more important, or more likely to throw an error, or something, than ambuild is. It seems weird. And fragile, since somebody could decide to re-order those two steps again for reasons unrelated to gist. I guess. I think the compelling reason to do ambuildempty first is that it's fast. So might as well. I think you're just going to end up hard-wiring the assumption that ambuild happens before ambuildempty, which doesn't seem any better than the other way around, but I don't care enough to argue about it if you feel strongly about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
On Tue, Feb 7, 2012 at 11:26 PM, Jay Levitt jay.lev...@gmail.com wrote: [*] psql:slowcube.sql:20: ERROR: node buffer of page being split (121550) does not exist This looks like a bug in the buffering GiST index build I implemented during my GSoC 2011 project. It looks especially strange with the following setting: effective_cache_size = 3734MB, because the buffering GiST index build just shouldn't turn on in this case, when the index fits in cache. I'm going to take a detailed look at this. -- With best regards, Alexander Korotkov.
Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!
Andrew Dunstan and...@dunslane.net writes: OK, in this version we simply suppress creation of the TableDataInfo object if it's not wanted. I applied this with some further adjustments. I actually removed the code from makeTableDataInfo - there are only two places it gets called and doing it in those two spots seemed a bit cleaner. After study it seemed to me that this was the wrong direction to take. It was already the case that the config-table path was skipping the filters in getTableData(), none of which seem to be good for it to skip other than the one on the table's schema-dump flag. So I've refactored the code to put all those filters into makeTableDataInfo where they will be applied to both the normal and config-table cases. Now, back to the original subject of this thread: both HEAD and 9.1 are now operating as designed, in that they will dump the (user-provided portion of the) contents of an extension config table whenever that extension is dumped, even if --schema-only is specified. Do we want to change that? I'm not convinced that it's something to mess with lightly. I could possibly support a definition that says that we omit such data in --schema-only mode, except that I'm not sure what is sensible for rows that were modified (not inserted) by user actions. Also, we probably ought to get some input from the postgis people, because after all this entire feature was designed for their schema auxiliary tables. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
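For context on the mechanism under discussion: an extension's install script flags a configuration table with pg_extension_config_dump(), optionally passing a filter that selects only the user-provided rows (table and column names below are hypothetical):

    CREATE TABLE my_config (
        key            text PRIMARY KEY,
        value          text,
        standard_entry boolean NOT NULL DEFAULT false
    );

    -- Ask pg_dump to dump only rows the user added or changed:
    SELECT pg_catalog.pg_extension_config_dump('my_config',
                                               'WHERE NOT standard_entry');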
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Robert Haas robertmh...@gmail.com writes: I guess. I think the compelling reason to do ambuildempty first is that it's fast. So might as well. I think you're just going to end up hard-wiring the assumption that ambuild happens before ambuildempty, Well, no, because I'm proposing that both functions throw this error. which doesn't seem any better than the other way around, but I don't care enough to argue about it if you feel strongly about it. What's ugly about this solution is the duplicative ereport calls. But at least the ugliness is confined to its source, ie gist. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
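A sketch of option #1 as settled on here: duplicate in gistbuild() the check that gistbuildempty() already makes, so the error is thrown before the main-fork index is built (assuming the existing 9.1 error wording carries over):

    Datum
    gistbuild(PG_FUNCTION_ARGS)
    {
        Relation    index = (Relation) PG_GETARG_POINTER(1);

        /* Fail fast, before doing any of the expensive build work. */
        if (index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("unlogged GiST indexes are not supported")));

        /* ... normal index build proceeds here ... */
    }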
Re: [HACKERS] [PATCH] Enable min/max optimization for bool_and/bool_or/every
On Wed, Feb 8, 2012 at 19:48, Tom Lane t...@sss.pgh.pa.us wrote: I applied this patch, since I was busy applying catalog changes from you anyway ;-). Thanks :) I did think of a possible reason to reject the patch: with this change, the planner will take longer to plan queries involving bool_and() et al, since planagg.c will spend time looking (usually fruitlessly) for an index-based plan. Good point, I should have done those measurements up front. Anyway, since I've often noticed \timing to be unreliable for short queries, I decided to retry your test with pgbench. Long story short, I measured 27% overhead in the un-indexed column case and 33% overhead for an indexed column. That's a lot more than I expected. I even rebuilt and retried a few times to make sure I hadn't botched something. The benchmark script is attached. UNPATCHED select bool_and(b) from unindexed; tps = 13787.023113 (excluding connections establishing) tps = 13880.484788 (excluding connections establishing) tps = 13784.654542 (excluding connections establishing) select bool_and(b) from indexed; tps = 12536.650703 (excluding connections establishing) tps = 12647.767993 (excluding connections establishing) tps = 12500.956407 (excluding connections establishing) PATCHED select bool_and(b) from unindexed; tps = 10096.834678 (excluding connections establishing) tps = 10110.182425 (excluding connections establishing) tps = 10103.904500 (excluding connections establishing) select bool_and(b) from indexed; tps = 8373.631407 (excluding connections establishing) tps = 8659.917173 (excluding connections establishing) tps = 8473.899896 (excluding connections establishing) Regards, Marti bench_bool_and.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
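The attached script is not reproduced in the archive; a sketch of a setup along the lines described (object names assumed):

    createdb bench
    psql bench -c "CREATE TABLE unindexed AS SELECT true AS b FROM generate_series(1,1000) g"
    psql bench -c "CREATE TABLE indexed   AS SELECT true AS b FROM generate_series(1,1000) g"
    psql bench -c "CREATE INDEX ON indexed (b)"
    echo "select bool_and(b) from unindexed;" > q.sql
    pgbench -n -f q.sql -T 30 bench     # repeat with the indexed table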
Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace
Hi, We have some tables with documents and their metadata (filename, filetype, etc.). Most of the time we just want to get the metadata (filename, filetype, etc.) and not the document itself. In this case it would be nice to have the metadata (which wouldn't end up in TOAST) on a fast array (SSD-based), and put the file data on a slow array (HDD-based). It's probably possible to have two tables, one for the metadata and one for the file data, but this is a workaround. -- Mvh Christian Nicolaisen Deltasoft AS Tlf +47 938 38 596 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
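The two-table workaround mentioned at the end, sketched out (tablespace locations and names assumed):

    CREATE TABLESPACE fast LOCATION '/mnt/ssd/pgdata';
    CREATE TABLESPACE slow LOCATION '/mnt/hdd/pgdata';

    CREATE TABLE doc_meta (
        id       serial PRIMARY KEY,
        filename text,
        filetype text
    ) TABLESPACE fast;

    CREATE TABLE doc_data (
        id       int PRIMARY KEY REFERENCES doc_meta (id),
        filedata bytea     -- TOAST data stays in the table's own tablespace
    ) TABLESPACE slow;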
Re: [HACKERS] Core Extensions relocation
Greg, This is currently awaiting a check by gsmith that the 7 named extensions do not add any new dependencies. Are you going to investigate this? If not, I'll give it a try this weekend. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Jay Levitt jay.lev...@gmail.com writes: [Posted at Andres's request] TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in various builds. 1. In 9.1.2, inserting 10x rows takes 19x the time. - 9.1-HEAD and 9.2 fix this; it now slows down linearly - but: 10s 8s 5s! - but: comparing Ubuntu binary w/vanilla source build on virtual disks, might not be significant FWIW, I find it really hard to believe that there is any real difference between 9.1.2 and 9.1 branch tip on this. There have been no significant changes in either the gist or contrib/cube code in that branch. I suspect you have a measurement issue there. On my not-at-all-virtual Fedora 16 workstation, with 9.1 tip, your test case shows index build times of 100000 rows 3650 ms, 1000000 rows 48400 ms, 10000000 rows 1917800 ms, which confirms the nonlinear scaling in 9.1, though I'm not sure it's not just running out of RAM and having to do a lot of I/O in the last case. (This is an entirely untuned debug build, which probably doesn't help.) It's hard to guess how much available RAM you were working with on your box -- mine's got 4GB. 2. In both 9.1 and 9.2, there is a long delay before CREATE INDEX realizes it can't work on an unlogged table Fixed. 3. In 9.2, creating the 10-million-row index always fails As Alexander noted, this is probably a bug in his recent patch. We'll look at it. (I duplicated it here, so it's plenty real.) 4. 9.1-HEAD never successfully indexes 10 million rows (never = at least 20 minutes on two runs; I will follow up in a few hours) Works for me (see above), though it's slower than you might've expected. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgstat documentation tables
On Wed, Feb 8, 2012 at 17:13, Bruce Momjian br...@momjian.us wrote: On Tue, Feb 07, 2012 at 11:29:12PM -0500, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: How do people feel about pulling text out of the SGML docs and loading it into the database as table and column comments? I'm not thrilled by that proposal. The length limits on comments are very much shorter than what is sensible to use in catalogs.sgml (at least if you want the comments to look passable in \d displays). And I don't want people trying to serve two different use-cases when they write those. I was thinking of just grabbing the first X characters from the SGML comments. So that it wouldn't necessarily even make for a complete sentence? That seems like a really bad idea... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 29, 2012 at 6:04 PM, Simon Riggs si...@2ndquadrant.com wrote: On Sun, Jan 29, 2012 at 9:41 PM, Jeff Janes jeff.ja...@gmail.com wrote: If I cast to an int, then I see advancement: I'll initialise it as 0, rather than -1, and then we don't have a problem in any circumstance. I've specifically designed the pgbench changes required to simulate conditions of clog contention to help in the evaluation of this patch. Yep, I've used that one for the testing. Most of the current patch is just bookkeeping to keep track of the point when we can look at history in a read-only manner. I've isolated the code better to allow you to explore various implementation options. I don't see any performance difference between any of them really, but you're welcome to look. Please everybody note that the clog history doesn't even become active until the first checkpoint, so this is dead code until we've hit the first checkpoint cycle and completed a million transactions since startup. So it's designed to tune for real-world situations, and is not easy to benchmark. (Maybe we could start earlier, but having extra code just for the first few minutes seems a waste of energy, especially since we must hit a million xids also). I find that this version does not compile: clog.c: In function ‘TransactionIdGetStatus’: clog.c:431: error: ‘clog’ undeclared (first use in this function) clog.c:431: error: (Each undeclared identifier is reported only once clog.c:431: error: for each function it appears in.) Given that, I obviously cannot test this at this point, but let me go ahead and theorize about how well it's likely to work. What Tom suggested before (and after some reflection I think I believe it) is that the frequency of access will be highest for the newest CLOG page and then drop off for each page further back you go. Clearly, if that drop-off is fast - e.g. each buffer further backward is half as likely to be accessed as the next newer one - then the fraction of accesses that will hit pages that are far enough back to benefit from this optimization will be infinitesimal; 1023 out of every 1024 accesses will hit the first ten pages, and on a high-velocity system those all figure to have been populated since the last checkpoint. The best case for this patch should be an access pattern that involves a very long tail; actually, pgbench is a pretty good fit for that, assuming the scale factor is large enough. For example, at scale factor 100, we've got 10,000,000 tuples: choosing one at random, we're almost exactly 90% likely to find one that hasn't been chosen in the last 1,048,576 tuples (i.e. 32 CLOG pages @ 32K txns/page). In terms of reducing contention on the main CLOG SLRU, that sounds pretty promising, but depends somewhat on the rate at which transactions are processed relative to the frequency of checkpoints, since that will affect how many pages back you have to go to use the history path. However, there is a potential fly in the ointment: in other cases in which we've reduced contention at the LWLock layer, we've ended up with very nasty contention at the spinlock layer that can sometimes eat more CPU time than the LWLock contention did. In that light, it strikes me that it would be nice to be able to partition the contention N ways rather than just 2 ways. I think we could do that as follows. Instead of having one control lock per SLRU, have N locks, where N is probably a power of 2. Divide the buffer pool for the SLRU N ways, and decree that each slice of the buffer pool is controlled by one of the N locks.
Route all requests for a page P to slice P mod N. Unlike this approach, that wouldn't completely eliminate contention at the LWLock level, but it would reduce it in proportion to the number of partitions, and it would reduce spinlock contention according to the number of partitions as well. A downside is that you'll need more buffers to get the same hit rate, but this proposal has the same problem: it doubles the amount of memory allocated for CLOG. Of course, this approach is all vaporware right now, so it's anybody's guess whether it would be better than this if we had code for it. I'm just throwing it out there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
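A sketch of the N-way partitioning idea (all names hypothetical; the 9.2-era LWLock API is assumed):

    #define NUM_CLOG_PARTITIONS 4           /* N, a power of 2 */

    static LWLockId CLogPartitionLocks[NUM_CLOG_PARTITIONS];

    /* Route all requests for page P to slice P mod N. */
    static int
    CLogPartition(int pageno)
    {
        return pageno % NUM_CLOG_PARTITIONS;
    }

    /* e.g. in TransactionIdGetStatus(): */
    LWLockAcquire(CLogPartitionLocks[CLogPartition(pageno)], LW_EXCLUSIVE);
    /* ... find or read in the page within this slice's buffers ... */
    LWLockRelease(CLogPartitionLocks[CLogPartition(pageno)]);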
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: That was clear from an early stage, and is something that I acknowledged way back in September OK, so why didn't/don't we do and commit that part first, and then proceed to argue about the remainder once it's in? I think that there may be additional benefits from making the qsort_arg specialisation look less like a c stdlib one, like refining the swap logic to have compile-time knowledge of the type it is sorting. I'm thinking that we could usefully trim quite a bit from this: That's an interesting idea, which seems worth pursuing, though possibly not for 9.2. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] double writes using double-write buffer approach [WIP]
Is there any problem if the double-write happens only via the bgwriter or checkpointer? Something like: whenever a backend process has to evict a buffer, it will do the same as you have described, that is, write to a double-write buffer, but the bgwriter will check this double-write buffer and flush from it. Also, whenever any backend sees that the double-write buffer is more than 2/3rds (or some threshold value) full, it will tell the bgwriter to flush from the double-write buffer. This can ensure very little I/O by any backend. Yes, I think this is a good idea. I could make changes so that the backends hand off the responsibility to flush batches of the double-write buffer to the bgwriter whenever possible. This would avoid some long IO waits in the backends, though the backends may of course eventually wait anyway for the bgwriter if IO is not fast enough. I did write the code so that any process can write a completed batch if the batch is not currently being flushed (so as to deal with crashes by backends). Having the backends flush the batches as they fill them up was just simpler for a first prototype. Dan - Original Message - From: Amit Kapila amit.kap...@huawei.com To: Dan Scales sca...@vmware.com, PG Hackers pgsql-hackers@postgresql.org Sent: Tuesday, February 7, 2012 1:08:49 AM Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP] I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions. I don't understand this point, because from the data in your mail, it appears that when shared buffers are smaller, meaning more evictions can happen, the performance is lower. ISTM that the performance is lower in the case where the shared buffers size is smaller because I/O might be done by the backend process, which can degrade performance. Is there any problem if the double-write happens only via the bgwriter or checkpointer? Something like: whenever a backend process has to evict a buffer, it will do the same as you have described, that is, write to a double-write buffer, but the bgwriter will check this double-write buffer and flush from it. Also, whenever any backend sees that the double-write buffer is more than 2/3rds (or some threshold value) full, it will tell the bgwriter to flush from the double-write buffer. This can ensure very little I/O by any backend. -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dan Scales Sent: Saturday, January 28, 2012 4:02 AM To: PG Hackers Subject: [HACKERS] double writes using double-write buffer approach [WIP] I've been prototyping the double-write buffer idea that Heikki and Simon had proposed (as an alternative to a previous patch that only batched up writes by the checkpointer). I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions. It also centralizes most of the code change in smgr.c. However, it is trickier to reason about. The idea is that all page writes generally are copied to a double-write buffer, rather than being immediately written. Note that a full copy of the page is required, but can be folded in with a checksum calculation. Periodically (e.g. every time a certain-size batch of writes has been added), some writes are pushed out using double writes -- the pages are first written and fsynced to a double-write file, then written to the data files, which are then fsynced. Double writes then allow for fixing torn pages, so full_page_writes can be turned off (thus greatly reducing the size of the WAL log). The key changes are conceptually simple: 1.
In smgrwrite(), copy the page to the double-write buffer. If a big enough batch has accumulated, then flush the batch using double writes. [I don't think I need to intercept calls to smgrextend(), but I am not totally sure.] 2. In smgrread(), always look first in the double-write buffer for a particular page, before going to disk. 3. At the end of a checkpoint and on shutdown, always make sure that the current contents of the double-write buffer are flushed. 4. Pass flags around in some cases to indicate whether a page buffer needs a double write or not. (I think eventually this would be an attribute of the buffer, set when the page is WAL-logged, rather than a flag passed around.) 5. Deal with duplicates in the double-write buffer appropriately (very rarely happens). To get good performance, I needed to have two double-write buffers, one for the checkpointer and one for all other processes. The double-write buffers are circular buffers. The checkpointer double-write buffer is just a single batch of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches of 64 pages each. Each batch goes to a different double-write file, so that they can be issued independently as soon as each batch is completed. Also, I need to sort the buffers being checkpointed by
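A pseudocode sketch of the write path described above (function names are illustrative, not the patch's actual API):

    /* Illustrative pseudocode only. */
    void
    dw_smgrwrite(Page page)
    {
        dw_buffer_append_copy(page);     /* full page copy; checksum can be
                                          * folded in here */
        if (dw_batch_complete())         /* e.g. 64 pages accumulated */
        {
            write_and_fsync(dw_file);    /* 1: batch to the double-write file */
            write_pages_to_datafiles();  /* 2: same pages to their data files */
            fsync_datafiles();           /* 3: make the data files durable */
            dw_batch_reset();            /* on crash, a torn data-file page is
                                          * repaired from dw_file at recovery */
        }
    }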
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On 8 February 2012 23:33, Robert Haas robertmh...@gmail.com wrote: On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: That was clear from an early stage, and is something that I acknowledged way back in September OK, so why didn't/don't we do and commit that part first, and then proceed to argue about the remainder once it's in? I have no objection to that. I'm not sure that it's going to be possible to agree to anything beyond what has already been agreed. We don't seem to be covering new ground. What would it take to convince you to be more inclusive, and have at least a few full specialisations, in particular, int4 and int8 single key full specialisations? I'm sure that you'll agree that they're the second most compelling case. There doesn't seem to be a standard that I can meet that you can even /describe/ to have those additional specialisations committed, which is a shame, since they appear to be pretty decent wins above and beyond the generic single key specialisation, all things considered. I suppose that the standard refrain here is it's your patch; you must prove the case for committing it. That would be fine by me, but this isn't just about me, and it seems to be a practical impossibility in this case. It may be quite impossible even with far greater resources than I can afford to apply here, since it seems like you're hinting at some kind of rigorous proof that this cannot cause a regression for a single client, even though in order to see that regression multiple other clients would have to be benefiting. I cannot provide such a proof, since it would probably have to consider all commercially available CPUs and all possible workloads - I doubt that anyone can. That being the case, I'm eager not to waste any more time on this. I bear no ill will; that just seems to be the situation that we find ourselves in. That said, if you can describe the standard, I'll try and meet it - you seem to be suggesting that months and months of benchmarks wouldn't really change things, since this is the first step on the road to binary bloat ruin. The only fundamentally new argument that I can offer is that by applying the standard that the community does here, it severely limits the number of performance improvements that can be added, and the aggregate effect is that we're quite certainly worse off. It's a pity this isn't a really easy decision for you to make, but that's the nature of these things, and we should expect that to increasingly be the case. Is it really so unacceptable for us to risk doing a little worse under some rare circumstances in order to do better under some common circumstances? Why should it be an easy decision - when are important decisions ever easy? I think that there may be additional benefits from making the qsort_arg specialisation look less like a c stdlib one, like refining the swap logic to have compile-time knowledge of the type it is sorting. I'm thinking that we could usefully trim quite a bit from this: That's an interesting idea, which seems worth pursuing, though possibly not for 9.2. Well, it's really just a code clean-up. I suspect that qsort_arg is a minimally modified version of the NetBSD one that wasn't necessarily thoroughly understood when it was initially added (not that I'm suggesting that it ought to have been). Then again, you might prefer to keep it as consistent as possible with qsort (the other copy of that sort function that we already have). 
-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for parallel pg_dump
On Wed, Feb 8, 2012 at 1:21 PM, Robert Haas robertmh...@gmail.com wrote: In my patch I dealt with exactly the same problem for the error handler by creating a separate function that has a static variable (a pointer to the ParallelState). The value is set and retrieved through the same function, so yes, it's kinda global but then again it can only be accessed from this function, which is only called from the error handler. How did you make this thread-safe? The ParallelState has a ParallelSlot for each worker process which contains that worker process's thread id. So a worker process just walks through the slots until it finds its own thread id and then goes from there. In particular, it only gets the file descriptors to and from the parent from this structure, to communicate the error it encountered, but it doesn't get the PGconn. This is because that error handler is only called when a worker calls die_horribly(AH, ...) in which case the connection is already known through AH. Termination via a signal just sets a variable that is checked in the I/O routines and there we also have AH to shut down the connection (actually to call die_horribly()). So we at least need to press on far enough to get to that point. Sure, let me know if I can help you with something. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
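A sketch of the scheme Joachim describes (names hypothetical; GetCurrentThreadId() stands in for whatever thread-identity call the Windows build uses):

    /* Set-and-get accessor: the pointer is static inside the function,
     * so it is only reachable from here. */
    static ParallelState *
    GetParallelState(ParallelState *set)
    {
        static ParallelState *pstate = NULL;

        if (set != NULL)
            pstate = set;
        return pstate;
    }

    /* Each worker finds its own slot by scanning for its thread id; the
     * slot holds the file descriptors for talking to the parent, but no
     * PGconn, since the connection always arrives via AH. */
    static ParallelSlot *
    GetMySlot(void)
    {
        ParallelState *pstate = GetParallelState(NULL);
        int            i;

        for (i = 0; i < pstate->numWorkers; i++)
            if (pstate->parallelSlot[i].threadId == GetCurrentThreadId())
                return &pstate->parallelSlot[i];
        return NULL;
    }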
Re: [HACKERS] Vacuum rate limit in KBps
On 02/08/2012 11:22 AM, Bruce Momjian wrote: Why can't vacuum handle things automatically like checkpoint smoothing? Why can't it detect when it is falling behind and speed up? Why can't it see a busy background writer and slow down? Unless we answer these questions, we are not solving the problem for 99% of our users. Scheduling VACUUM is a much harder problem than checkpoint smoothing. Checkpoints can define "ahead" and "behind" simply; defining what those terms mean for vacuum scheduling across a set of tables is tricky. A possible downside to not vacuuming hard enough is that you'll have tables reach their thresholds faster than you're cleaning them up. That's a future bad event that becomes possible if you don't do the right thing now. Predicting when that will happen is a whole different class of problem than tracking whether checkpoints are running on schedule. I have a design sketched out for something that adjusts the vacuum rate based on background writer activity. I'm not optimistic that will ever be automatic to the level you'd like here either. That looks like it will be prone to feedback loops where autovacuum can starve forever. If it ever lags enough that commonly accessed tables have become inefficiently stored, they will then generate more buffer I/O to access, and thus continue to block future vacuum work. That's the exact situation where vacuum is most needed, even though it seems too expensive to do. That's one trap people already fall into when doing this by hand. Everyone who's ever said "autovacuum uses too much I/O when it pops up unexpectedly during the day, I'm going to turn that off" has seen how that works out. Hilarity ensues when the inevitable and gigantic wraparound vacuums happen in the middle of the day instead. Just trying to set the expectations bar realistically here. I don't consider the easier problem of checkpoint smoothing a solved one, either. Saying autovacuum needs to reach even that level of automation to be a useful improvement over now is a slippery goal. Regardless, the simple idea I submitted to this CF is clearly dead for now. I'll take the feedback of "this level of change can live in a user-side tuning tool" and do that instead. Since I was already thinking in the direction of background activity monitoring, I have a good idea how I'd need to approach this next, to be more likely to gain community buy-in as an automated improvement. That's a longer-term project though, which I'll hopefully be able to revisit for 9.3. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql NUL record and field separator
At 2012-02-07 13:20:43 +0200, pete...@gmx.net wrote: Should we rename the options and/or add that to the documentation, or is the new behavior obvious and any new terminology would be too confusing? I agree there is potential for confusion either way. I tried to come up with a complete and not-confusing wording for all four options, but did not manage to improve on the current state of affairs significantly. I think it can stay the way it is. The reference to xargs -0 is probably enough to set the right expectations about how it works. We can always add a sentence later to clarify the special-case behaviour of -0 if anyone is actually confused (and the best wording will be more clear in that situation too). -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
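For reference, the intended usage pattern, assuming the options as committed (-0 / --record-separator-zero, combined with -A and -t):

    # NUL-terminated records survive filenames containing newlines:
    psql -At -0 -c "SELECT filename FROM attachments" mydb | xargs -0 rm --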