Re: [HACKERS] Spreading full-page writes
On 27 May 2014 18:18, Jeff Janes wrote:
> On Mon, May 26, 2014 at 8:15 PM, Robert Haas wrote:
>>
>> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas wrote:
>> >> I don't think we know that. The server might have crashed before that
>> >> second record got generated. (This appears to be an unfixable flaw in
>> >> this proposal.)
>> >
>> > The second record is generated before the checkpoint is finished and the
>> > checkpoint record is written. So it will be there.
>> >
>> > (if you crash before the checkpoint is finished, the in-progress
>> > checkpoint is no good for recovery anyway, and won't be used)
>>
>> Hmm, I see.
>>
>> It's not great to have to generate WAL at buffer-eviction time,
>> though. Normally, when we go to evict a buffer, the WAL is already
>> written. We might have to wait for it to be flushed, but if the WAL
>> writer is doing its job, hopefully not. But here we'll definitely
>> have to wait for the WAL flush.
>
> I'm not sure we do need to flush it. If the checkpoint finishes, then the
> WAL surely got flushed as part of the process of recording the end of the
> checkpoint. If the checkpoint does not finish, recovery will start from
> the previous checkpoint, which does contain the FPW (because if it didn't,
> the page would not be eligible for this treatment) and so the possibly
> torn page will get overwritten in full.

I think Robert is correct: you would need to flush WAL before writing the
disk buffer. That is the current invariant of WAL-before-data.

However, we don't need to do this in the naive way (FPW, flush, write
buffer, one page at a time); we can do it with more buffering. So it seems
like a reasonable idea to do this using a 64-buffer BufferAccessStrategy
object and flush the WAL once every 64 buffers. That's beginning to look
more like double buffering, though...

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
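To make the batching idea concrete, here is a minimal C sketch. EvictionItem,
evict_batch and write_buffer_to_disk are hypothetical names invented for
illustration; XLogFlush() is the real routine that enforces the
WAL-before-data invariant:

    #define BATCH_SIZE 64           /* evictions per WAL flush */

    typedef struct EvictionItem
    {
        Buffer      buf;            /* dirty buffer to write out */
        XLogRecPtr  fpw_lsn;        /* LSN of the FPW record emitted for it */
    } EvictionItem;

    static void
    evict_batch(EvictionItem *items, int nitems)
    {
        XLogRecPtr  max_lsn = InvalidXLogRecPtr;
        int         i;

        /* Find the highest LSN in the batch; one flush covers them all. */
        for (i = 0; i < nitems; i++)
        {
            if (items[i].fpw_lsn > max_lsn)
                max_lsn = items[i].fpw_lsn;
        }
        XLogFlush(max_lsn);

        /* WAL is durable up to max_lsn; data pages may now be written. */
        for (i = 0; i < nitems; i++)
            write_buffer_to_disk(items[i].buf); /* hypothetical helper */
    }

The trade-off is the one named above: the pages sit in the batch until the
single flush happens, which is halfway to a double-write buffer.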
Re: [HACKERS] Spreading full-page writes
On 27 May 2014 13:20, Heikki Linnakangas wrote:
> On 05/27/2014 03:18 PM, Simon Riggs wrote:
>>
>> IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
>> the reason you also discussed changing the WAL record format to allow
>> us to identify the blocks touched by recovery more easily?
>
> Yeah, that was one use case I had in mind for the WAL format changes. See
> http://www.postgresql.org/message-id/533d6cbf.6080...@vmware.com.

Those proposals suggest some very big changes to the way WAL works.
Prefetch can work easily enough for most records - do we really need that
much churn? You mentioned Btree vacuum records, but I'm planning to
optimize those another way.

Why don't we just have the prefetch code in core and forget the WAL format
changes?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
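For reference, recovery prefetch can be sketched without any WAL format
change, at the cost of per-record-type block decoding. BlockRef and
decode_block_refs() below are hypothetical; smgropen() and smgrprefetch()
are the real storage-manager calls usable during recovery:

    typedef struct BlockRef
    {
        RelFileNode rnode;
        ForkNumber  forknum;
        BlockNumber blkno;
    } BlockRef;

    #define MAX_BLOCK_REFS 8

    static void
    prefetch_record_blocks(XLogRecord *record)
    {
        BlockRef    refs[MAX_BLOCK_REFS];
        int         nrefs;
        int         i;

        /*
         * Hypothetical decoder: today this needs knowledge of every record
         * type; the WAL format changes under discussion would make it
         * generic.
         */
        nrefs = decode_block_refs(record, refs);

        for (i = 0; i < nrefs; i++)
        {
            SMgrRelation reln = smgropen(refs[i].rnode, InvalidBackendId);

            /* Hint the kernel to start reading the block; non-blocking. */
            smgrprefetch(reln, refs[i].forknum, refs[i].blkno);
        }
    }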
Re: [HACKERS] json casts
On 05/27/2014 11:55 PM, Robert Haas wrote:
> On Tue, May 27, 2014 at 5:00 PM, Tom Lane wrote:
>> I'd be inclined to think a more useful answer to this issue would be to
>> make json.c special-case timestamps, as it already does for numerics.
>
> I wonder if anyone besides me is nervous about changing the semantics
> here. It seems like the sort of backward-compatibility break we
> normally avoid. If we do make such a compatibility break, it should
> certainly be release-noted.

Yes, certainly it should be.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposing pg_hibernate
On Wed, May 28, 2014 at 7:31 AM, Gurjeet Singh wrote:
> Caveats
> -------
>
> - Buffer list is saved only when Postgres is shutdown in "smart" and
>   "fast" modes.
>
>   That is, the buffer list is not saved when the database crashes, or on
>   "immediate" shutdown.
>
> - A reduction in `shared_buffers` is not detected.
>
>   If `shared_buffers` is reduced across a restart, and if the combined
>   saved buffer list is larger than the new shared_buffers, Postgres
>   Hibernator continues to read and restore blocks even after
>   `shared_buffers` worth of buffers have been restored.

How about the cases when shared buffers already contain some data:

a. Before the Readers start filling shared buffers, if this cluster wishes
to join replication as a slave and receive data from the master, then this
utility might need to evict some buffers filled during the startup phase.

b. As soon as the server completes startup (reaches the consistent point),
it allows new connections, which can also use some shared buffers before a
Reader process could use them - or are you planning to change the time
when users can connect to the database?

I am not sure if replacing shared-buffer contents in such cases can always
be considered useful.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
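For what it's worth, the missing check described in the second caveat would
amount to something like the following C sketch (SavedBlock and
restore_one_block() are hypothetical; NBuffers is the real global holding
the size of shared_buffers):

    static void
    restore_saved_blocks(SavedBlock *blocks, int nsaved)
    {
        int     restored = 0;
        int     i;

        for (i = 0; i < nsaved; i++)
        {
            /*
             * Stop once shared_buffers worth of blocks have been read, so
             * that a shrunken shared_buffers doesn't make the restore evict
             * its own earlier work (or, per the concerns above, blocks
             * already pulled in by recovery and early connections).
             */
            if (restored >= NBuffers)
                break;

            restore_one_block(&blocks[i]);  /* hypothetical helper */
            restored++;
        }
    }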
Re: [HACKERS] Spreading full-page writes
On Tue, May 27, 2014 at 1:19 PM, Fujii Masao wrote:
> On Tue, May 27, 2014 at 3:57 PM, Simon Riggs wrote:
> > The requirements we were discussing were around
> >
> > A) reducing WAL volume
> > B) reducing foreground overhead of writing FPWs - which spikes badly
> >    after checkpoint and the overhead is paid by the user processes
> >    themselves
> > C) need for FPWs during base backup
> >
> > So that gives us a few approaches
> >
> > * Compressing FPWs gives A
> > * Background FPWs gives us B
> >   which look like we can combine both ideas
> >
> > * Double-buffering would give us A and B, but not C
> >   and would be incompatible with other two ideas
>
> Double-buffering would allow us to disable FPW safely but which would
> make a recovery slow.

Is that because, during recovery, it needs to check the contents of the
double buffer as well as the page in its original location for
consistency, or is there something else that would make recovery slow?

Won't DBW (double-buffer write) reduce the number of pages that need to be
read from disk compared to FPW, which should offset any performance
degradation from other causes? IIUC, in the DBW mechanism we need a
temporary sequential log file of fixed size, which is used to write data
before the data gets written to its actual location in the tablespace.
Since the temporary log file is of fixed size, the number of pages that
need to be read during recovery should be smaller than with FPW, because
with FPW recovery needs to read all the pages written to WAL after the
last successful checkpoint.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
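A rough C sketch of the recovery pass being described - this is not an
existing PostgreSQL mechanism, and page_checksum_ok() and
write_page_to_home_location() are hypothetical; only the small fixed-size
double-write file is scanned, rather than every FPW in WAL since the last
checkpoint:

    #include <unistd.h>

    static void
    recover_from_double_write_file(int dw_fd)
    {
        char    page[BLCKSZ];

        while (read(dw_fd, page, BLCKSZ) == BLCKSZ)
        {
            /*
             * A torn copy inside the DW file means the crash happened
             * before the home location was overwritten, so the home page
             * is still intact; skip it.
             */
            if (!page_checksum_ok(page))
                continue;

            /* DW copy is good; the home page may be torn, so restore it. */
            write_page_to_home_location(page);
        }
    }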
Re: [HACKERS] json casts
On Tue, May 27, 2014 at 5:00 PM, Tom Lane wrote:
> I'd be inclined to think a more useful answer to this issue would be to
> make json.c special-case timestamps, as it already does for numerics.

I wonder if anyone besides me is nervous about changing the semantics
here. It seems like the sort of backward-compatibility break we normally
avoid. If we do make such a compatibility break, it should certainly be
release-noted.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
On Tue, May 27, 2014 at 11:20 PM, ash wrote:
>> Now, consider the situation in which we want to achieve the same
>> result without having to drop and recreate v. When the column type of
>> t.a is changed, we can use the dependencies to figure out that v might
>> be impacted. We can trace through the rewrite rule to find out where
>> column t.a is referenced. And ... then what? All we know about t.a
>> is that we're applying some operator to it, which is specified by OID.
>> The rewrite rule doesn't tell us the actual *name* by which the
>> operator was referenced in the original view text, nor does it tell us
>> the search path that was in effect at that time. If it did, we could
>> pick the same operator for + that would have been used had t.a been of
>> the new type originally, but as it is, we can't.
>
> This could be a showstopper indeed. We can look up the view def in the
> pg_views view, but it doesn't include any schema references unless they
> were explicit in the CREATE VIEW statement.
>
> On the other hand, pg_dump *can* work around this: if you dump a view
> that has been defined when a specific search_path was in effect, you'll
> get the correct definition in the schema dump.
>
> So why can't we try to learn from pg_dump?

Well, pg_dump is trying to do something different than what you're trying
to do here. pg_dump wants to make sure that the view, when fed back into
psql, creates the same view that exists now, regardless of whether that's
what the user created originally. For example, if a view is created
referring to table foo, and table foo is later renamed to bar, then
pg_dump wants to (and does) dump a statement referring to bar, not foo -
even if there's a *new* table called foo against which the view could
have been defined. Similarly, pg_dump will schema-qualify functions and
operators, or not, based on whether that's necessary to reference the
exact same operators that were selected when the original CREATE VIEW
command was run, regardless of whether the original references were
schema-qualified. None of that involves answering hypothetical questions;
but what you want to do does, and that I think is the problem in a
nutshell.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
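A quick SQL illustration of the rename-following behavior described above
(hypothetical session; the exact pg_get_viewdef() formatting varies):

    CREATE TABLE foo (x int);
    CREATE VIEW v AS SELECT x FROM foo;
    ALTER TABLE foo RENAME TO bar;
    CREATE TABLE foo (x int);   -- a brand-new table that happens to be foo

    SELECT pg_get_viewdef('v'::regclass);
    --  SELECT bar.x FROM bar;  -- the view tracked the rename by OID, so
    --                          -- pg_dump dumps it against bar, not foo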
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
David G Johnston writes:
> Would it be possible to handle the specific case of varchar(n) to
> varchar/text by just ignoring the error?

Just for reference, my case is INT to BIGINT.

--
Alex

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
Robert Haas writes:
>>> It'd need to be explicitly requested, eg a 'CASCADE' option.
>
> Why? Would any sane person NOT want this behavior?
>
> I think the question here is whether there's any way to make this work
> at all, not whether we'd want it if we could get it. Consider:
>
> CREATE OR REPLACE VIEW v AS SELECT a + 0 FROM t;
>
> If we drop the view, change the column type of t.a, and re-execute the
> view, + might resolve to a different operator than before (or no
> operator, at all). Furthermore, the operator to which it resolves
> will depend on the search path at the time the CREATE OR REPLACE VIEW
> command is executed.
>
> Now, consider the situation in which we want to achieve the same
> result without having to drop and recreate v. When the column type of
> t.a is changed, we can use the dependencies to figure out that v might
> be impacted. We can trace through the rewrite rule to find out where
> column t.a is referenced. And ... then what? All we know about t.a
> is that we're applying some operator to it, which is specified by OID.
> The rewrite rule doesn't tell us the actual *name* by which the
> operator was referenced in the original view text, nor does it tell us
> the search path that was in effect at that time. If it did, we could
> pick the same operator for + that would have been used had t.a been of
> the new type originally, but as it is, we can't.

This could be a showstopper indeed. We can look up the view def in the
pg_views view, but it doesn't include any schema references unless they
were explicit in the CREATE VIEW statement.

On the other hand, pg_dump *can* work around this: if you dump a view
that has been defined when a specific search_path was in effect, you'll
get the correct definition in the schema dump.

So why can't we try to learn from pg_dump?

> Now maybe there are options other than trying to reproduce what the
> original CREATE OR REPLACE statement would have done against the new
> type. For example, we could look through views that depend on t.a and
> rewrite each reference to that column to t.a::oldtype. This might
> lead to odd results with multiple nested casts and generally funny
> behavior if the column is re-typed multiple times; but maybe there's
> some way to fix that. Also, it might not really be the semantics you
> want if you were hoping the type update would truly cascade. But it
> might still be better than a sharp stick in the eye, which is kinda
> what we offer today.

No, casting back to oldtype totally defeats the purpose, at least for my
use case.

--
Alex

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
Alexander Shulgin wrote
> Hi Hackers,
>
> This came up recently on general list (and I've just hit the same issue
> today):
> http://www.postgresql.org/message-id/CAB7nPqTLmMn1LTb5WE0v0dO57iP0U73yKwzbZytAXDF1CAWLZg@.gmail
>
> Why couldn't postgres re-create the dependent views automatically? I
> mean it knows which views depend on the altered column and keeps the
> view definition, no?
>
> Would a patch likely be accepted? How hard do you feel this might be
> to implement? Any caveat that defeats the purpose of such feature?
>
> Thanks.

Would it be possible to handle the specific case of varchar(n) to
varchar/text by just ignoring the error?

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-create-dependent-views-on-ALTER-TABLE-ALTER-COLUMN-TYPE-tp5804972p5805191.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Peter Eisentraut writes:
> I'm shocked that this new feature has been committed post beta with less
> than 48 hours of review time over a holiday weekend. This issue has
> been known for years. Why does it need to be solved right now?

As per the commit message: our packagers were screaming about the fact
that they couldn't build uuid-ossp anymore now that autoconf 2.69 has
tightened up its validity checks for headers.

If you don't like this change, we can revert it and also revert the
upgrade to 2.69. I'm not terribly happy about pushing such a change
post-beta1 either, but it's not like this isn't something we've known was
needed.

Anyway, what's the worst case if we find a bug here? Tell people not to
use uuid-ossp?

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On 5/27/14, 2:41 PM, Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 27 May 2014 18:33:48 EEST, Tom Lane wrote:
>>> If we were going to do it like that, I'd vote for
>>>
>>> --with-uuid={ossp,e2fs,bsd}
>>>
>>> with no default at present (ie you can't say just "--with-uuid",
>>> though we'd have the option to allow that in future). But I doubt
>>> this is better than the --with-foo-uuid spelling.
>
>> FWIW, --with-uuid=foo looks much clearer to me.
>
> OK, I'll make it so.

I'm shocked that this new feature has been committed post beta with less
than 48 hours of review time over a holiday weekend. This issue has been
known for years. Why does it need to be solved right now?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On 5/26/14, 1:25 PM, Tom Lane wrote:
> Assuming this works as advertised, does anyone have an objection to
> pushing it into 9.4?

Yes, it's way too late for that.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
On Mon, May 26, 2014 at 10:39 AM, Stephen Frost wrote:
> * ash (a...@commandprompt.com) wrote:
>> This came up recently on general list (and I've just hit the same issue
>> today):
>>
>> http://www.postgresql.org/message-id/cab7npqtlmmn1ltb5we0v0do57ip0u73ykwzbzytaxdf1caw...@mail.gmail.com
>>
>> Why couldn't postgres re-create the dependent views automatically? I
>> mean it knows which views depend on the altered column and keeps the
>> view definition, no?
>
> Might be pretty complicated in the end..
>
>> Would a patch likely be accepted? How hard do you feel this might be
>> to implement? Any caveat that defeats the purpose of such feature?
>
> It'd need to be explicitly requested, eg a 'CASCADE' option.

Why? Would any sane person NOT want this behavior?

I think the question here is whether there's any way to make this work at
all, not whether we'd want it if we could get it. Consider:

CREATE OR REPLACE VIEW v AS SELECT a + 0 FROM t;

If we drop the view, change the column type of t.a, and re-execute the
view, + might resolve to a different operator than before (or no
operator, at all). Furthermore, the operator to which it resolves will
depend on the search path at the time the CREATE OR REPLACE VIEW command
is executed.

Now, consider the situation in which we want to achieve the same result
without having to drop and recreate v. When the column type of t.a is
changed, we can use the dependencies to figure out that v might be
impacted. We can trace through the rewrite rule to find out where column
t.a is referenced. And ... then what? All we know about t.a is that we're
applying some operator to it, which is specified by OID. The rewrite rule
doesn't tell us the actual *name* by which the operator was referenced in
the original view text, nor does it tell us the search path that was in
effect at that time. If it did, we could pick the same operator for +
that would have been used had t.a been of the new type originally, but as
it is, we can't.

Now maybe there are options other than trying to reproduce what the
original CREATE OR REPLACE statement would have done against the new
type. For example, we could look through views that depend on t.a and
rewrite each reference to that column to t.a::oldtype. This might lead to
odd results with multiple nested casts and generally funny behavior if
the column is re-typed multiple times; but maybe there's some way to fix
that. Also, it might not really be the semantics you want if you were
hoping the type update would truly cascade. But it might still be better
than a sharp stick in the eye, which is kinda what we offer today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
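To make the operator-resolution problem concrete, a hypothetical session
(the types and names are illustrative, and the error text is roughly what
current servers emit):

    CREATE TABLE t (a int4);
    CREATE VIEW v AS SELECT a + 0 AS a0 FROM t;  -- "+" resolves to
                                                 -- int4 + int4; that
                                                 -- operator's OID is frozen
                                                 -- into v's rewrite rule

    ALTER TABLE t ALTER COLUMN a TYPE numeric;
    -- ERROR:  cannot alter type of a column used by a view or rule

    -- If the view were transparently recreated instead, "+" would have to
    -- be re-resolved by name, now to numeric + numeric - but the rewrite
    -- rule records only the old operator's OID, not the name "+" or the
    -- search path that was in effect when the view was created.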
Re: [HACKERS] Proposing pg_hibernate
Please find attached the updated code of Postgres Hibernator. Notable
changes since the first proposal are:

.) The name has been changed to pg_hibernator (from pg_hibernate), to
   avoid confusion with the ORM Hibernate.
.) Works with Postgres 9.4.
.) Uses the DynamicBackgroundWorker infrastructure.
.) Ability to restore one database at a time, to avoid random-read
   storms. Can be disabled by parameter.
.) A couple of bug-fixes.
.) Detailed documentation.

I am pasting the README here (also included in the attachment).

Best regards,

Postgres Hibernator
===================

This Postgres extension is a set-it-and-forget-it solution to save and
restore the Postgres shared-buffers contents across Postgres server
restarts.

It performs the automatic save and restore of database buffers,
integrated with database shutdown and startup, hence reducing the
durations of database maintenance windows, in effect increasing the
uptime of your applications.

Postgres Hibernator automatically saves the list of shared buffers to
disk on database shutdown, and automatically restores the buffers on
database startup. This acts pretty much like your Operating System's
hibernate feature, except, instead of saving the contents of the memory
to disk, Postgres Hibernator saves just a list of block identifiers. And
it uses that list after startup to restore the blocks from the data
directory into Postgres' shared buffers.

Why
---

DBAs are often faced with the task of performing some maintenance on
their database server(s) which requires shutting down the database. The
maintenance may involve anything from a database patch application to a
hardware upgrade. One ugly side-effect of restarting the database
server/service is that all the data currently in the database server's
memory will be lost, which was painstakingly fetched from disk and put
there in response to application queries over time. And this data will
have to be rebuilt as applications start querying the database again. The
query response times will be very high until all the "hot" data is
fetched from disk and put back in memory again.

People employ a few tricks to get around this ugly truth, which range
from running a `select * from app_table;`, to `dd if=table_file ...`, to
using specialized utilities like pgfincore to prefetch data files into
the OS cache. Wouldn't it be ideal if the database itself could save and
restore its memory contents across restarts!

The duration for which the server is building up caches, and trying to
reach its optimal cache performance, is called ramp-up time. Postgres
Hibernator is aimed at reducing the ramp-up time of Postgres servers.

How
---

Compile and install the extension (you'll need a Postgres installation
and its `pg_config` in `$PATH`):

    $ cd pg_hibernator
    $ make install

Then:

1. Add `pg_hibernator` to the `shared_preload_libraries` variable in the
   `postgresql.conf` file.
2. Restart the Postgres server.
3. You are done.

How it works
------------

This extension uses the `Background Worker` infrastructure of Postgres,
which was introduced in Postgres 9.3. When the server starts, this
extension registers background workers; one for saving the buffers
(called `Buffer Saver`) when the server shuts down, and one for each
database in the cluster (called `Block Readers`) for restoring the
buffers saved during the previous shutdown.

When the Postgres server is being stopped/shut down, the `Buffer Saver`
scans the shared-buffers of Postgres, and stores the unique block
identifiers of each cached block to the disk.
This information is saved under the `$PGDATA/pg_hibernator/` directory.
For each of the databases whose blocks are resident in shared buffers,
one file is created; e.g. `$PGDATA/pg_hibernator/2.postgres.save`.

During the next startup sequence, the `Block Reader` threads are
registered, one for each file present under the `$PGDATA/pg_hibernator/`
directory. When the Postgres server has reached a stable state (that is,
it's ready for database connections), these `Block Reader` processes are
launched. The `Block Reader` process reads the save-files looking for
block-ids to restore. It then connects to the respective database, and
requests Postgres to fetch the blocks into shared-buffers (a minimal
sketch of that loop follows below).

Configuration
-------------

This extension can be controlled via the following parameters. These
parameters can be set in postgresql.conf or on the postmaster's
command-line.

- `pg_hibernator.enabled`

  Setting this parameter to false disables the hibernator features. That
  is, on server startup the BlockReader processes will not be launched,
  and on server shutdown the list of blocks in shared buffers will not
  be saved.

  Note that the BufferSaver process exists at all times, even when this
  parameter is set to `false`. This is to allow the DBA to enable/disable
  the extension without having to restart the server. The BufferSaver
  process checks this parameter during server startup and right before
  shutdown,
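The Block Reader loop referenced above could look roughly like this in C.
SavedBlock and read_next_saved_block() are hypothetical stand-ins for
pg_hibernator's actual save-file format; relation_open(),
ReadBufferExtended() and ReleaseBuffer() are the real server APIs:

    typedef struct SavedBlock
    {
        Oid         relid;
        ForkNumber  forknum;
        BlockNumber blocknum;
    } SavedBlock;

    static void
    restore_blocks_for_database(FILE *savefile)
    {
        SavedBlock  sb;

        while (read_next_saved_block(savefile, &sb))    /* hypothetical */
        {
            Relation    rel = relation_open(sb.relid, AccessShareLock);
            Buffer      buf;

            /* Pull the block into shared_buffers, then drop our pin. */
            buf = ReadBufferExtended(rel, sb.forknum, sb.blocknum,
                                     RBM_NORMAL, NULL);
            ReleaseBuffer(buf);

            relation_close(rel, AccessShareLock);
        }
    }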
Re: [HACKERS] [9.4] Minor SSL/ECDH related doc fixes
On Sat, May 17, 2014 at 10:36:59PM +0300, Marko Kreen wrote:
> - Clarify ECDH description in release notes.
> - Fix default value - it's 'prime256v1'.
> - List curves with good cross-platform support explicitly
>   (NIST P-256 / P-384 / P-521).
>
> The -list_curves output is full of garbage, it's hard to know which
> ones make sense to use. Only those three curves are supported
> cross-platform - OpenSSL/Java/Windows - so list them explicitly.
>
> Only reason to tune this value is changing overall security
> level up/down, so now this can be done safely and quickly.
>
> Only upwards though. We could also list here NIST P-192/P-224
> (prime192v1, secp224r1), but those are not supported by Windows.
> And prime256v1 is quite fast already.
>
> In the future it might make philosophical sense to list
> also Brainpool curves (RFC7027), or some new curves from
> http://safecurves.cr.yp.to/ when they are brought to TLS.
> But currently only NIST/NSA curves are a working option,
> so let's keep it simple for users.

Attached patch applied. I shortened the release note description.

--
  Bruce Momjian                   http://momjian.us
  EnterpriseDB                    http://enterprisedb.com

  + Everyone has their own god. +

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index d9e5985..4a666d0
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** include 'filename'
*** 1020,1032 ****
!        Specifies the name of the curve to use in ECDH key exchanges. The
!        default is prime256p1.
!        The list of available curves can be shown with the command
!        openssl ecparam -list_curves.
--- 1020,1042 ----
!        Specifies the name of the curve to use in ECDH key exchange.
!        It needs to be supported by all clients that connect.
!        It does not need to be same curve as used by server's
!        Elliptic Curve key. The default is prime256v1.
!
!        OpenSSL names for most common curves:
!        prime256v1 (NIST P-256),
!        secp384r1 (NIST P-384),
!        secp521r1 (NIST P-521).
!
!        The full list of available curves can be shown with the command
!        openssl ecparam -list_curves. Not all of them
!        are usable in TLS though.
diff --git a/doc/src/sgml/release-9.4.sgml b/doc/src/sgml/release-9.4.sgml
new file mode 100644
index 24862fe..ae059f8
*** a/doc/src/sgml/release-9.4.sgml
--- b/doc/src/sgml/release-9.4.sgml
***************
*** 616,632 ****
!        Such keys are faster and have improved security over previous
!        options. The new configuration
!        parameter ssl_ecdh_curve
!        controls which curve is used.

         Improve the default ssl_ciphers ciphers (Marko Kreen)
--- 616,633 ----
!        This allows use of Elliptic Curve keys for server authentication.
!        Such keys are faster and have improved security over RSA keys.
!        The new configuration parameter
!        ssl_ecdh_curve
!        controls which curve is used for ECDH.

         Improve the default ssl_ciphers value (Marko Kreen)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
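Per the guidance above, moving the security level up from the default is a
one-line configuration change (illustrative postgresql.conf excerpt):

    # postgresql.conf
    ssl = on
    ssl_ecdh_curve = 'secp384r1'   # NIST P-384; default is 'prime256v1'
                                   # (NIST P-256)

All three listed curves (prime256v1, secp384r1, secp521r1) appear in the
output of `openssl ecparam -list_curves`, which is how a chosen name can be
checked against the local OpenSSL build.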
Re: [HACKERS] jsonb failed assertions
On Wed, May 21, 2014 at 4:53 AM, Heikki Linnakangas wrote:
> Hmm. The patch looks correct as far as it goes. But that function is
> still a bit funny. When it compares two identical arrays (or objects),
> and reaches the WJB_END_ARRAY token, it will still fall into the code
> that checks what the va and vb types are, and compares the last scalar
> values in that array again. That's wrong, and will fail if the compiler
> decides to clobber the local "va" or "vb" variables between iterations
> of the do-while loop, because JsonbIteratorNext() does not set the value
> when returning WJB_END_ARRAY.

That's an obsolete assumption that once actually applied during
development. Attached revision also adds handling for that case.

Thanks
--
Peter Geoghegan

*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
*************** static void uniqueifyJsonbObject(JsonbVa
*** 62,74 ****
   *
   * There isn't a JsonbToJsonbValue(), because generally we find it more
   * convenient to directly iterate through the Jsonb representation and only
!  * really convert nested scalar values. formIterIsContainer() does this, so
!  * that clients of the iteration code don't have to directly deal with the
!  * binary representation (JsonbDeepContains() is a notable exception, although
!  * all exceptions are internal to this module). In general, functions that
!  * accept a JsonbValue argument are concerned with the manipulation of scalar
!  * values, or simple containers of scalar values, where it would be
!  * inconvenient to deal with a great amount of other state.
   */
  Jsonb *
  JsonbValueToJsonb(JsonbValue *val)
--- 62,74 ----
   *
   * There isn't a JsonbToJsonbValue(), because generally we find it more
   * convenient to directly iterate through the Jsonb representation and only
!  * really convert nested scalar values. JsonbIteratorNext() does this, so that
!  * clients of the iteration code don't have to directly deal with the binary
!  * representation (JsonbDeepContains() is a notable exception, although all
!  * exceptions are internal to this module). In general, functions that accept
!  * a JsonbValue argument are concerned with the manipulation of scalar values,
!  * or simple containers of scalar values, where it would be inconvenient to
!  * deal with a great amount of other state.
   */
  Jsonb *
  JsonbValueToJsonb(JsonbValue *val)
*************** compareJsonbContainers(JsonbContainer *a
*** 137,149 ****
  		ra = JsonbIteratorNext(&ita, &va, false);
  		rb = JsonbIteratorNext(&itb, &vb, false);

- 		/*
- 		 * To a limited extent we'll redundantly iterate over an array/object
- 		 * while re-performing the same test without any reasonable
- 		 * expectation of the same container types having differing lengths
- 		 * (as when we process a WJB_BEGIN_OBJECT, and later the corresponding
- 		 * WJB_END_OBJECT), but no matter.
- 		 */
  		if (ra == rb)
  		{
  			if (ra == WJB_DONE)
--- 137,142 ----
*************** compareJsonbContainers(JsonbContainer *a
*** 152,157 ****
--- 145,162 ----
  				break;
  			}

+ 			if (ra == WJB_END_ARRAY || ra == WJB_END_OBJECT)
+ 			{
+ 				/*
+ 				 * There is no array or object to safely compare at this stage
+ 				 * of processing. jbvArray/jbvObject values are only
+ 				 * considered initially, during processing of WJB_BEGIN_ARRAY
+ 				 * and WJB_BEGIN_OBJECT tokens. This doesn't matter, because a
+ 				 * second comparison would be redundant.
+ 				 */
+ 				continue;
+ 			}
+
  			if (va.type == vb.type)
  			{
  				switch (va.type)
*************** compareJsonbContainers(JsonbContainer *a
*** 194,212 ****
  		else
  		{
  			/*
! 			 * It's safe to assume that the types differed.
 			 *
 			 * If the two values were the same container type, then there'd
 			 * have been a chance to observe the variation in the number of
! 			 * elements/pairs (when processing WJB_BEGIN_OBJECT, say). They
! 			 * can't be scalar types either, because then they'd have to be
! 			 * contained in containers already ruled unequal due to differing
! 			 * numbers of pairs/elements, or already directly ruled unequal
! 			 * with a call to the underlying type's comparator.
 			 */
 			Assert(va.type != vb.type);
! 			Assert(va.type == jbvArray || va.type == jbvObject);
! 			Assert(vb.type == jbvArray || vb.type == jbvObject);

 			/* Type-defined order */
 			res = (va.type > vb.type) ? 1 : -1;
 		}
--- 199,227 ----
 		else
 		{
 			/*
! 			 * It's safe to assume that the types differed, and that the va and
! 			 * vb values passed were set.
 			 *
+ 			 * We don't have to consider the WJB_END_ARRAY and WJB_END_OBJECT
+ 			 * cases here, because in order to process those tokens we'd first
+ 			 * have to process WJB_BEGIN_ARRAY or WJB_BEGIN_OBJECT, and that
+ 			 * would be sufficient reason to end up here (with reliable values
+ 			 * set for va and vb), which always results in finishing the
+ 			 * comparison.
+ 			 */
+ 			Assert(ra != WJB_END_
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Pushed; thanks for working on this!

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why is pg_lsn marking itself a preferred type?
On Wed, May 28, 2014 at 4:27 AM, Tom Lane wrote:
> One of these doesn't belong:
>
> postgres=# select typname, typcategory from pg_type where typispreferred;
>    typname   | typcategory
> -------------+-------------
>  bool        | B
>  text        | S
>  oid         | N
>  float8      | N
>  inet        | I
>  timestamptz | D
>  interval    | T
>  varbit      | V
>  pg_lsn      | U
> (9 rows)
>
> Was there any actual rationale to this, or was it just somebody who did
> not understand what that bit is for?

Looks like an oversight of the pg_lsn patch. You could blame me for that
I suppose...

> I think it's probably mostly harmless given the lack of casts to or from
> pg_lsn, but it's still a darn bad idea to have any preferred types in the
> 'U' category. If we leave it like this it will bite us in the rear
> eventually.
>
> The most expedient response at this late date seems to be to change the
> entry in pg_type.h without bumping catversion. That way at least it
> will be right in databases initdb'd after beta2.

Agreed. Attached patch fixes that, but I am sure that you already figured
it out.

--
Michael

diff --git a/src/include/catalog/pg_type.h b/src/include/catalog/pg_type.h
index a8abc0f..b7d9256 100644
--- a/src/include/catalog/pg_type.h
+++ b/src/include/catalog/pg_type.h
@@ -578,7 +578,7 @@ DESCR("UUID datatype");
 DATA(insert OID = 2951 ( _uuid PGNSP PGUID -1 f b A f t \054 0 2950 0 array_in array_out array_recv array_send - - array_typanalyze i x f 0 -1 0 0 _null_ _null_ _null_ ));

 /* pg_lsn */
-DATA(insert OID = 3220 ( pg_lsn PGNSP PGUID 8 FLOAT8PASSBYVAL b U t t \054 0 0 3221 pg_lsn_in pg_lsn_out pg_lsn_recv pg_lsn_send - - - d p f 0 -1 0 0 _null_ _null_ _null_ ));
+DATA(insert OID = 3220 ( pg_lsn PGNSP PGUID 8 FLOAT8PASSBYVAL b U f t \054 0 0 3221 pg_lsn_in pg_lsn_out pg_lsn_recv pg_lsn_send - - - d p f 0 -1 0 0 _null_ _null_ _null_ ));
 DESCR("PostgreSQL LSN datatype");
 #define LSNOID 3220
 DATA(insert OID = 3221 ( _pg_lsn PGNSP PGUID -1 f b A f t \054 0 3220 0 array_in array_out array_recv array_send - - array_typanalyze d x f 0 -1 0 0 _null_ _null_ _null_ ));

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
On 05/27/2014 07:17 PM, Tom Lane wrote:
> Stephen Frost writes:
>> * Andrew Dunstan (and...@dunslane.net) wrote:
>>> Given that this would be a hard coded behaviour change, is it too
>>> late to do this for 9.4?
>> No, for my 2c.
>
> If we do it by adding casts then it'd require an initdb, so I'd vote
> against that for 9.4. If we just change behavior in json.c then that
> objection doesn't apply, so I wouldn't complain.

I wasn't proposing to add a cast, just to allow users to add one if they
wanted. But I'm quite happy to go with special-casing timestamps, and
leave the larger question for another time.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
Stephen Frost writes:
> * Andrew Dunstan (and...@dunslane.net) wrote:
>> Given that this would be a hard coded behaviour change, is it too
>> late to do this for 9.4?
> No, for my 2c.

If we do it by adding casts then it'd require an initdb, so I'd vote
against that for 9.4. If we just change behavior in json.c then that
objection doesn't apply, so I wouldn't complain.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas wrote:
>> Fair enough, but I don't think that affects correctness either way (I
>> don't think that you meant to imply that this was a necessary
>> precaution that you'd taken - right?).
>
> Right.

So, the comments within _bt_search() suggest that the _bt_moveright()
call will perform a _bt_finish_split() call opportunistically iff it's
called from _bt_doinsert() (i.e. access == BT_WRITE). There is no reason
not to do so in all circumstances though, assuming that it's desirable to
do so as soon as possible (which I *don't* actually assume). If I'm not
mistaken, it's also true that it would be strictly speaking correct to
never do it there.

Do you think it would be a fair stress-test if I were to hack Postgres so
that this call never happens (within _bt_moveright())? I'd also have an
incomplete page split occur at random with a probability of, say, 1% per
split. The mechanism would be much the same as your original test-case
for the split patch - I'd throw an error at the wrong place, although
only 1% of the time, and over many hours. The concern with the
_bt_moveright() call of _bt_finish_split() is that it might conceal a
problem without reliably fixing it, potentially making isolating that
problem much harder.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
* Andrew Dunstan (and...@dunslane.net) wrote:
> >> I'd be inclined to think a more useful answer to this issue would be
> >> to make json.c special-case timestamps, as it already does for
> >> numerics.
>
> OK, that's another approach.

I'm all for doing this for JSON, but it'd sure be nice to have the option
to get actual ISO 8601 from the built-ins too...

> Given that this would be a hard coded behaviour change, is it too
> late to do this for 9.4?

No, for my 2c.

Thanks,

Stephen
Re: [HACKERS] json casts
On 05/27/2014 05:43 PM, Hannu Krosing wrote:
> On 05/27/2014 11:00 PM, Tom Lane wrote:
>>> See src/backend/utils/adt/json.c:json_categorize_type(), lines
>>> 1280-1300. When rendering some value as part of a json string, if a
>>> cast exists from the data type to json, then the cast function is used
>>> to render the json instead of the type's normal output function, but
>>> only if it's not a builtin type.
>> How exactly would disabling that code have any effect on timestamp
>> rendering? There's no cast to json from timestamps (nor any other
>> builtin type, except jsonb).
> I think Andrew's idea was that if casts were used, one could fix the
> above problem by defining a correct cast.

Right.

>> I'd be inclined to think a more useful answer to this issue would be to
>> make json.c special-case timestamps, as it already does for numerics.

OK, that's another approach.

> But I agree that special-casing the code to use the de-facto json
> standard of using ISO 8601 date representation is a better solution.
> Just make sure you get the TZ part right - this is another place where
> PostgreSQL often differs from other systems' understanding of ISO
> timestamps.

The target would be something like:

    to_char($1, '\"YYYY-MM-DD"T"HH24:MI:SS.USOF\"')

AIUI that should be legal ISO 8601. But I'm happy to defer to experts.

Given that this would be a hard coded behaviour change, is it too late to
do this for 9.4?

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
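For reference, the kind of user-level cast that json.c currently ignores
for builtin types would look something like the following sketch
(illustrative only; here to_json() supplies the surrounding JSON quoting,
so the \" literals from the format above aren't needed):

    CREATE FUNCTION timestamptz_to_json(timestamptz) RETURNS json
        LANGUAGE sql STABLE
        AS $$ SELECT to_json(to_char($1, 'YYYY-MM-DD"T"HH24:MI:SS.USOF')) $$;

    CREATE CAST (timestamptz AS json)
        WITH FUNCTION timestamptz_to_json(timestamptz);

    -- If json.c honored this cast (or special-cased timestamps itself),
    -- SELECT row_to_json(r) FROM (SELECT now() AS ts) r;
    -- would render ts as e.g. "2014-05-27T17:43:00.123456-04"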
Re: [HACKERS] PG Manual: Clarifying the repeatable read isolation example
Heikki Linnakangas-6 wrote
> On 05/27/2014 10:12 PM, Evan Jones wrote:
>> I was reading the Postgres MVCC documentation today (which is
>> generally fantastic BTW), and am slightly confused by a single
>> sentence example, describing possible read-only snapshot isolation
>> anomalies. I would like to submit a patch to clarify this example,
>> since I suspect others may be also confused, but to do that I need
>> help understanding it. The example was added as part of the
>> Serializable Snapshot Isolation patch.
>>
>> Link to the commit:
>> http://git.postgresql.org/gitweb/?p=postgresql.git;h=dafaa3efb75ce1aae2e6dbefaf6f3a889dea0d21
>>
>> I'm referring to the following sentence of 13.2.2, which is still in
>> the source tree:
>> http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-REPEATABLE-READ
>>
>> "For example, even a read only transaction at this level may see a
>> control record updated to show that a batch has been completed but
>> not see one of the detail records which is logically part of the
>> batch because it read an earlier revision of the control record."
>
> Hmm, that seems to be a super-summarized description of what Kevin & Dan
> called the "receipts problem". There's an example of that in the
> isolation test suite, see src/test/isolation/specs/receipt-report.spec.
> Googling for it, I also found an academic paper written by Kevin & Dan
> that illustrates it: http://arxiv.org/pdf/1208.4179.pdf, "2.1.2 Example
> 2: Batch Processing". (Nice work, I didn't know of that paper until now!)
>
> I agree that's too terse. I think it would be good to actually spell out
> a complete example of the Receipt problem in the manual. That chapter in
> the manual contains examples of anomalies in Read Committed mode, so it
> would be good to give a concrete example of an anomaly in Repeatable
> Read mode too. Want to write up a docs patch?

While this is not a doc patch I decided to give it some thought. The
"bank" example was understandable enough for me, so I simply tried to
make it more accessible. I also didn't try to make it conform to other,
existing examples.

This is intended to replace the entire "For example..." paragraph noted
above:

    While Repeatable Read provides for stable in-transaction reads,
    logical query anomalies can result because commit order is not
    restricted, and serialization errors occur only if two transactions
    attempt to modify the same record.

    Consider a rule that, upon updating r1 or r2, if r1 + r2 < 0, then an
    additional 1 is subtracted from the row being updated.

    Initial state: r1 = 0; r2 = 0

    Transaction 1 begins: reads (0,0); adds -10 to r1; notes that
        r1 + r2 will be -10 and subtracts an additional 1.
    Transaction 2 begins: reads (0,0); adds 20 to r2; notes that
        r1 + r2 will be +20; no further action needed.
    Commit 2.
    Transaction 3: reads (0,20) and commits.
    Commit 1.
    Transaction 4: reads (-11,20) and commits.

    However, if Transaction 2 logically commits first, then the
    calculation of r1 + r2 in Transaction 1 should come out positive and
    the additional subtraction of 1 should not occur - leaving T4 reading
    (-10,20). The ability for out-of-order commits is what allows T3 to
    read the pair (0,20), which is logically impossible in the
    T2-before-T1 commit order if T4 reads (-11,20). Neither transaction
    fails, since a serialization failure occurs only if a concurrent
    update touches r1 (in T1) or r2 (in T2); the concurrent update of r2,
    which T1 only read, goes unnoticed - i.e., no failure occurs when a
    value that was merely read undergoes a change.
Inspired by:
http://www.sigmod.org/publications/sigmod-record/0409/2.ROAnomONeil.pdf
- Example 1.3

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/PG-Manual-Clarifying-the-repeatable-read-isolation-example-tp5805152p5805170.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
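A hypothetical psql interleaving of the example above (table name, row
layout, and session labels are illustrative, not from the original mail):

    CREATE TABLE tab (id int PRIMARY KEY, val int);
    INSERT INTO tab VALUES (1, 0), (2, 0);        -- row 1 is r1, row 2 is r2

    -- Session T1
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT sum(val) FROM tab;                     -- sees 0 + 0 = 0
    UPDATE tab SET val = val - 10 WHERE id = 1;   -- r1 := -10
    UPDATE tab SET val = val - 1  WHERE id = 1;   -- r1 + r2 = -10 < 0 in
                                                  -- T1's snapshot, so
                                                  -- subtract 1 more

    -- Session T2
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT sum(val) FROM tab;                     -- also sees 0
    UPDATE tab SET val = val + 20 WHERE id = 2;   -- r1 + r2 = +20, no
                                                  -- extra subtraction
    COMMIT;                                       -- T2 commits first

    -- Session T3
    SELECT val FROM tab ORDER BY id;              -- (0, 20)

    -- Session T1 again
    COMMIT;                     -- no serialization failure in REPEATABLE
                                -- READ: T1 and T2 wrote different rows
    SELECT val FROM tab ORDER BY id;              -- (-11, 20): impossible
                                                  -- in any serial order
                                                  -- consistent with T3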
Re: [HACKERS] json casts
On 05/27/2014 11:00 PM, Tom Lane wrote:
> Andrew Dunstan writes:
>> On 05/27/2014 03:57 PM, Heikki Linnakangas wrote:
>>> On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
>>>> I've been on the receiving end of a couple of mumbles about the fact
>>>> that the JSON rendering code ignores casts of builtin types to JSON.
>>>> This was originally done as an optimization to avoid doing cache
>>>> lookups for casts for things we knew quite well how to turn into JSON
>>>> values (unlike, say, hstore). However, there is at least one concrete
>>>> case where this has some possibly undesirable consequences, namely
>>>> timestamps. Many JSON processors, especially JavaScript/ECMAScript
>>>> processors, require timestamp values to be in ISO 8601 format, with a
>>>> 'T' between the date part and the time part, and thus they barf on
>>>> the output we produce for such values.
>>> I don't understand what ignoring casts of builtin types to JSON means.
>>> Can you give an example?
>> See src/backend/utils/adt/json.c:json_categorize_type(), lines
>> 1280-1300. When rendering some value as part of a json string, if a
>> cast exists from the data type to json, then the cast function is used
>> to render the json instead of the type's normal output function, but
>> only if it's not a builtin type.
> How exactly would disabling that code have any effect on timestamp
> rendering? There's no cast to json from timestamps (nor any other
> builtin type, except jsonb).

I think Andrew's idea was that if casts were used, one could fix the
above problem by defining a correct cast.

> I'd be inclined to think a more useful answer to this issue would be to
> make json.c special-case timestamps, as it already does for numerics.

But I agree that special-casing the code to use the de-facto json
standard of using ISO 8601 date representation is a better solution.
Just make sure you get the TZ part right - this is another place where
PostgreSQL often differs from other systems' understanding of ISO
timestamps.

Cheers

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5 commit fest schedule
On 5/22/14, 7:50 AM, Abhijit Menon-Sen wrote:
> At 2014-05-22 07:22:42 -0400, pete...@gmx.net wrote:
>>
>> We need a volunteer to manage the first commit fest.
>
> I volunteer.

Congratulations, you're the new commit fest manager. I trust that you
know your way around, but if you have any questions about anything, you
can ask me, and probably some of the other previous commit fest managers.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
Andrew Dunstan writes:
> On 05/27/2014 03:57 PM, Heikki Linnakangas wrote:
>> On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
>>> I've been on the receiving end of a couple of mumbles about the fact
>>> that the JSON rendering code ignores casts of builtin types to JSON.
>>> This was originally done as an optimization to avoid doing cache
>>> lookups for casts for things we knew quite well how to turn into JSON
>>> values (unlike, say, hstore). However, there is at least one concrete
>>> case where this has some possibly undesirable consequences, namely
>>> timestamps. Many JSON processors, especially JavaScript/ECMAScript
>>> processors, require timestamp values to be in ISO 8601 format, with a
>>> 'T' between the date part and the time part, and thus they barf on the
>>> output we produce for such values.
>> I don't understand what ignoring casts of builtin types to JSON means.
>> Can you give an example?
> See src/backend/utils/adt/json.c:json_categorize_type(), lines
> 1280-1300. When rendering some value as part of a json string, if a
> cast exists from the data type to json, then the cast function is used
> to render the json instead of the type's normal output function, but
> only if it's not a builtin type.

How exactly would disabling that code have any effect on timestamp
rendering? There's no cast to json from timestamps (nor any other
builtin type, except jsonb).

I'd be inclined to think a more useful answer to this issue would be to
make json.c special-case timestamps, as it already does for numerics.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On 05/27/2014 11:30 PM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas wrote:
>> Ah, sorry, I got that wrong. The downlinks store the *low* key of the
>> child page, not the high key as I depicted. Let me try again:
>
> Would you mind humoring me, and including a corrected final
> post-downlink-insert diagram, when the split is fully complete? You
> omitted that.

Sure:

Parent page
+----------+
| 7 -> A   |
| 8 -> B   |
+----------+

Leaf A             Leaf B
+------------+     +------------+
| HI-key: 8  |     | HI-key: 9  |
|            |     |            |
|            | <-> |            |
|            |     |            |
| 7 7* 8     |     | 9          |
+------------+     +------------+

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas wrote:
> Ah, sorry, I got that wrong. The downlinks store the *low* key of the
> child page, not the high key as I depicted. Let me try again:

Would you mind humoring me, and including a corrected final
post-downlink-insert diagram, when the split is fully complete? You
omitted that.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
On 05/27/2014 03:57 PM, Heikki Linnakangas wrote:
> On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
>> I've been on the receiving end of a couple of mumbles about the fact
>> that the JSON rendering code ignores casts of builtin types to JSON.
>> This was originally done as an optimization to avoid doing cache
>> lookups for casts for things we knew quite well how to turn into JSON
>> values (unlike, say, hstore). However, there is at least one concrete
>> case where this has some possibly undesirable consequences, namely
>> timestamps. Many JSON processors, especially JavaScript/ECMAScript
>> processors, require timestamp values to be in ISO 8601 format, with a
>> 'T' between the date part and the time part, and thus they barf on the
>> output we produce for such values.
>
> I don't understand what ignoring casts of builtin types to JSON means.
> Can you give an example?

See src/backend/utils/adt/json.c:json_categorize_type(), lines 1280-1300.
When rendering some value as part of a json string, if a cast exists from
the data type to json, then the cast function is used to render the json
instead of the type's normal output function, but only if it's not a
builtin type.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
> I've been on the receiving end of a couple of mumbles about the fact
> that the JSON rendering code ignores casts of builtin types to JSON.
> This was originally done as an optimization to avoid doing cache lookups
> for casts for things we knew quite well how to turn into JSON values
> (unlike, say, hstore). However, there is at least one concrete case
> where this has some possibly undesirable consequences, namely
> timestamps. Many JSON processors, especially JavaScript/ECMAScript
> processors, require timestamp values to be in ISO 8601 format, with a
> 'T' between the date part and the time part, and thus they barf on the
> output we produce for such values.

I don't understand what ignoring casts of builtin types to JSON means.
Can you give an example?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] json casts
I've been on the receiving end of a couple of mumbles about the fact that
the JSON rendering code ignores casts of builtin types to JSON. This was
originally done as an optimization to avoid doing cache lookups for casts
for things we knew quite well how to turn into JSON values (unlike, say,
hstore). However, there is at least one concrete case where this has some
possibly undesirable consequences, namely timestamps. Many JSON
processors, especially JavaScript/ECMAScript processors, require
timestamp values to be in ISO 8601 format, with a 'T' between the date
part and the time part, and thus they barf on the output we produce for
such values.

I'm therefore wondering if we should just remove this optimization. I
don't have any performance figures, but it seems unlikely to be terribly
expensive. Perhaps in 9.5 we should look at caching a lot of the info the
json rendering code gathers, but that's something we can't look at now.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PG Manual: Clarifying the repeatable read isolation example
Oh yeah, I shared an office with Dan, so I should have thought to check
their paper. Oops.

Thanks for the suggestion; I'll try to summarize this into something that
is similar to the Read Committed and Serializable mode examples. It may
take me a week or two to find the time, but thanks for the suggestions.

Evan

On May 27, 2014, at 15:32, Heikki Linnakangas wrote:
> I agree that's too terse. I think it would be good to actually spell
> out a complete example of the Receipt problem in the manual. That
> chapter in the manual contains examples of anomalies in Read Committed
> mode, so it would be good to give a concrete example of an anomaly in
> Repeatable Read mode too. Want to write up a docs patch?

--
Work: https://www.mitro.co/    Personal: http://evanjones.ca/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_llog not mentioned in "Database File Layout"
On Mon, May 26, 2014 at 12:33 PM, Amit Langote wrote:
> Hi,
>
> Just noticed pg_llog is not mentioned in the "Database File Layout"
> section. Wonder if it's an oversight?

Yes, it is an oversight. Patch attached.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello

diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 57e7f09..132169d 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -78,6 +78,11 @@ Item
+ pg_llog
+ Subdirectory containing logical replication status data
+
+
  pg_multixact
  Subdirectory containing multitransaction status data (used for shared row locks)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PG Manual: Clarifying the repeatable read isolation example
On 05/27/2014 10:12 PM, Evan Jones wrote:
> I was reading the Postgres MVCC documentation today (which is generally
> fantastic BTW), and am slightly confused by a single sentence example,
> describing possible read-only snapshot isolation anomalies. I would like
> to submit a patch to clarify this example, since I suspect others may be
> also confused, but to do that I need help understanding it. The example
> was added as part of the Serializable Snapshot Isolation patch.
>
> Link to the commit:
> http://git.postgresql.org/gitweb/?p=postgresql.git;h=dafaa3efb75ce1aae2e6dbefaf6f3a889dea0d21
>
> I'm referring to the following sentence of 13.2.2, which is still in the
> source tree:
> http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-REPEATABLE-READ
>
> "For example, even a read only transaction at this level may see a
> control record updated to show that a batch has been completed but not
> see one of the detail records which is logically part of the batch
> because it read an earlier revision of the control record."

Hmm, that seems to be a super-summarized description of what Kevin & Dan
called the "receipts problem". There's an example of that in the
isolation test suite, see src/test/isolation/specs/receipt-report.spec.
Googling for it, I also found an academic paper written by Kevin & Dan
that illustrates it: http://arxiv.org/pdf/1208.4179.pdf, "2.1.2 Example
2: Batch Processing". (Nice work, I didn't know of that paper until now!)

I agree that's too terse. I think it would be good to actually spell out
a complete example of the Receipt problem in the manual. That chapter in
the manual contains examples of anomalies in Read Committed mode, so it
would be good to give a concrete example of an anomaly in Repeatable Read
mode too. Want to write up a docs patch?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Why is pg_lsn marking itself a preferred type?
One of these doesn't belong:

postgres=# select typname, typcategory from pg_type where typispreferred;
   typname   | typcategory
-------------+-------------
 bool        | B
 text        | S
 oid         | N
 float8      | N
 inet        | I
 timestamptz | D
 interval    | T
 varbit      | V
 pg_lsn      | U
(9 rows)

Was there any actual rationale to this, or was it just somebody who did not understand what that bit is for? I think it's probably mostly harmless given the lack of casts to or from pg_lsn, but it's still a darn bad idea to have any preferred types in the 'U' category. If we leave it like this it will bite us in the rear eventually. The most expedient response at this late date seems to be to change the entry in pg_type.h without bumping catversion. That way at least it will be right in databases initdb'd after beta2. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
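For readers unfamiliar with what typispreferred buys you: when operator or function resolution is ambiguous, candidates taking the category's preferred type win, so a preferred type quietly attracts implicit casts from its whole category. A quick behavior sketch using the numeric category, where float8 is legitimately preferred (output shown in comments, abbreviated):

-- there is no "float4 + numeric" operator; resolution falls back to the
-- category's preferred type, so both sides are implicitly cast to float8
SELECT '1.5'::float4 + '2'::numeric AS sum,
       pg_typeof('1.5'::float4 + '2'::numeric) AS type;
-- sum: 3.5, type: double precision

A preferred pg_lsn would pull category-'U' types toward it in the same way, if implicit casts to it ever appeared.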
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On 05/27/2014 09:47 PM, Peter Geoghegan wrote: On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas wrote: Also note that _bt_moveright() also finishes any incomplete splits it encounters (when called for an insertion). So _bt_findinsertloc() never gets called on a page with the incomplete-split flag set. It might encounter one when it moves right, but never the first page. Fair enough, but I don't think that affects correctness either way (I don't think that you meant to imply that this was a necessary precaution that you'd taken - right?). Right.

If I understood correctly, the tree looks like this before the insertion:

Parent page:
+------------+
|            |
| 9 -> A     |
+------------+

Leaf A
+------------+
| HI-key: 9  |
|            |
| 7  8  9    |
+------------+

And after insertion and incomplete split:

Parent page
+------------+
|            |
| 9 -> A     |
+------------+

Leaf A               Leaf B
+--------------+     +------------+
| HI-key: 8    |     | HI-key: 9  |
| (INCOMPLETE_ |     |            |
|  SPLIT)      | <-> |            |
|              |     |            |
| 7  7*  8     |     | 9          |
+--------------+     +------------+

After the split is finished, the tree looks like this:

Parent page
+------------+
| 8 -> A     |
| 9 -> B     |
+------------+

Leaf A               Leaf B
+------------+     +------------+
| HI-key: 8  |     | HI-key: 9  |
|            | <-> |            |
| 7  7*  8   |     | 9          |
+------------+     +------------+

How did the parent page change between before and after the final atomic operation (page split completion)? What happened to "9 -> A"? Ah, sorry, I got that wrong. The downlinks store the *low* key of the child page, not the high key as I depicted. Let me try again:

On 05/27/2014 09:17 AM, Peter Geoghegan wrote: Anyway, the concern is that there may be problems when we call _bt_finish_split() with that left sibling locked throughout (i.e. finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and itself has a right sibling from the incomplete split (which is usually the value lock page's right-right sibling)). I'm not concerned about performance, since as the comment says it ought to be an infrequent occurrence. I also believe that there are no deadlock hazards. But consider this scenario: * We insert the value 7 into an int4 unique index. We need to split the leaf page. We run out of memory or something, and ours is an incomplete page split. Our transaction is aborted. For the sake of argument, suppose that there are also already a bunch of dead tuples within the index with values of 7, 8 and 9.

So, initially the tree looks like this:

Parent page:
+------------+
|            |
| 7 -> A     |
+------------+

Leaf A
+------------+
| HI-key: 9  |
|            |
| 7  8  9    |
+------------+

And after insertion and incomplete split:

Parent page
+------------+
|            |
| 7 -> A     |
+------------+

Leaf A               Leaf B
+--------------+     +------------+
| HI-key: 8    |     | HI-key: 9  |
| (INCOMPLETE_ |     |            |
|  SPLIT)      | <-> |            |
|              |     |            |
| 7  7*  8     |     | 9          |
+--------------+     +------------+

where 7* is the newly inserted key, with value 7. (you didn't mention at what point the split happens, but in the next paragraph you said the new downlink has value 8, which implies the above split) * Another inserter of the value 7 comes along. It follows exactly the same downlink as the first, now aborted inserter (suppose the downlink's value is 9). It also locks the same leaf page to establish a "value lock" in precisely the same manner. Finding no room on the first page, it looks further right, maintaining its original "value lock" throughout. It finishes the first inserter's split of the non-value-lock page - a new downlink is inserted into the parent page, with the value 8. It then releases all buffer locks except the first one/original "value lock". A physical insertion has yet to occur. The downlink of the original page cannot contain 9. Because, as I now remember ;-), the downlinks contain low-keys, not high keys. 
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PG Manual: Clarifying the repeatable read isolation example
Feel free to flame me if I should be posting this elsewhere, but after reading the "submitting a patch" guide, it appears I should ask for guidance here. I was reading the Postgres MVCC documentation today (which is generally fantastic BTW), and am slightly confused by a single sentence example, describing possible read-only snapshot isolation anomalies. I would like to submit a patch to clarify this example, since I suspect others may also be confused, but to do that I need help understanding it. The example was added as part of the Serializable Snapshot Isolation patch. Link to the commit: http://git.postgresql.org/gitweb/?p=postgresql.git;h=dafaa3efb75ce1aae2e6dbefaf6f3a889dea0d21 I'm referring to the following sentence of 13.2.2, which is still in the source tree: http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-REPEATABLE-READ "For example, even a read only transaction at this level may see a control record updated to show that a batch has been completed but not see one of the detail records which is logically part of the batch because it read an earlier revision of the control record." I do not understand how this example anomaly is possible. I'm imagining something like the following:

1. Do a bunch of work, possibly in parallel in multiple transactions, that inserts/updates a bunch of detail records.
2. After all that work commits, insert or update a record in the "control" table indicating that the batch completed.

Or maybe:

1. Do a batch of work and update the "control" table in a single transaction.

The guarantee that I believe REPEATABLE READ will give you in either of these cases is that if you see the "control" table record, you will read all the detail records, because the control record is only written if the updated detail records have been committed. What am I not understanding? The most widely cited read-only snapshot isolation example is the bank withdrawal example from this paper: http://www.sigmod.org/publications/sigmod-record/0409/2.ROAnomONeil.pdf. However, I suspect we can present an anomaly that doesn't require as much explanation? Thanks, Evan Jones -- Work: https://www.mitro.co/ Personal: http://evanjones.ca/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas wrote: >> The "left sibling" referred to here is "the first page this key could >> be on", an important concept for unique index enforcement. > > No, it's not "the first page this key could be on". Well, it may be initially. I could have been more cautious about the terminology here. > Also note that _bt_moveright() also finishes any incomplete splits it > encounters (when called for an insertion). So _bt_findinsertloc() never gets > called on a page with the incomplete-split flag set. It might encounter one > when it moves right, but never the first page. Fair enough, but I don't think that affects correctness either way (I don't think that you meant to imply that this was a necessary precaution that you'd taken - right?). It's a nice property, since it makes the extra locking while completing a split within _bt_findinsertloc() particularly infrequent. But, that might also be a bad thing, when considered from a different perspective.

> If I understood correctly, the tree looks like this before the insertion:
>
> Parent page:
> +------------+
> |            |
> | 9 -> A     |
> +------------+
>
> Leaf A
> +------------+
> | HI-key: 9  |
> |            |
> | 7  8  9    |
> +------------+
>
> And after insertion and incomplete split:
>
> Parent page
> +------------+
> |            |
> | 9 -> A     |
> +------------+
>
> Leaf A               Leaf B
> +--------------+     +------------+
> | HI-key: 8    |     | HI-key: 9  |
> | (INCOMPLETE_ |     |            |
> |  SPLIT)      | <-> |            |
> |              |     |            |
> | 7  7*  8     |     | 9          |
> +--------------+     +------------+
>
> After the split is finished, the tree looks like this:
>
> Parent page
> +------------+
> | 8 -> A     |
> | 9 -> B     |
> +------------+
>
> Leaf A               Leaf B
> +------------+     +------------+
> | HI-key: 8  |     | HI-key: 9  |
> |            | <-> |            |
> | 7  7*  8   |     | 9          |
> +------------+     +------------+

How did the parent page change between before and after the final atomic operation (page split completion)? What happened to "9 -> A"? >> Regardless of whether or not I have these details exactly right (that >> is, regardless of whether or not this scenario is strictly possible) I >> suggest as a code-hardening measure that _bt_findinsertloc() release >> its "value lock", upon realizing it must complete splits, and then >> complete the split or splits known to be required. It would finally >> report that it "couldn't find" an insertion location to >> _bt_doinsert(), which would then retry from the start, just like when >> _bt_check_unique() finds an inconclusive conflict. > > Yeah, that would work too. It seems safe enough as it is, though, so I don't > see the point. Well, it would be nice to not have to finish the page split in what is a particularly critical path, with that extra buffer lock. It's not strictly necessary, but then it is theoretically safer, and certainly much clearer. The fact that this code is so seldom executed is one issue that made me revisit this. On the other hand, I can see why you'd want to avoid cluttering up the relatively comprehensible _bt_doinsert() function if it could be avoided. I defer to you. > PS. Thanks for looking into this again! These B-tree changes really need > thorough review. You're welcome. Hopefully my questions will lead you in a useful direction, even if my concerns turn out to be, in the main, unfounded. :-) It previously wasn't in evidence that you'd considered these interactions, and I feel better knowing that you have. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Heikki Linnakangas writes: > On 27 May 2014 18:33:48 EEST, Tom Lane wrote: >> If we were going to do it like that, I'd vote for >> >> --with-uuid={ossp,e2fs,bsd} >> >> with no default at present (ie you can't say just "--with-uuid", >> though we'd have the option to allow that in future). But I doubt >> this is better than the --with-foo-uuid spelling. > FWIW, --with-uuid=foo looks much clearer to me. OK, I'll make it so. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On Mon, May 26, 2014 at 8:15 PM, Robert Haas wrote: > On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas > wrote: > >>I don't think we know that. The server might have crashed before that > >>second record got generated. (This appears to be an unfixable flaw in > >>this proposal.) > > > > The second record is generated before the checkpoint is finished and the > checkpoint record is written. So it will be there. > > > > (if you crash before the checkpoint is finished, the in-progress > checkpoint is no good for recovery anyway, and won't be used) > > Hmm, I see. > > It's not great to have to generate WAL at buffer-eviction time, > though. Normally, when we go to evict a buffer, the WAL is already > written. We might have to wait for it to be flushed, but if the WAL > writer is doing its job, hopefully not. But here we'll definitely > have to wait for the WAL flush. > I'm not sure we do need to flush it. If the checkpoint finishes, then the WAL surely got flushed as part of the process of recording the end of the checkpoint. If the checkpoint does not finish, recovery will start from the previous checkpoint, which does contain the FPW (because if it didn't, the page would not be eligible for this treatment) and so the possibly torn page will get overwritten in full. Cheers, Jeff
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
On Tuesday 27 May 2014 09:52:27, Stephen Frost wrote: > * David Fetter (da...@fetter.org) wrote: > > In the "easy" case of PostgreSQL, you might also be able to establish > > whether the UDT in the "already defined locally" case above is > > identical to the one defined remotely, but I don't think it's possible > > even in principle for non-PostgreSQL remote systems, possibly not even > > for ones with non-identical architecture, PostgreSQL major version, > > etc. > > For my 2c, it'd be very handy if IMPORT had an option or similar to also > copy over all (in the schema) or at least any depended-upon UDTs. It'd > really be nice if we had an ability to check the remote UDT vs. the > local one at some point also, since otherwise you're going to get bitten > at run-time when querying the foreign table. The specification only talks about importing tables, not types, views or (why not?) foreign tables. If we want to extend the spec further, we could accept more than CreateForeignTableStmt as return values from the API:

- CreateDomainStmt
- CompositeTypeStmt
- AlterTableCmd

The core code would be responsible for executing them, and making sure the destination schema is correctly positioned on all of those. The postgres_fdw behaviour could then be controlled with options such as include_types (a tri-valued enum accepting "none", "all", "as_needed") and relation_kinds (defaulting to 'r', with support for multiple kinds via a string such as 'rfv' for tables, foreign tables and views). I think we're drifting further away from the standard with that kind of stuff, but I'd be happy to implement it if that's the path we choose. -- Ronan Dunklau http://dalibo.com - http://dalibo.org signature.asc Description: This is a digitally signed message part.
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On 27 May 2014 18:33:48 EEST, Tom Lane wrote: > Alvaro Herrera writes: >> Tom Lane wrote: >>> The problem is that the long-established spelling is --with-ossp-uuid. >>> I don't think we can break that case. While we could set up something >>> like what you propose alongside it, it doesn't seem like there's any >>> advantage to doing so compared to inventing --with-foo-uuid as needed. >> I was thinking that --with-ossp-uuid would still be required to enable >> UUID generators at all; the other one just selects the implementation to >> use, which defaults to OSSP to maintain backwards compatibility. Maybe >> introduce --with-uuid and have --with-ossp-uuid a deprecated synonym of >> that. > If we were going to do it like that, I'd vote for > --with-uuid={ossp,e2fs,bsd} > with no default at present (ie you can't say just "--with-uuid", > though we'd have the option to allow that in future). But I doubt > this is better than the --with-foo-uuid spelling. FWIW, --with-uuid=foo looks much clearer to me. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On May 27, 2014, at 7:44 AM, Tom Lane wrote: > In either case, the problem remains of exactly what to call the > e2fsprogs-derived implementation. It does seem that people who are > familiar with these libraries call it that, but I'm worried that such > a name will confuse those not so familiar. --with-libuuid? D signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Alvaro Herrera writes: > Tom Lane wrote: >> The problem is that the long-established spelling is --with-ossp-uuid. >> I don't think we can break that case. While we could set up something >> like what you propose alongside it, it doesn't seem like there's any >> advantage to doing so compared to inventing --with-foo-uuid as needed. > I was thinking that --with-ossp-uuid would still be required to enable > UUID generators at all; the other one just selects the implementation to > use, which defaults to OSSP to maintain backwards compatibility. Maybe > introduce --with-uuid and have --with-ossp-uuid a deprecated synonym of > that. If we were going to do it like that, I'd vote for --with-uuid={ossp,e2fs,bsd} with no default at present (ie you can't say just "--with-uuid", though we'd have the option to allow that in future). But I doubt this is better than the --with-foo-uuid spelling. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Tom Lane wrote: > Alvaro Herrera writes: > > Why not --with-uuid-implementation=<impl>, and have <impl> be one of > > e2utils, bsd, ossp, with the latter being the default? We could also > > offer the value "list" or "help" which would list the available options. > > That way, if we come up with a new implementation in the future, this is > > easily extensible. > > The problem is that the long-established spelling is --with-ossp-uuid. > I don't think we can break that case. While we could set up something > like what you propose alongside it, it doesn't seem like there's any > advantage to doing so compared to inventing --with-foo-uuid as needed. I was thinking that --with-ossp-uuid would still be required to enable UUID generators at all; the other one just selects the implementation to use, which defaults to OSSP to maintain backwards compatibility. Maybe introduce --with-uuid and have --with-ossp-uuid a deprecated synonym of that. As a more sophisticated idea: if you say --with-uuid then --with-uuid-implementation is required; if you say --with-ossp-uuid then --with-uuid-implementation is optional and defaults to OSSP. > In either case, the problem remains of exactly what to call the > e2fsprogs-derived implementation. It does seem that people who are > familiar with these libraries call it that, but I'm worried that such > a name will confuse those not so familiar. I vote "e2fsprogs". In the help text for that option, mention that it works on Linux and some other OSes, or something like that, so that people on Linux try that one first, and people on other OSes can web-search to see whether it's available. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > This, plus the generic ability to pass an OPTIONS clause to the IMPORT > > (allowing you to have different defaults for different IMPORT > > statements) and having it be transactional, as you mention, appears to > > be covering all the relevant bases. > > Yeah, as far as the example of coping with differing case goes, I agree > that we'd want IMPORT to just follow whatever the FDW's default or > configured behavior is, since obviously the FDW will have to know how to > reverse the conversion when sending queries later on. So there's no > apparent need for an IMPORT-level option *for that example*. I'm not > sure if we need OPTIONS for IMPORT for any other uses. I'm waffling a bit on this particular point. IMPORT is for bulk operations, of course, but perhaps on one remote server you want to import everything from the dimension tables using the default mapping but everything from the fact or perhaps aggregate tables in some schema as text. You could do that with something like: import #1, change the server-level options, import #2; or you could do just one big import and then go back and fix everything, but... I'd rather introduce this options clause for IMPORT *now*, which means we don't need to care about whether this specific example is really relevant or not, than not have it in 9.5 and have FDW authors unhappy because it's missing (whether they need it for this use case or some other). Thanks, Stephen signature.asc Description: Digital signature
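The two-pass workflow described above might then collapse to something like this under the OPTIONS clause discussed in this thread (import_as_text is a purely hypothetical option name for illustration):

-- dimension tables, using the server's default type mapping
IMPORT FOREIGN SCHEMA sales FROM SERVER dw INTO staging
    LIMIT TO (dim_customer, dim_product);

-- fact tables, imported with every column as text instead
IMPORT FOREIGN SCHEMA sales FROM SERVER dw INTO staging
    LIMIT TO (fact_orders)
    OPTIONS (import_as_text 'true');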
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Alvaro Herrera writes: > Why not --with-uuid-implementation=<impl>, and have <impl> be one of > e2utils, bsd, ossp, with the latter being the default? We could also > offer the value "list" or "help" which would list the available options. > That way, if we come up with a new implementation in the future, this is > easily extensible. The problem is that the long-established spelling is --with-ossp-uuid. I don't think we can break that case. While we could set up something like what you propose alongside it, it doesn't seem like there's any advantage to doing so compared to inventing --with-foo-uuid as needed. In either case, the problem remains of exactly what to call the e2fsprogs-derived implementation. It does seem that people who are familiar with these libraries call it that, but I'm worried that such a name will confuse those not so familiar. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On 2014-05-27 16:36:45 +0200, Matteo Beccati wrote: > On 27/05/2014 15:52, Tom Lane wrote: > > Ah, cool. I had documented this option as only working for FreeBSD, > > but that's obviously too conservative. Anyone know about whether it > > works on OpenBSD? > > I've tried to google "man uuid openbsd" and I got the e2fs package > (which contains uuid/uuid.h and libuuid) instead of a man page, so I > believe that could be another use case for --with-linux-uuid. > > If it's confirmed to be working, that makes two BSD-derived systems > requiring "linux-uuid", so --with-e2fs-uuid or similar would be more > appropriate. When I looked at the e2fs uuid implementation a year or so back, it even had some form of windows support. So +1 to naming it non-os specifically. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Matteo Beccati writes: > On 27/05/2014 15:52, Tom Lane wrote: >> Ah, cool. I had documented this option as only working for FreeBSD, >> but that's obviously too conservative. Anyone know about whether it >> works on OpenBSD? > I've tried to google "man uuid openbsd" and I got the e2fs package > (which contains uuid/uuid.h and libuuid) instead of a man page, so I > believe that could be another use case for --with-linux-uuid. Yeah, googling led me to the same conclusion: e2fsprogs libuuid is definitely available for OpenBSD, but I found no evidence that it has the built-in UUID functions that FreeBSD/NetBSD have. > If it's confirmed to be working, that makes two BSD-derived systems > requiring "linux-uuid", so --with-e2fs-uuid or similar would be more > appropriate. Still don't like that name much, but maybe that's the best we can do. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Tom Lane wrote: > "David E. Wheeler" writes: > > On May 26, 2014, at 9:30 PM, Tom Lane wrote: > >>> How about --with-unix-uuid? Or --with-ext2-uuid? > > >> Meh. "Unix" certainly subsumes BSD, so that doesn't seem like a very > >> useful distinction. I guess we could use "ext2" but that would just > >> confuse most people. > > > --with-uuid? > > I thought about that but figured we'd regret it ... Why not --with-uuid-implementation=<impl>, and have <impl> be one of e2utils, bsd, ossp, with the latter being the default? We could also offer the value "list" or "help" which would list the available options. That way, if we come up with a new implementation in the future, this is easily extensible. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
Stephen Frost writes: > Sure, I was being a bit over-simplistic. As was mentioned up-thread, > the option would rather be "flip all-uppercase to lowercase and all- > lowercase to uppercase, quote any which are mixed", or something along > those lines. What I was trying to get at is that it's up to the FDW > what options it wants to support in this regard and we already have a > way for the admin to pass in useful information to the FDW by way of the > FOREIGN SERVER FDW options. > This, plus the generic ability to pass an OPTIONS clause to the IMPORT > (allowing you to have different defaults for different IMPORT > statements) and having it be transactional, as you mention, appears to > be covering all the relevant bases. Yeah, as far as the example of coping with differing case goes, I agree that we'd want IMPORT to just follow whatever the FDW's default or configured behavior is, since obviously the FDW will have to know how to reverse the conversion when sending queries later on. So there's no apparent need for an IMPORT-level option *for that example*. I'm not sure if we need OPTIONS for IMPORT for any other uses. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Hi Tom, On 27/05/2014 15:52, Tom Lane wrote: > Matteo Beccati writes: >> On 27/05/2014 03:07, Tom Lane wrote: >>> I do not have a machine on which to try --with-bsd-uuid, so it's >>> possible I broke that portion of Matteo's patch. Would someone try >>> that case on a FreeBSD box? > >> I've tested on NetBSD i386 and --with-bsd-uuid worked out of the box. > > Ah, cool. I had documented this option as only working for FreeBSD, > but that's obviously too conservative. Anyone know about whether it > works on OpenBSD? I've tried to google "man uuid openbsd" and I got the e2fs package (which contains uuid/uuid.h and libuuid) instead of a man page, so I believe that could be another use case for --with-linux-uuid. If it's confirmed to be working, that makes two BSD-derived systems requiring "linux-uuid", so --with-e2fs-uuid or similar would be more appropriate. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: > Stephen Frost wrote: > > > It seems like it would often be desirable for the Oracle FDW to smash > > > all-upper-case names to all-lower-case while importing, so that no quoting > > > is needed on either side. I doubt though that this is so desirable that > > > it should happen unconditionally. > > > > The oracle FDW would simply need a foreign-server level option which > > says "smash everything to lowercase on import". > > That's not the same thing though -- consider what happens to the quoting > needs for names with mixed case. If you change mixed case to > all-lowercase, references to such objects using quotes in the > application code would fail because the name is now all-lowercase in > Postgres. A tri-valued enum could do the trick: > lowercase_names={wholly_uppercase_only, all, none} > with the first one being the most sensible and default. Sure, I was being a bit over-simplistic. As was mentioned up-thread, the option would rather be "flip all-uppercase to lowercase and all- lowercase to uppercase, quote any which are mixed", or something along those lines. What I was trying to get at is that it's up to the FDW what options it wants to support in this regard and we already have a way for the admin to pass in useful information to the FDW by way of the FOREIGN SERVER FDW options. This, plus the generic ability to pass an OPTIONS clause to the IMPORT (allowing you to have different defaults for different IMPORT statements) and having it be transactional, as you mention, appears to be covering all the relevant bases. Thanks, stephen signature.asc Description: Digital signature
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
Stephen Frost wrote: > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > Albe Laurenz writes: > > > Oracle follows the SQL standard in folding table names to upper case. > > > So this would normally result in a lot of PostgreSQL foreign tables > > > with upper case names, which would be odd and unpleasant. > > > > > I cannot see a way out of that, but I thought I should mention it. > > > > It seems like it would often be desirable for the Oracle FDW to smash > > all-upper-case names to all-lower-case while importing, so that no quoting > > is needed on either side. I doubt though that this is so desirable that > > it should happen unconditionally. > > The oracle FDW would simply need a foreign-server level option which > says "smash everything to lowercase on import". That's not the same thing though -- consider what happens to the quoting needs for names with mixed case. If you change mixed case to all-lowercase, references to such objects using quotes in the application code would fail because the name is now all-lowercase in Postgres. A tri-valued enum could do the trick: lowercase_names={wholly_uppercase_only, all, none} with the first one being the most sensible and default. > > Between this and the type-mapping questions, it seems likely that > > we're going to need a way for IMPORT FOREIGN SCHEMA to accept > > user-supplied control options, which would in general be specific > > to the FDW being used. (Another thing the SQL committee failed to > > think about.) > > I can see value in having specific mappings defined at the foreign > server level for what IMPORT does by default, but as IMPORT is intended > to be a bulk process, I don't see the value in trying to teach it how to > do a specific mapping for a specific table or column inside the schema. > You run the IMPORT for the entire schema and then go fix up any special > cases or things you wanted differently. Yes, it seems better to offer the ability to create transactional scripts around IMPORT (so it must be able to run in a transaction block -- IMPORT first, then do a bunch of ALTERs), than complicating IMPORT itself infinitely to support hundreds of possible options. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
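As a sketch of how that enum might surface to the DBA (every name here is illustrative, and the lowercase_names option itself is only a proposal at this point):

CREATE SERVER ora_srv FOREIGN DATA WRAPPER oracle_fdw
    OPTIONS (dbserver '//ora-host/XE',
             lowercase_names 'wholly_uppercase_only');  -- hypothetical option

-- an all-uppercase remote name like EMPLOYEES would then import as
-- "employees" and need no quoting on either side, while a mixed-case
-- name like "AcctBalance" would be imported unchanged and keep its quotes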
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
On 27/05/14 13:49, Ronan Dunklau wrote: Between this and the type-mapping questions, it seems likely that we're going to need a way for IMPORT FOREIGN SCHEMA to accept user-supplied control options, which would in general be specific to the FDW being used. (Another thing the SQL committee failed to think about.) So, without extending the syntax too much, we could add options:

IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
    [ { LIMIT TO | EXCEPT } (table_name [, ...]) ]
    [ OPTIONS (option_list) ]

This option list could then contain fdw-specific options, including type mapping. This one looks like a good option to me. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* Ronan Dunklau (ronan.dunk...@dalibo.com) wrote: > So, without extending the syntax too much, we could add options: > > IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema > [ { LIMIT TO | EXCEPT } (table_name [, ...])] > [ OPTIONS (option_list) ] > > This option list could then contain fdw-specific options, including type > mapping. Yup, that looks reasonable to me. > Or we could add a specific clause: > > IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema > [ { LIMIT TO | EXCEPT } (table_name [, ...])] > [ OPTIONS (option_list) ] > [ TYPE MAPPING some mapping_definition ] -1 on this one. > With mapping_definition being either a tuple, or well-defined jsonb format > common accross FDWs. > > A third solution, which I don't like but doesn't modify the SQL grammar, > would > be to encode the options in the remote_schema name. Yeah, don't like that one either. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Albe Laurenz writes: > > Oracle follows the SQL standard in folding table names to upper case. > > So this would normally result in a lot of PostgreSQL foreign tables > > with upper case names, which would be odd and unpleasant. > > > I cannot see a way out of that, but I thought I should mention it. > > It seems like it would often be desirable for the Oracle FDW to smash > all-upper-case names to all-lower-case while importing, so that no quoting > is needed on either side. I doubt though that this is so desirable that > it should happen unconditionally. The oracle FDW would simply need a foreign-server level option which says "smash everything to lowercase on import". > Between this and the type-mapping questions, it seems likely that > we're going to need a way for IMPORT FOREIGN SCHEMA to accept > user-supplied control options, which would in general be specific > to the FDW being used. (Another thing the SQL committee failed to > think about.) I can see value in having specific mappings defined at the foreign server level for what IMPORT does by default, but as IMPORT is intended to be a bulk process, I don't see the value in trying to teach it how to do a specific mapping for a specific table or column inside the schema. You run the IMPORT for the entire schema and then go fix up any special cases or things you wanted differently. That said, I'm not against having a way to pass to IMPORT a similar key/value pair set which we already support for the FDW options on tables, columns, etc, by way of safe-guarding any potential use case for such. Thanks, Stephen signature.asc Description: Digital signature
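That fix-up pass is just ordinary DDL against the imported tables, e.g. (table and column names here are illustrative):

-- adjust the one column whose default mapping wasn't what we wanted
ALTER FOREIGN TABLE staging.orders
    ALTER COLUMN created_at TYPE timestamptz;

-- per-column FDW options work the same way, as with postgres_fdw's column_name
ALTER FOREIGN TABLE staging.orders
    ALTER COLUMN flags OPTIONS (ADD column_name 'FLAGS');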
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Matteo Beccati writes: > On 27/05/2014 03:07, Tom Lane wrote: >> I do not have a machine on which to try --with-bsd-uuid, so it's >> possible I broke that portion of Matteo's patch. Would someone try >> that case on a FreeBSD box? > I've tested on NetBSD i386 and --with-bsd-uuid worked out of the box. Ah, cool. I had documented this option as only working for FreeBSD, but that's obviously too conservative. Anyone know about whether it works on OpenBSD? > I'm attaching a little patch to be applied on top of yours. Thanks, will incorporate this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* David Fetter (da...@fetter.org) wrote: > In the "easy" case of PostgreSQL, you might also be able to establish > whether the UDT in the "already defined locally" case above is > identical to the one defined remotely, but I don't think it's possible > even in principle for non-PostgreSQL remote systems, possibly not even > for ones with non-identical architecture, PostgreSQL major version, > etc. For my 2c, it'd be very handy if IMPORT had an option or similar to also copy over all (in the schema) or at least any depended-upon UDTs. It'd really be nice if we had an ability to check the remote UDT vs. the local one at some point also, since otherwise you're going to get bitten at run-time when querying the foreign table. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* David Fetter (da...@fetter.org) wrote:
> - We make type mappings settable at the level of:
>   - FDW
>   - Instance (a.k.a. cluster)
>   - Database
>   - Schema
>   - Table
>   - Column

While I like the general idea, you seem to be making this appear much more complex than it actually is. For example, I see no value in a table-level "uint4 -> int8" definition. If you want something which is not the default, just set it on the individual columns of the foreign table, exactly how we handle column name differences. There might be some value in being able to redefine what the default is at the FOREIGN SERVER level, as perhaps MySQL DB X and MySQL DB Y could have different default mappings for some reason, but adding in the rest of those levels doesn't add value imv.

> This would consist of:
>   - The remote type
>   - The local type to which it maps
>   - The inbound transformation (default identity)
>   - The outbound transformation (default identity)

The remote type and the local type are known already to the FDW, no? The FDW will need to know how to do whatever conversions we're talking about also, right? (If you want to do it in core PG instead of the FDW then I'm thinking you should probably use a view over top of the foreign table..). What's nice is that this all falls under what an FDW can do *already*, if it wants (and, actually, it could do it on the table-level technically too, if the FDW supports that, but schema? database? these things don't make sense...). For the IMPORT bit, it should just be doing whatever the defaults are unless you've set some different defaults for a given foreign server (also something which could be set using our existing system...).

> ALTER FOREIGN TABLE foo ADD (mapping '{
>   "datetime": "text",
>   "inbound": "IDENTITY",
>   "outbound": "IDENTITY"
> }')

I'm very much against having two different command languages where one is used "when convenient". If we wanted to do this, we should build a full-spec mapping from whatever JSON language you come up with for our utility commands to what we actually implement in the grammar. Thanks, Stephen signature.asc Description: Digital signature
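To make the view suggestion concrete, here is a minimal sketch (the table and column names are made up): if an FDW can only surface a remote column as text, a local view can present the desired type, with the conversion happening entirely on the local side:

-- hypothetical foreign table, as an FDW that only returns text might define it
CREATE FOREIGN TABLE remote_events (
    id      text,   -- remote type with no exact local equivalent
    payload text
) SERVER myserver OPTIONS (table_name 'events');

-- a view over the foreign table applies the mapping without FDW involvement
CREATE VIEW events AS
    SELECT id::int8 AS id, payload
    FROM remote_events;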
Re: [HACKERS] Spreading full-page writes
On 27 May 2014 07:42, Greg Stark wrote: > On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas > wrote: >> >> On 05/26/2014 02:26 PM, Greg Stark wrote: >>> Another idea would be to have separate checkpoints for each buffer >>> partition. You would have to start recovery from the oldest checkpoint of >>> any of the partitions. >> >> Yeah. Simon suggested that when we talked about this, but I didn't >> understand how that works at the time. I think I do now. The key to making >> it work is distinguishing, when starting recovery from the latest >> checkpoint, whether a record for a given page can be replayed safely. I used >> flags on WAL records in my proposal to achieve this, but using buffer >> partitions is simpler. > > Interesting. I just thought of it independently. Actually, I heard it from Doug Tolbert in 2005, based on how another DBMS coped with that issue. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On 05/27/2014 03:18 PM, Simon Riggs wrote: IIRC Koichi had a patch for prefetch during recovery. Heikki, is that the reason you also discussed changing the WAL record format to allow us to identify the blocks touched by recovery more easily? Yeah, that was one use case I had in mind for the WAL format changes. See http://www.postgresql.org/message-id/533d6cbf.6080...@vmware.com. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On 27 May 2014 03:49, Fujii Masao wrote: >> So that gives us a few approaches >> >> * Compressing FPWs gives A >> * Background FPWs gives us B >> which look like we can combine both ideas >> >> * Double-buffering would give us A and B, but not C >> and would be incompatible with other two ideas > > Double-buffering would allow us to disable FPW safely, but it would make > recovery slow. So if we adopt double-buffering, I think that we would also > need to overhaul recovery. Which is also true of Background FPWs. So our options are:

1. Compressed FPWs only
2. Compressed FPWs plus Background FPWs plus Recovery Buffer Prefetch
3. Double Buffering plus Recovery Buffer Prefetch

IIRC Koichi had a patch for prefetch during recovery. Heikki, is that the reason you also discussed changing the WAL record format to allow us to identify the blocks touched by recovery more easily? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On 05/27/2014 02:42 PM, Greg Stark wrote: On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas wrote: On 05/26/2014 02:26 PM, Greg Stark wrote: Another idea would be to have separate checkpoints for each buffer partition. You would have to start recovery from the oldest checkpoint of any of the partitions. Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer partitions is simpler. Interesting. I just thought of it independently. Incidentally you wouldn't actually want to use the buffer partitions per se since the new server might start up with a different number of partitions. You would want an algorithm for partitioning the block space that xlog replay can reliably reproduce regardless of the size of the buffer lock partition table. It might make sense to set it up so it coincidentally ensures all the buffers being flushed are in the same partition or maybe the reverse would be better. Probably it doesn't actually matter. Since you will be flushing the buffers one "redo partition" at a time, you would want to allow the OS to merge the writes within a partition as much as possible. So my even-odd split would in fact be pretty bad. Some sort of striping, e.g. mapping each contiguous 1 MB chunk to the same partition, would be better. I'm assuming you would keep N checkpoint positions in the control file. That also means we can double the checkpoint timeout with only a marginal increase in the worst case recovery time. Since the worst case will be (1 + 1/n)*timeout's worth of wal to replay rather than 2*n. The amount of time for recovery would be much more predictable. Good point. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On 05/27/2014 09:17 AM, Peter Geoghegan wrote: While speccing out a new B-Tree verification tool, I had the opportunity to revisit a thought I had during review concerning _bt_findinsertloc(): that the new coding is unsafe in the event of deferred split completion of a leaf page of a unique index. To recap, we now do the following in _bt_findinsertloc():

/*
 * If this page was incompletely split, finish the split now. We
 * do this while holding a lock on the left sibling, which is not
 * good because finishing the split could be a fairly lengthy
 * operation. But this should happen very seldom.
 */
if (P_INCOMPLETE_SPLIT(lpageop))
{
    _bt_finish_split(rel, rbuf, stack);
    rbuf = InvalidBuffer;
    continue;
}

The "left sibling" referred to here is "the first page this key could be on", an important concept for unique index enforcement. No, it's not "the first page this key could be on". _bt_findinsertloc() does *not* hold a lock on the first valid page the key could go to. It merely ensures that when it steps to the next page, it releases the lock on the previous page only after acquiring the lock on the next page. Throughout the operation, it will hold a lock on *some* page that could legally hold the inserted value, and it acquires the locks in left-to-right order. This is sufficient for the uniqueness checks, because _bt_check_unique() scans all the pages, and _bt_check_unique() *does* hold the first page locked while it scans the rest of the pages. But _bt_findinsertloc() does not. Also note that _bt_moveright() also finishes any incomplete splits it encounters (when called for an insertion). So _bt_findinsertloc() never gets called on a page with the incomplete-split flag set. It might encounter one when it moves right, but never the first page. Anyway, the concern is that there may be problems when we call _bt_finish_split() with that left sibling locked throughout (i.e. finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and itself has a right sibling from the incomplete split (which is usually the value lock page's right-right sibling)). I'm not concerned about performance, since as the comment says it ought to be an infrequent occurrence. I also believe that there are no deadlock hazards. But consider this scenario: * We insert the value 7 into an int4 unique index. We need to split the leaf page. We run out of memory or something, and ours is an incomplete page split. Our transaction is aborted. For the sake of argument, suppose that there are also already a bunch of dead tuples within the index with values of 7, 8 and 9.

If I understood correctly, the tree looks like this before the insertion:

Parent page:
+------------+
|            |
| 9 -> A     |
+------------+

Leaf A
+------------+
| HI-key: 9  |
|            |
| 7  8  9    |
+------------+

And after insertion and incomplete split:

Parent page
+------------+
|            |
| 9 -> A     |
+------------+

Leaf A               Leaf B
+--------------+     +------------+
| HI-key: 8    |     | HI-key: 9  |
| (INCOMPLETE_ |     |            |
|  SPLIT)      | <-> |            |
|              |     |            |
| 7  7*  8     |     | 9          |
+--------------+     +------------+

where 7* is the newly inserted key, with value 7. (you didn't mention at what point the split happens, but in the next paragraph you said the new downlink has value 8, which implies the above split) * Another inserter of the value 7 comes along. It follows exactly the same downlink as the first, now aborted inserter (suppose the downlink's value is 9). It also locks the same leaf page to establish a "value lock" in precisely the same manner. Finding no room on the first page, it looks further right, maintaining its original "value lock" throughout. It finishes the first inserter's split of the non-value-lock page - a new downlink is inserted into the parent page, with the value 8. It then releases all buffer locks except the first one/original "value lock". A physical insertion has yet to occur. Hmm, I think you got confused at this step. When inserting a 7, you cannot "look further right" to find a page with more space, because the HI-key, 8, on the first page stipulates that 7 must go on that page, not some later page. * A third inserter of the value comes along. It gets to a later page than the one locked by the second inserter, preferring the newer downlink with value 8 (the internal-page _bt_binsrch() logic ensures this). That's a contradiction: the downlink with value 8 points to the first page, not some later page. After the split is finished, the tree looks like this:

Parent page
+------------+
| 8 -> A     |
| 9 -> B     |
+------------+

Leaf A               Leaf B
+------------+     +------------+
| HI-key: 8  |     | HI-key: 9  |
|            | <-> |            |
| 7  7*  8   |     | 9          |
+------------+     +------------+

Regardless of whether or not I have these details exactly right (that is, regardless of whether or not this scenario is strictly possible) I suggest as a code-hardening measure that _bt_findinsertloc() release its "value lock", upon realizing it must complete splits, and then complete the split or splits known to be required. It would finally report that it "couldn't find" an insertion location to _bt_doinsert(), which would then retry from the start, just like when _bt_check_unique() finds an inconclusive conflict. Yeah, that would work too. It seems safe enough as it is, though, so I don't see the point. PS. Thanks for looking into this again! These B-tree changes really need thorough review. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
>> Additionally, I don't really see how that would be useful in a general case. >> With an "in-core" defined meaning of type transformation, any FDW that doesn't >> fit exactly into the model would have a hard time. For example, what happens >> if an FDW is only ever capable of returning text? > That's actually the case where it's most important to have the feature > all the way down to the column level. I'm not sure configuration about specific columns from specific tables would be that useful: if you already know the structure of the table, you can either create it yourself, or run an alter column statement just after creating it. Alternatively, with the arbitrary options clause proposed by Tom and detailed below, one could use the LIMIT TO / EXCEPT clauses to fine-tune what options should apply. > That's why I suggested doing it via CREATE/ALTER with JSONB or similar > containing the details rather than inventing new SQL grammar, an > option I included only to dismiss it. Hum, adding a simple TYPE MAPPING is already inventing new SQL grammar, but it's less invasive. > Between this and the type-mapping questions, it seems likely that > we're going to need a way for IMPORT FOREIGN SCHEMA to accept > user-supplied control options, which would in general be specific > to the FDW being used. (Another thing the SQL committee failed to > think about.) So, without extending the syntax too much, we could add options:

IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
    [ { LIMIT TO | EXCEPT } (table_name [, ...]) ]
    [ OPTIONS (option_list) ]

This option list could then contain fdw-specific options, including type mapping. Or we could add a specific clause:

IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
    [ { LIMIT TO | EXCEPT } (table_name [, ...]) ]
    [ OPTIONS (option_list) ]
    [ TYPE MAPPING some mapping_definition ]

With mapping_definition being either a tuple, or a well-defined jsonb format common across FDWs. A third solution, which I don't like but which doesn't modify the SQL grammar, would be to encode the options in the remote_schema name. Would one of those solutions be acceptable? -- Ronan Dunklau http://dalibo.com - http://dalibo.org signature.asc Description: This is a digitally signed message part.
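Under the first variant, an invocation might look like this (the include_types and relation_kinds option names come from the earlier suggestion in this thread; everything here is hypothetical until the syntax is settled):

IMPORT FOREIGN SCHEMA public
    FROM SERVER remote_pg
    INTO local_mirror
    LIMIT TO (orders, order_lines)
    OPTIONS (include_types 'as_needed', relation_kinds 'rv');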
Re: [HACKERS] Spreading full-page writes
On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas wrote: > > On 05/26/2014 02:26 PM, Greg Stark wrote: >> >>> Another idea would be to have separate checkpoints for each buffer >> partition. You would have to start recovery from the oldest checkpoint of >> any of the partitions. > > Yeah. Simon suggested that when we talked about this, but I didn't understand > how that works at the time. I think I do now. The key to making it work is > distinguishing, when starting recovery from the latest checkpoint, whether a > record for a given page can be replayed safely. I used flags on WAL records > in my proposal to achieve this, but using buffer partitions is simpler. Interesting. I just thought of it independently. Incidentally you wouldn't actually want to use the buffer partitions per se since the new server might start up with a different number of partitions. You would want an algorithm for partitioning the block space that xlog replay can reliably reproduce regardless of the size of the buffer lock partition table. It might make sense to set it up so it coincidentally ensures all the buffers being flushed are in the same partition or maybe the reverse would be better. Probably it doesn't actually matter. > For simplicity, let's imagine that we have two Redo-pointers for each > checkpoint record: one for even-numbered pages, and another for odd-numbered > pages. When checkpoint begins, we first update the Even-redo pointer to the > current WAL insert location, and then flush all the even-numbered buffers in > the buffer cache. Then we do the same for Odd. Hm, I had convinced myself that the LSN on the pages would mean you skip the replay anyways but I think I was wrong and you would need to keep a bitmap of which partitions were in recovery mode as you replay and keep adding partitions until they're all in recovery mode and then keep going until you've seen the checkpoint record for all of them. I'm assuming you would keep N checkpoint positions in the control file. That also means we can double the checkpoint timeout with only a marginal increase in the worst case recovery time. Since the worst case will be (1 + 1/n)*timeout's worth of wal to replay rather than 2*n. The amount of time for recovery would be much more predictable. > Recovery begins at the Even-redo pointer. Replay works as normal, but until > you reach the Odd-pointer, you refrain from replaying any changes to > Odd-numbered pages. After reaching the odd-pointer, you replay everything as > normal. > > Hmm, that seems actually doable... -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On 05/26/2014 02:26 PM, Greg Stark wrote: On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas wrote: The second record is generated before the checkpoint is finished and the checkpoint record is written. So it will be there. (if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't be used) Another idea would be to have separate checkpoints for each buffer partition. You would have to start recovery from the oldest checkpoint of any of the partitions. Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer partitions is simpler. For simplicity, let's imagine that we have two Redo-pointers for each checkpoint record: one for even-numbered pages, and another for odd-numbered pages. When checkpoint begins, we first update the Even-redo pointer to the current WAL insert location, and then flush all the even-numbered buffers in the buffer cache. Then we do the same for Odd. Recovery begins at the Even-redo pointer. Replay works as normal, but until you reach the Odd-pointer, you refrain from replaying any changes to Odd-numbered pages. After reaching the odd-pointer, you replay everything as normal. Hmm, that seems actually doable... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw and connection management
On Tue, May 27, 2014 at 12:32:50AM -0300, Fabrízio de Royes Mello wrote: > On Mon, May 26, 2014 at 11:47 PM, Shigeru Hanada > wrote: > > > > 2014-05-24 0:09 GMT+09:00 Sandro Santilli : > > > Indeed I tried "DISCARD ALL" in the hope it would help, so I like your > > > idea of allowing extensions to register a hook there. > > > > > > Still, I'd like the FDW handler itself to possibly be configured > > > to disable the pool completely as a server-specific configuration. > > > > Connection management seems an FDW-specific feature to me. How about > > adding an FDW option, say pool_connection=true|false, to postgres_fdw which > > allows per-server configuration? > Makes sense... but if we use "pool_connection=true" and want to close the > opened connection, how can we do that? Right, I still consider hooks on DISCARD a useful addition. --strk; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
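If the per-server option suggested above were adopted, disabling the pool might look like this (the pool_connection option is hypothetical at this point, and the server name is illustrative):

-- postgres_fdw would then open and close a remote connection per statement
-- instead of caching it for the life of the local session
ALTER SERVER geo_server OPTIONS (ADD pool_connection 'false');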
Re: [HACKERS] Spreading full-page writes
On 05/26/2014 11:15 PM, Robert Haas wrote:
> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas wrote:
>>> I don't think we know that. The server might have crashed before that second record got generated. (This appears to be an unfixable flaw in this proposal.)
>>
>> The second record is generated before the checkpoint is finished and the checkpoint record is written. So it will be there.
>>
>> (if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't be used)
>
> Hmm, I see.
>
> It's not great to have to generate WAL at buffer-eviction time, though. Normally, when we go to evict a buffer, the WAL is already written. We might have to wait for it to be flushed, but if the WAL writer is doing its job, hopefully not. But here we'll definitely have to wait for the WAL flush.

Yeah. You would want to batch the flushes somehow, instead of flushing the WAL for every buffer being flushed. For example, after writing the FPW WAL record, just continue with the checkpoint without flushing the buffer, and do a second pass later doing the buffer flushes.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
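The batching Heikki describes could look roughly like this: emit the FPW records for a batch of buffers first, then issue one WAL flush that covers all of them before writing any buffer to disk. Every function name here is an invented stand-in for this sketch, not PostgreSQL API.

#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position type */

/* Hypothetical helpers standing in for xlog.c / bufmgr.c internals. */
extern XLogRecPtr EmitFullPageRecord(int buf_id);
extern void XLogFlushUpTo(XLogRecPtr upto);
extern void WriteBufferToDisk(int buf_id);

static void
flush_buffers_batched(const int *buf_ids, size_t n)
{
    XLogRecPtr  max_lsn = 0;

    /* Pass 1: emit an FPW record per buffer, remembering the highest LSN. */
    for (size_t i = 0; i < n; i++)
    {
        XLogRecPtr  lsn = EmitFullPageRecord(buf_ids[i]);

        if (lsn > max_lsn)
            max_lsn = lsn;
    }

    /* One WAL flush for the whole batch instead of one per buffer. */
    XLogFlushUpTo(max_lsn);

    /* Pass 2: now the buffers themselves can be written safely. */
    for (size_t i = 0; i < n; i++)
        WriteBufferToDisk(buf_ids[i]);
}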
Re: [HACKERS] Spreading full-page writes
On Tue, May 27, 2014 at 3:57 PM, Simon Riggs wrote:
> On 25 May 2014 17:52, Heikki Linnakangas wrote:
>> Here's an idea I tried to explain to Andres and Simon at the pub last night, on how to reduce the spikes in the amount of WAL written at the beginning of a checkpoint that full-page writes cause. I'm just writing this down for the sake of the archives; I'm not planning to work on this myself.
> ...
> Thanks for that idea, and dinner. It looks useful.
>
> I'll call this idea "Background FPWs".
>
>> Now, I'm sure there are issues with this scheme I haven't thought about, but I wanted to get this written down. Note this does not reduce the overall WAL volume - on the contrary - but it ought to reduce the spike.
>
> The requirements we were discussing were around
>
> A) reducing WAL volume
> B) reducing foreground overhead of writing FPWs - which spikes badly after checkpoint, and the overhead is paid by the user processes themselves
> C) need for FPWs during base backup
>
> So that gives us a few approaches:
>
> * Compressing FPWs gives A
> * Background FPWs gives us B
> which looks like we can combine both ideas
>
> * Double-buffering would give us A and B, but not C, and would be incompatible with the other two ideas

Double-buffering would allow us to disable FPWs safely, but it would make recovery slow. So if we adopt double-buffering, I think we would also need to overhaul recovery.

Regards,

-- Fujii Masao

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
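For context on the double-buffering approach: one common form of it is a doublewrite buffer (as used by InnoDB, for example). A minimal sketch of the core trick, assuming plain POSIX I/O and names invented purely for illustration: durably stage each page copy before overwriting the page in place, so a torn in-place write can be repaired from the staged copy at recovery instead of from a full-page image in WAL.

#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 8192

static int
double_buffered_write(int staging_fd, int data_fd, off_t offset,
                      const char *page)
{
    /* 1. Durably stage the page copy first. */
    if (pwrite(staging_fd, page, PAGE_SIZE, 0) != PAGE_SIZE)
        return -1;
    if (fsync(staging_fd) != 0)
        return -1;

    /*
     * 2. Only now overwrite the page in place.  If this write tears in a
     * crash, recovery restores the intact staged copy, so no full-page
     * image is needed in WAL.  The cost is writing every page twice,
     * which is why recovery and I/O volume are the concerns here.
     */
    if (pwrite(data_fd, page, PAGE_SIZE, offset) != PAGE_SIZE)
        return -1;

    return 0;
}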
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Hi Tom,

On 27/05/2014 03:07, Tom Lane wrote:
> I've verified functionality of this patch on these scenarios:
>
> (1) --with-ossp-uuid on RHEL6, using uuid-1.6.1-10.el6.x86_64
> (2) --with-linux-uuid on RHEL6, using libuuid-2.17.2-12.14.el6_5.x86_64
> (3) --with-linux-uuid on OS X 10.9.3, Intel
> (4) --with-linux-uuid on OS X 10.4.11, PPC (hence, bigendian)
>
> I do not have a machine on which to try --with-bsd-uuid, so it's possible I broke that portion of Matteo's patch. Would someone try that case on a FreeBSD box?

I've tested on NetBSD i386 and --with-bsd-uuid worked out of the box. I could fire up some virtual machines with FreeBSD and other BSD flavours, but maybe some buildfarm animals could be used for that.

I'm attaching a little patch to be applied on top of yours. I didn't like that "buf ? 13 : 0" was raising a warning about the condition always being true on BSD. I guess it's safe to hardcode 13, as the argument is ignored anyway with OSSP, so I've fixed that.

I've also fixed v1mc generation on "linux" to match the OSSP and BSD variants, and added a regression test for it.

Cheers

--
Matteo Beccati
Development & Consulting - http://www.beccati.com/

diff --git a/contrib/uuid-ossp/expected/uuid_ossp.out b/contrib/uuid-ossp/expected/uuid_ossp.out
index f393e86..c14db22 100644
--- a/contrib/uuid-ossp/expected/uuid_ossp.out
+++ b/contrib/uuid-ossp/expected/uuid_ossp.out
@@ -77,3 +77,18 @@ SELECT uuid_generate_v4()::text ~ '^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]
  t
 (1 row)
 
+DO $_$
+DECLARE
+  u text;
+  i int;
+  c int;
+BEGIN
+  FOR i in 1..32 LOOP
+    u := substr(uuid_generate_v1mc()::text, 25, 2);
+    EXECUTE 'SELECT x''' || u || '''::int & 3' INTO c;
+    IF c <> 3 THEN
+      RAISE WARNING 'v1mc broken';
+    END IF;
+  END LOOP;
+END;
+$_$;
diff --git a/contrib/uuid-ossp/sql/uuid_ossp.sql b/contrib/uuid-ossp/sql/uuid_ossp.sql
index 8f22417..61a44a8 100644
--- a/contrib/uuid-ossp/sql/uuid_ossp.sql
+++ b/contrib/uuid-ossp/sql/uuid_ossp.sql
@@ -17,3 +17,20 @@ SELECT uuid_generate_v3(uuid_ns_dns(), 'www.widgets.com');
 SELECT uuid_generate_v5(uuid_ns_dns(), 'www.widgets.com');
 
 SELECT uuid_generate_v4()::text ~ '^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$';
+
+DO $_$
+DECLARE
+  u text;
+  i int;
+  c int;
+BEGIN
+  FOR i in 1..32 LOOP
+    u := substr(uuid_generate_v1mc()::text, 25, 2);
+    EXECUTE 'SELECT x''' || u || '''::int & 3' INTO c;
+    IF c <> 3 THEN
+      RAISE WARNING 'v1mc broken';
+    END IF;
+  END LOOP;
+END;
+$_$;
+
diff --git a/contrib/uuid-ossp/uuid-ossp.c b/contrib/uuid-ossp/uuid-ossp.c
index bc29ade..7803dbe 100644
--- a/contrib/uuid-ossp/uuid-ossp.c
+++ b/contrib/uuid-ossp/uuid-ossp.c
@@ -460,6 +460,10 @@ uuid_generate_v1mc(PG_FUNCTION_ARGS)
 	uuid_t		uu;
 
 	uuid_generate_random(uu);
+
+	/* set IEEE802 multicast and local-admin bits */
+	((dce_uuid_t *) &uu)->node[0] |= 0x03;
+
 	uuid_unparse(uu, strbuf);
 	buf = strbuf + 24;
 #else			/* BSD */
@@ -472,7 +476,7 @@ uuid_generate_v1mc(PG_FUNCTION_ARGS)
 #endif
 
 	return uuid_generate_internal(UUID_MAKE_V1 | UUID_MAKE_MC, NULL,
-								  buf, buf ? 13 : 0);
+								  buf, 13);
 }

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
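As a standalone illustration of what the node[0] |= 0x03 line does (this is not part of the patch): in a v1mc UUID's node field, bit 0 of the first octet is the IEEE 802 multicast bit and bit 1 is the locally-administered bit, which is exactly what the regression test above checks with "& 3".

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint8_t node0 = 0x52;       /* pretend first node octet of a random MAC */

    node0 |= 0x03;              /* force multicast + locally-administered */
    printf("node[0] = 0x%02x, low two bits = %d\n", node0, node0 & 3);
    return 0;                   /* prints: node[0] = 0x53, low two bits = 3 */
}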
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
David Fetter writes:
>>> Also worth considering: functions which take any part of the view as a parameter.
>>
>> Sorry, I don't get it: do you suggest we should re-create dependent functions too?
>
> I'd throw an error in cases where such functions had an obvious and deterministic dependency on the views, ideally having gone through all such functions first and enumerated them in the error message.

Then it would also make sense to start by checking function dependencies on the tables themselves, not only on the joining views:

psql=> CREATE TABLE t(id INT);
CREATE TABLE
psql=> CREATE FUNCTION func1() RETURNS SETOF INT AS $$ SELECT id FROM t; $$ LANGUAGE SQL;
CREATE FUNCTION
psql=> ALTER TABLE t ALTER COLUMN id TYPE BIGINT;
ALTER TABLE    -- Would complain about func1 right away
psql=> SELECT func1();
ERROR:  return type mismatch in function declared to return integer
DETAIL:  Actual return type is bigint.
CONTEXT:  SQL function "func1" during startup
psql=> CREATE FUNCTION func2() RETURNS SETOF INT AS $$ SELECT id FROM t; $$ LANGUAGE SQL;
ERROR:  return type mismatch in function declared to return integer
DETAIL:  Actual return type is bigint.
CONTEXT:  SQL function "func2"

-- Alex

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers