Re: [HACKERS] JSON for PG 9.2
2012/1/11 Robert Haas :
> On Tue, Dec 20, 2011 at 9:06 PM, David E. Wheeler wrote:
>> On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote:
>>> Are people explicitly asking for a) *JSON* datatype or b) a type that lets
>>> you store arbitrary complex semi-untyped data structures?
>>
>> Yes.
>>
>>> if b) then this might get a lot more interesting
>>
>> JSON is the most popular/likely way to represent that, I think.
>
> On that note, here's an updated version of the patch I posted
> upthread, with some regression tests and minimal documentation.

I like this patch and this feature. I see only one issue - there is no functionality that helps generate JSON in pg.

What do you think about the functions array_to_json(anyarray), row_to_json(any) and format_json(text, text, ...)?

Regards

Pavel

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpoint writeback via sync_file_range
On 1/10/12 9:14 PM, Robert Haas wrote:
> Based on that, I whipped up the attached patch, which, if sync_file_range
> is available, simply iterates through everything that will eventually be
> fsync'd before beginning the write phase and tells the Linux kernel to
> put them all under write-out.

I hadn't really thought of using it that way. The kernel expects that when this is called the normal way, you're going to track exactly which segments you want it to sync. And that data isn't really passed through the fsync absorption code yet; the list of things to fsync has already lost that level of detail. What you're doing here doesn't care, though, and I hadn't considered that SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its docs. Used this way, it's basically fsync without the wait or guarantee; it just tries to push what's already dirty further ahead of the write queue than those writes would otherwise be.

One idea I was thinking about here was building a little hash table inside of the fsync absorb code, tracking how many absorb operations have happened for whatever the most popular relation files are. The idea is that we might say "use sync_file_range every time some number of calls for a relation have come in", just to keep from ever accumulating too many writes to any one file before trying to nudge some of it out of there.

The bat that keeps hitting me in the head here is that right now, a single fsync might have a full 1GB of writes to flush out, perhaps because it extended a table and then wrote more than that to it. And in everything but an SSD or giant SAN cache situation, 1GB of I/O is just too much to fsync at a time without the OS choking a little on it.

> I don't know that I have a suitable place to test this, and I'm not quite
> sure what a good test setup would look like either, so while I've tested
> that this appears to issue the right kernel calls, I am not sure whether
> it actually fixes the problem case.
I'll put this into my testing queue after the upcoming CF starts.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com
Re: [HACKERS] Sending notifications from the master to the standby
BTW ... it occurs to me to ask whether we really have a solid use-case for having listeners attached to slave servers. I have personally never seen an application for LISTEN/NOTIFY in which the listeners were entirely read-only. Even if there are one or two cases out there, it's not clear to me that supporting it is worth the extra complexity that seems to be needed.

regards, tom lane
Re: [HACKERS] Sending notifications from the master to the standby
Joachim Wieland writes:
> On Tue, Jan 10, 2012 at 11:55 AM, Tom Lane wrote:
>> Simon Riggs writes:
>>> Suggest we add something to initial handshake from standby to say
>>> "please send me notify traffic",
>> +1 on that.
>
> From what you said I imagined this walsender listener as a regular
> listener that listens on the union of all sets of channels that
> anybody is listening on on the standby, with the LISTEN transaction on
> the standby returning from commit once the listener is known to have
> been set up on the master.

This seems vastly overcomplicated too. I'd just vote for a simple yes/no flag, so that receivers that have no interest in notifies don't have to deal with them.

regards, tom lane
Re: [HACKERS] Real-life range datasets
Hi,

I'm proposing OpenStreetMap, whose XML data for the whole world comes to more than 250 GB. It's downloadable from CloudMade.com or Geofabrik.de and can be imported into PostgreSQL using osm2pgsql. It's literally a key/value schema of the real world. I'm using the hstore option of osm2pgsql, and the hstore index is based on GiST.

I'm running a database instance called "PostGIS Terminal" which is a daily extract of Switzerland: http://labs.geometa.info/postgisterminal/?xapi=node%5Bname%3DHochschule%20Rapperswil%5D

This is a typical query, which extracts 'real' highways (having geometry linestring, aka line/way) with a speed limit >= 100 km/h:

SELECT ST_AsText(way) geom
FROM osm_line
WHERE tags @> '"highway"=>"motorway"'
  AND coalesce(substring((tags->'maxspeed') FROM E'[0-9]+')::int, 0) >= 100

Yours, Stefan

2011/12/23 Alexander Korotkov :
> Hello,
>
> On Thu, Dec 22, 2011 at 12:51 PM, Benedikt Grundmann wrote:
>> I should be able to give you a table with the same characteristics as
>> the instruments table but bogus data by replacing all entries in the
>> table with random strings of the same length or something like that.
>> I can probably take a little bit of time during this or the next week
>> to generate such "fake" real world data ;-) Is there an ftp site to
>> upload the gzipped pg_dump file to?
>
> Thank you very much for your response! I'm going to send you upload
> details soon.
>
> -
> With best regards,
> Alexander Korotkov.
[HACKERS] checkpoint writeback via sync_file_range
Greg Smith muttered a while ago about wanting to do something with sync_file_range to improve checkpoint behavior on Linux. I thought he was talking about trying to sync only the range of blocks known to be dirty, which didn't seem like a very exciting idea, but after looking at the man page for sync_file_range, I think I understand what he was really going for: sync_file_range allows you to hint to the Linux kernel that you'd like it to clean a certain set of pages. I further recall from Greg's previous comments that in the scenarios he's seen, checkpoint I/O spikes are caused not so much by the data written out by the checkpoint itself as by the other dirty data in the kernel buffer cache.

Based on that, I whipped up the attached patch, which, if sync_file_range is available, simply iterates through everything that will eventually be fsync'd before beginning the write phase and tells the Linux kernel to put them all under write-out.

I don't know that I have a suitable place to test this, and I'm not quite sure what a good test setup would look like either, so while I've tested that this appears to issue the right kernel calls, I am not sure whether it actually fixes the problem case. But here's the patch, anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

writeback-v1.patch
Description: Binary data
Re: [HACKERS] JSON for PG 9.2
On Tue, Dec 20, 2011 at 9:06 PM, David E. Wheeler wrote:
> On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote:
>> Are people explicitly asking for a) *JSON* datatype or b) a type that lets
>> you store arbitrary complex semi-untyped data structures?
>
> Yes.
>
>> if b) then this might get a lot more interesting
>
> JSON is the most popular/likely way to represent that, I think.

On that note, here's an updated version of the patch I posted upthread, with some regression tests and minimal documentation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

json-rmh-v2.patch
Description: Binary data
Re: [HACKERS] log messages for archive recovery progress
On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote:
>> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file
>> "00080046" from archive
>> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080046
>> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file
>> "00080047" from archive
>> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080047
>> cp: cannot stat `/backups/archlog/00080048': No such file or
>> directory
>> [2011-12-08 15:14:37 JST] 16758: LOG: could not restore file
>> "00080048" from archive
>> [2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog
>> [2011-12-08 15:14:37 JST] 16758: LOG: recoverying 00080048

What about just 'restored log file "00080048" from pg_xlog' instead of the last two messages? If you can't read from pg_xlog, emit 'could not restore file "00080048" from pg_xlog'.

--
Euler Taveira de Oliveira - Timbira       http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: [HACKERS] Sending notifications from the master to the standby
On Tue, Jan 10, 2012 at 11:55 AM, Tom Lane wrote:
> Simon Riggs writes:
> [ Tom sketches a design ]
> Seems a bit overcomplicated. I was just thinking of having walreceiver
> note the WAL endpoint at the instant of receipt of a notify message,
> and not release the notify message to the slave ring buffer until WAL
> replay has advanced that far.

How about this: We mark a notify message specially if it is the last message sent by a transaction, and also add a flag to commit/abort records indicating whether or not the transaction has sent notifies. Now if such a last message is being put into the regular ring buffer on the standby and the xid is known to have committed or aborted, signal the backends. Also signal from a commit/abort record if the flag is set.

If the notify messages make it to the standby first, we just put messages of a not-yet-committed transaction into the queue, just as on the master. Listeners will get signaled when the commit record arrives. If the commit record arrives first, we signal, but the listeners won't find anything (at least not the latest notifications). When the last notify of that transaction finally arrives, the transaction is known to have committed and the listeners will get signaled.

What could still happen is that the standby receives notifies, the commit message, and more notifies. Listeners would still eventually get all the messages, but potentially not all of them at once. Is this a problem? If so, then we could add a special "stop reading" record into the queue before we write the notifies, which we subsequently change into a "continue reading" record once all notifications are in the queue. Readers would treat a "stop reading" record just like a not-yet-committed transaction, and ignore a "continue reading" record.

>> Suggest we add something to initial handshake from standby to say
>> "please send me notify traffic",
>
> +1 on that.
From what you said, I imagined this walsender listener as a regular listener that listens on the union of all sets of channels that anybody is listening on on the standby, with the LISTEN transaction on the standby returning from commit once the listener is known to have been set up on the master.

Joachim
Re: [HACKERS] pgsphere
On Wed, 2012-01-11 00:43:25 +0400, Oleg Bartunov wrote:
> On Tue, 10 Jan 2012, Jan-Benedict Glaw wrote:
>> On Sun, 2012-01-08 22:19:53 +0400, Oleg Bartunov wrote:
>>> pgsphere now could benefit very much from our KNNGiST feature.
>>> This is number one development from my point of view. I and
>>> Teodor have no time to work on pgsphere, sorry. But, there are
>>> some astronomers I'm working with, who can take part in this.
>>> Sergey Karpov has done extensive benchmarks of q3c, rtree and
>>> pgsphere and found the latter still has some benefits in some
>>> workloads, so we are interested in development.
>> Could the PostGIS stuff be abused for stellar coordinates?
> There is no principal difference between the celestial sphere and the
> earth; it's a matter of conversion between coordinates.

I'm a hobby astronomer myself, so I'm asking myself what is actually needed. I had a look at pgsphere some weeks ago, but didn't use it, because it seemed to be somewhat dead. My next approach was to load a dataset as-is (with floating point RA and Dec) and then create a new table selecting all data and converting the coordinates to POINTs. Even with a combined index on (RA, Dec) (as well as one on the POINT column of the new PostGIS-enabled table), the latter was considerably faster when searching for specific areas etc.

One important thing that's needed is transformation between equatorial, ecliptic, galactic and probably (local, incorporating current local longitude/latitude and time) horizontal coordinates. What might be important, too, is to be able to change between J2000.0 and B1950.0 etc.

Probably the easiest thing is to change the (printable) representation of coordinates, because (depending on the people you talk to), especially for RA (equivalent to the earth's longitude), there are at least two totally different notations used:

2h 30min 4.3sec = 2.501194h = 37.5179°

The other axis is usually simply written in degrees.
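The conversion above is simple arithmetic; here is a small sketch of it (helper names are mine, nothing here comes from pgsphere or PostGIS) turning a sexagesimal RA into decimal hours and then degrees:

```python
# Illustrative helpers for the RA notation conversion quoted above.
# 1 hour of right ascension = 15 degrees (the sky turns 360° in 24 h).

def ra_hms_to_hours(h, m, s):
    """Sexagesimal RA (hours, minutes, seconds) -> decimal hours."""
    return h + m / 60.0 + s / 3600.0

def ra_hours_to_degrees(hours):
    """Decimal hours of RA -> degrees."""
    return hours * 15.0

hours = ra_hms_to_hours(2, 30, 4.3)
deg = ra_hours_to_degrees(hours)
print(round(hours, 6), round(deg, 4))   # 2.501194 37.5179
```

This reproduces the example in the mail: 2h 30min 4.3sec is 2.501194 decimal hours, which is 37.5179°.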
MfG, JBG

--
Jan-Benedict Glaw   jbg...@lug-owl.de   +49-172-7608481
Signature of: Wenn ich wach bin, träume ich. ("When I am awake, I dream.")
Re: [HACKERS] pgsphere
On 1/10/12 9:54 AM, Andrew Dunstan wrote:
> Speaking with my pgfoundry admin hat on, I am extremely reluctant to take
> control of a project away from its owners. If you don't get any action in
> a week or so, you can approach the pgfoundry admins for help. In the
> meantime you might want to fork the code onto github or bitbucket.

Man, the pgfoundry admin hat has to be one of the least collectible ones around. Not exactly a lot of demand for them, and everyone who has one would happily give theirs away.

I don't want to drag this off-topic thread on, but it's worth mentioning that http://wiki.postgresql.org/wiki/Project_Hosting has started a small migration guide of sorts for where else you might host this sort of project. It's not necessarily obvious what pgfoundry provides relative to other sites, or what the trade-offs in the other possibilities are.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com
Re: [HACKERS] Command Triggers
Hi,

Robert Haas writes:
> Maybe we should try to split the baby here and defer the question of
> whether to expose any of the parse tree internals, and if so how much,
> to a future release. It seems to me that we could design a fairly
> useful set of functionality around AFTER-CREATE, BEFORE-DROP, and
> maybe even AFTER-ALTER triggers without exposing any parse tree
> details.

+1

Also remember that you have a “normalized” command string to play with. Lots of use cases are already covered here. The other ones would need a tree representation that's easy to consume, which in the current state of affairs (I saw no progress on the JSON data type and facilities) is very hard to imagine when you consider PL/pgSQL.

So unless I hear about a show-stopper, I'm going to work some more on the command trigger patch, where I still have some rough edges to polish.

Regards,

--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] pgsphere
On Tue, 10 Jan 2012, Jan-Benedict Glaw wrote:
> On Sun, 2012-01-08 22:19:53 +0400, Oleg Bartunov wrote:
>> pgsphere now could benefit very much from our KNNGiST feature. This is
>> number one development from my point of view. I and Teodor have no
>> time to work on pgsphere, sorry. But, there are some astronomers I'm
>> working with, who can take part in this. Sergey Karpov has done
>> extensive benchmarks of q3c, rtree and pgsphere and found the latter
>> still has some benefits in some workloads, so we are interested in
>> development.
> Could the PostGIS stuff be abused for stellar coordinates?

There is no principal difference between the celestial sphere and the earth; it's a matter of conversion between coordinates.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Re: [HACKERS] random_page_cost vs seq_page_cost
On 1/5/12 5:04 AM, Benedikt Grundmann wrote:
> I have a question of how to benchmark hardware to determine the
> appropriate ratio of seq_page_cost vs random_page_cost. Emails in this
> mailing list's archive seem to indicate that 1.0 vs 3.0 - 4.0 are
> appropriate values on modern hardware. Which surprised me a bit, as I
> had thought that on actual harddrives (ignoring SSDs) random_page_cost
> is higher. I guess that the number tries to reflect caching of the
> relevant pages in memory, and on modern hardware you have more of that?

That sort of thing is one reason why all attempts so far to set random_page_cost based on physical characteristics haven't gone anywhere useful. The setting is sort of overloaded right now; it's a fuzzy mix of true random seek cost blended with some notion of cache percentage. Trying to bring some measurements to bear on it is a less effective approach than what people actually do here: monitor the profile of query execution, change the value, and see what happens. Use that as feedback for what direction to keep going; repeat until you're just spinning with no improvements.

It's easy to measure the actual read times and set the value based on that instead. But that doesn't actually work out so well. There are at least three problems in that area:

- Timing information is sometimes very expensive to collect. This I expect to at least document, and usefully quantify why, as a 9.2 feature.

- Basing query execution decisions on what is already in the cache leads to all sorts of nasty feedback situations where you optimize for the short term, for example using an index already in cache, while never reading in what would be a superior long-term choice because it seems too expensive.

- Making a major adjustment to the query planning model like this would require a large performance regression testing framework to evaluate the results in.
> We are not sure if the database used to choose differently before the
> move to the new hardware and the hardware is performing worse for
> random seeks. Or if the planner is now making different choices.

I don't recommend ever deploying new hardware without first doing some low-level benchmarks to validate its performance. Once stuff goes into production, you can't do that anymore. See http://www.2ndquadrant.com/en/talks/ for my hardware benchmarking talks if you'd like some ideas on what to collect.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com
Re: [HACKERS] libpq: PQcmdStatus, PQcmdTuples signatures can be painlessly improved
,--- You/Peter (Tue, 10 Jan 2012 19:13:42 +0200) *
| On tis, 2011-12-13 at 07:55 -0500, Alex Goncharov wrote:
| > char *PQcmdStatus(PGresult *res);
| > char *PQcmdTuples(PGresult *res);
| >
| > Unreasonable:
| >
| > a. What, these two can modify 'res' I pass in?..
| >
| > b. Oh, yes, because they return 'char *' pointing to
| > 'res->cmdStatus+n', so, a libpq user may write:
| >
| >     char *s = PQcmdStatus(res);
| >     *s = 'x';
| >
| > and have 'res' modified. (Would be the user's fault, of course.)
|
| Note that const PGresult * would only warn against changing the
| fields

It would not warn, it would err (the compilation should fail).

| of the PGresult struct. It doesn't do anything about changing the data
| pointed to by pointers in the PGresult struct. So what you are saying
| doesn't follow.
`-*

By this logic, passing 'const struct foo *' doesn't have any point or value for any function. But we know that this is done (and thank you for that) in many cases -- good style, self-documentation and some protection. E.g. here:

,--- I/Alex (Tue, 13 Dec 2011 07:55:45 -0500) *
| Compare:
|
|     int PQntuples(const PGresult *res)
|
| Reasonable: doesn't modify 'res'.
`-*

BTW, I have not submitted the context differences, as suggested, only because of extreme overload at work and the need to do a careful caller and documentation analysis. I still hope to be able to do it in the reasonably near future.

-- Alex -- alex-goncha...@comcast.net --
Re: [HACKERS] Add SPI results constants available for PL/*
2012/1/10 Andrew Dunstan :
> On 01/10/2012 12:34 PM, Pavel Stehule wrote:
>>> Actually, now I look closer I see that PLPerl passes back a stringified
>>> status from SPI_execute(), so there is no great need for setting up
>>> these constants. It's probably water under the bridge now, but maybe
>>> PLPython should have done this too.
>>
>> This is not documented well - I see nothing about the result value in
>> the docs. Does it raise an exception when SPI returns some bad result
>> value?
>
> The docs state:
>
>     You can then access the command status (e.g., SPI_OK_INSERT) like this:
>
>     $res = $rv->{status};
>
> And it works like this:
>
> andrew=# do 'my $rv = spi_exec_query("select 1 as a");
> elog(NOTICE,$rv->{status});' language plperl;
> NOTICE: SPI_OK_SELECT
> CONTEXT: PL/Perl anonymous code block
> DO
>
> An error causes the function to end, so it never sees the error status:
>
> andrew=# do 'my $rv = spi_exec_query("select blurfl");
> elog(NOTICE,$rv->{status});' language plperl;
> ERROR: column "blurfl" does not exist at line 1.
> CONTEXT: PL/Perl anonymous code block
>
> If you think more documentation is needed, submit a patch.

I was blind, I am sorry -- I missed the explicit note that the command status is a string. That is all.

Regards

Pavel
Re: [HACKERS] Add SPI results constants available for PL/*
On 01/10/2012 12:34 PM, Pavel Stehule wrote:
>> Actually, now I look closer I see that PLPerl passes back a stringified
>> status from SPI_execute(), so there is no great need for setting up
>> these constants. It's probably water under the bridge now, but maybe
>> PLPython should have done this too.
>
> This is not documented well - I see nothing about result value in doc.
> Does it raise exception when SPI returns some bad result value?

The docs state:

    You can then access the command status (e.g., SPI_OK_INSERT) like this:

    $res = $rv->{status};

And it works like this:

andrew=# do 'my $rv = spi_exec_query("select 1 as a");
elog(NOTICE,$rv->{status});' language plperl;
NOTICE: SPI_OK_SELECT
CONTEXT: PL/Perl anonymous code block
DO
andrew=#

An error causes the function to end, so it never sees the error status:

andrew=# do 'my $rv = spi_exec_query("select blurfl");
elog(NOTICE,$rv->{status});' language plperl;
ERROR: column "blurfl" does not exist at line 1.
CONTEXT: PL/Perl anonymous code block
andrew=#

If you think more documentation is needed, submit a patch.

cheers

andrew
Re: [HACKERS] pgsphere
On Sun, 2012-01-08 22:19:53 +0400, Oleg Bartunov wrote:
> pgsphere now could benefit very much from our KNNGiST feature. This is
> number one development from my point of view. I and Teodor have no
> time to work on pgsphere, sorry. But, there are some astronomers I'm
> working with, who can take part in this. Sergey Karpov has done
> extensive benchmarks of q3c, rtree and pgsphere and found the latter
> still has some benefits in some workloads, so we are interested in
> development.

Could the PostGIS stuff be abused for stellar coordinates?

MfG, JBG

--
Jan-Benedict Glaw   jbg...@lug-owl.de   +49-172-7608481
Signature of: Träume nicht von Deinem Leben: Lebe Deinen Traum! ("Don't dream your life: live your dream!")
Re: [HACKERS] Sending notifications from the master to the standby
On Tue, Jan 10, 2012 at 4:55 PM, Tom Lane wrote:
> Simon Riggs writes:
>> On Tue, Jan 10, 2012 at 5:00 AM, Tom Lane wrote:
>>> It might be a bit tricky to get walreceivers to inject
>>> the data into the slave-side ring buffer at the right time, ie, not
>>> until after the commit a given message describes has been replayed;
>>> but I don't immediately see a reason to think that's infeasible.
>
>> [ Simon sketches a design for that ]
>
> Seems a bit overcomplicated. I was just thinking of having walreceiver
> note the WAL endpoint at the instant of receipt of a notify message,
> and not release the notify message to the slave ring buffer until WAL
> replay has advanced that far. You'd need to lay down ground rules about
> how the walsender times the insertion of notify messages relative to
> WAL in its output.

You have to store the messages somewhere until they're needed. If that somewhere isn't on the standby, very close to the Startup process, then it's going to be very slow. Putting a marker in the WAL stream guarantees arrival order. The hash table was just a place to store them until they're needed; it could be a ring buffer as well.

Inserts into the slave ring buffer already have an xid on them, so the test will probably already cope with messages inserted but for which the parent xid has not committed. The only problem is coping with possible out-of-sequence messages.

> But I don't see the need for either explicit markers
> in the WAL stream or a hash table. Indeed, a hash table scares me
> because it doesn't clearly guarantee that notifies will be released in
> arrival order.

The hash table is clearly not the thing providing an arrival-order guarantee; it was just a cache. You have a few choices: (1) you either send the message while holding an exclusive lock, or (2) you send them as they come and buffer them, then reorder them using the WAL log sequence, since that matches the original commit sequence.
Or (3) add a sequence number to the messages sent by WALSender, so that the WALReceiver can buffer them locally and insert them in the correct order into the normal ring buffer -- so in (3) the message sequence and the WAL sequence match, but the mechanism is different.

(1) is out because the purpose of offloading to the standby is to give the master more capacity. If we slow it down in order to serve the standby, we're doing things the wrong way around.

I was choosing (2); maybe you prefer (3) or another design entirely. They look very similar to me and about the same complexity; it's just copying data and preserving sequence.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add SPI results constants available for PL/*
2012/1/10 Andrew Dunstan :
> On 01/03/2012 09:11 PM, Andrew Dunstan wrote:
>> On 01/03/2012 08:40 PM, Robert Haas wrote:
>>> On Tue, Jan 3, 2012 at 9:33 AM, Pavel Stehule wrote:
>>>>> I'd suppose it interesting to add a table to pg_catalog containing
>>>>> this data.
>>>> - it is useless overhead
>>>
>>> I tend to agree.
>>>
>>>> I am expecting so definition some constants in Perl, Python is simple
>>>
>>> Presumably one could instead write a script to transform the list of
>>> constants into a .pm file that could be loaded into the background, or
>>> whatever PL/python's equivalent of that concept is. Not sure if
>>> there's a better way to do it.
>>
>> Yeah, I'm with you and Pavel. Here's my quick perl one-liner to produce
>> a set of SPI_* constants for pl/perl. I'm looking at the best way to
>> include this in the bootstrap code.
>>
>> perl -ne 'BEGIN { print "use constant\n{\n"; } END { print "};\n"; }
>> print "\t$1 => $2,\n" if /#define (SPI_\S+)\s+\(?(-?\d+)\)?/;'
>> src/include/executor/spi.h
>
> Actually, now I look closer I see that PLPerl passes back a stringified
> status from SPI_execute(), so there is no great need for setting up
> these constants. It's probably water under the bridge now, but maybe
> PLPython should have done this too.

This is not documented well - I see nothing about the result value in the docs. Does it raise an exception when SPI returns some bad result value?

Regards

Pavel
Re: [HACKERS] Add SPI results constants available for PL/*
On 01/03/2012 09:11 PM, Andrew Dunstan wrote:
> On 01/03/2012 08:40 PM, Robert Haas wrote:
>> On Tue, Jan 3, 2012 at 9:33 AM, Pavel Stehule wrote:
>>>> I'd suppose it interesting to add a table to pg_catalog containing
>>>> this data.
>>> - it is useless overhead
>>
>> I tend to agree.
>>
>>> I am expecting so definition some constants in Perl, Python is simple
>>
>> Presumably one could instead write a script to transform the list of
>> constants into a .pm file that could be loaded into the background, or
>> whatever PL/python's equivalent of that concept is. Not sure if
>> there's a better way to do it.
>
> Yeah, I'm with you and Pavel. Here's my quick perl one-liner to produce
> a set of SPI_* constants for pl/perl. I'm looking at the best way to
> include this in the bootstrap code.
>
> perl -ne 'BEGIN { print "use constant\n{\n"; } END { print "};\n"; }
> print "\t$1 => $2,\n" if /#define (SPI_\S+)\s+\(?(-?\d+)\)?/;'
> src/include/executor/spi.h

Actually, now I look closer I see that PLPerl passes back a stringified status from SPI_execute(), so there is no great need for setting up these constants. It's probably water under the bridge now, but maybe PLPython should have done this too.

cheers

andrew
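For readers not fluent in perl one-liners, the extraction the command above performs can be sketched in Python; the pattern is the same one used in the one-liner, while the sample header text here is made up for illustration, not copied from spi.h:

```python
import re

# Same capture pattern as the perl one-liner: grab SPI_* names and their
# (possibly parenthesized, possibly negative) integer values.
PATTERN = re.compile(r'#define (SPI_\S+)\s+\(?(-?\d+)\)?')

# Hypothetical header fragment, for illustration only.
sample_header = """
#define SPI_ERROR_CONNECT       (-1)
#define SPI_OK_SELECT           5
"""

def to_perl_constants(text):
    """Emit a Perl 'use constant { ... };' block from #define lines."""
    lines = ["use constant", "{"]
    for name, value in PATTERN.findall(text):
        lines.append("\t%s => %s," % (name, value))
    lines.append("};")
    return "\n".join(lines)

print(to_perl_constants(sample_header))
```

Run against the real src/include/executor/spi.h, this would produce one `NAME => value,` entry per SPI_* define, matching the one-liner's output.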
Re: [HACKERS] libpq: PQcmdStatus, PQcmdTuples signatures can be painlessly improved
On tis, 2011-12-13 at 07:55 -0500, Alex Goncharov wrote:
>   char *PQcmdStatus(PGresult *res);
>   char *PQcmdTuples(PGresult *res);
>
> Unreasonable:
>
>   a. What, these two can modify 'res' I pass in?..
>
>   b. Oh, yes, because they return 'char *' pointing to
>      'res->cmdStatus+n', so, a libpq user may write:
>
>        char *s = PQcmdStatus(res);
>        *s = 'x';
>
>      and have 'res' modified. (Would be the user's fault, of course.)

Note that const PGresult * would only warn against changing the fields of the PGresult struct. It doesn't do anything about changing the data pointed to by pointers in the PGresult struct. So what you are saying doesn't follow.
Re: [HACKERS] Sending notifications from the master to the standby
Simon Riggs writes:
> On Tue, Jan 10, 2012 at 5:00 AM, Tom Lane wrote:
>> It might be a bit tricky to get walreceivers to inject
>> the data into the slave-side ring buffer at the right time, ie, not
>> until after the commit a given message describes has been replayed;
>> but I don't immediately see a reason to think that's infeasible.

> [ Simon sketches a design for that ]

Seems a bit overcomplicated. I was just thinking of having walreceiver note the WAL endpoint at the instant of receipt of a notify message, and not release the notify message to the slave ring buffer until WAL replay has advanced that far.

You'd need to lay down ground rules about how the walsender times the insertion of notify messages relative to WAL in its output. But I don't see the need for either explicit markers in the WAL stream or a hash table. Indeed, a hash table scares me because it doesn't clearly guarantee that notifies will be released in arrival order.

> Suggest we add something to initial handshake from standby to say
> "please send me notify traffic",

+1 on that.

			regards, tom lane
Re: [HACKERS] Sending notifications from the master to the standby
On Tue, Jan 10, 2012 at 12:56 PM, Joachim Wieland wrote:
> I chose to do it this way because it seemed like the most natural way
> to do it (which of course doesn't mean it's the best) :-).

If it's any consolation, it's exactly how I would have done it too, up until about two months ago, and I remember discussing almost exactly the design you presented with someone in Rome last year.

Anyway, it's a good feature, so I hope you have time.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pgsphere
I think, Dave, you should fork the project; I suggest a different name, pgsphere-2, to avoid confusion. github would be ok. Teodor and I are really busy guys, so I don't believe we could participate much, except for discussion and testing. We implemented KNNGiST; you add neighbourhood search support to pgsphere :)

Oleg

On Tue, 10 Jan 2012, Dave Cramer wrote:
> Hi Oleg,
>
> On Sun, Jan 8, 2012 at 1:19 PM, Oleg Bartunov wrote:
>> Dave,
>>
>> The situation with pgshpere is so, that I think we need new developer, since
>> Janko keep silence :) I wrote him several time, since I wanted
>> pgsphere now could benefit very much from our KNNGiST feature. This is
>> number one development from my point of view. I and Teodor have no
>> time to work on pgsphere, sorry. But, there are some astronomers I'm working
>> with, who can take part in this. Sergey Karpov has done extensive benchmarks
>> of q3c, rtree and pgsphere and found the latter still has some benefits
>> in some workload, so we are interesting in development.
>>
>> Regards,
>> Oleg
>
> So where do we go from here ?
>
> Dave Cramer
> dave.cramer(at)credativ(dot)ca
> http://www.credativ.ca
>
>> On Fri, 6 Jan 2012, Andrew Dunstan wrote:
>>> On 01/06/2012 12:32 PM, Dave Cramer wrote:
>>>> I've been asked by someone to support pgshpere.
>>>>
>>>> It would appear that the two project owners are MIA. If anyone knows
>>>> different can they let me know ?
>>>>
>>>> Does anyone have any objection to me taking over the project?
>>>
>>> One of the owners is Teodor, who is a core committer ... I hope he's not
>>> MIA.
>>>
>>> cheers
>>>
>>> andrew

Regards,
Oleg
_____
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Re: [HACKERS] pgsphere
On 01/10/2012 09:04 AM, Dave Cramer wrote:
> So where do we go from here ?

First, please note that -hackers is not the right place for this discussion. pgsphere is a pgfoundry project, which is not the province of -hackers.

As I suggested, the best solution is for Teodor to add you as a project admin. Oleg, could you please follow this up? Speaking with my pgfoundry admin hat on, I am extremely reluctant to take control of a project away from its owners. If you don't get any action in a week or so, you can approach the pgfoundry admins for help. In the meantime you might want to fork the code onto github or bitbucket.

cheers

andrew
Re: [HACKERS] LWLOCK_STATS
On Tue, Jan 10, 2012 at 3:16 AM, Simon Riggs wrote:
> So benchmarking write-heavy workloads and separately benchmarking
> read-only workloads is more representative.

Absolutely. High-write-activity applications are much more difficult to optimize with simple tricks like client-side caching. Also, storage is finally moving out of the dark ages, so high write-transaction-rate servers are no longer necessarily I/O-bound on reasonable hardware.

merlin
Re: [HACKERS] pgsphere
Hi Oleg,

On Sun, Jan 8, 2012 at 1:19 PM, Oleg Bartunov wrote:
> Dave,
>
> The situation with pgshpere is so, that I think we need new developer, since
> Janko keep silence :) I wrote him several time, since I wanted
> pgsphere now could benefit very much from our KNNGiST feature. This is
> number one development from my point of view. I and Teodor have no
> time to work on pgsphere, sorry. But, there are some astronomers I'm working
> with, who can take part in this. Sergey Karpov has done extensive benchmarks
> of q3c, rtree and pgsphere and found the latter still has some benefits
> in some workload, so we are interesting in development.
>
> Regards,
> Oleg

So where do we go from here ?

Dave Cramer
dave.cramer(at)credativ(dot)ca
http://www.credativ.ca

> On Fri, 6 Jan 2012, Andrew Dunstan wrote:
>> On 01/06/2012 12:32 PM, Dave Cramer wrote:
>>> I've been asked by someone to support pgshpere.
>>>
>>> It would appear that the two project owners are MIA. If anyone knows
>>> different can they let me know ?
>>>
>>> Does anyone have any objection to me taking over the project?
>>
>> One of the owners is Teodor, who is a core committer ... I hope he's not
>> MIA.
>>
>> cheers
>>
>> andrew
>
> Regards,
> Oleg
> _____
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
Re: [HACKERS] Sending notifications from the master to the standby
On Tue, Jan 10, 2012 at 12:00 AM, Tom Lane wrote:
> So this design is non-optimal both for existing uses and for the
> proposed new uses, which means nobody will like it. You could
> ameliorate #1 by adding a GUC that determines whether NOTIFY actually
> writes WAL, but that's pretty ugly. In any case ISTM that problem #2
> means this design is basically broken.

I chose to do it this way because it seemed like the most natural way to do it (which of course doesn't mean it's the best) :-). I agree that there should be a way to avoid the replication of the NOTIFYs.

Regarding your second point though, remember that on the master we write notifications to the queue in pre-commit. And we also don't interleave notifications of different transactions. So once the commit record makes it to the standby, all the notifications are already there, just as on the master. In a burst of notifications, both solutions should more or less behave the same way, but yes, the one involving the WAL file would be slower as it goes to the file system and back.

> I wonder whether it'd be practical to not involve WAL per se in this
> at all, but to transmit NOTIFY messages by having walsender processes
> follow the notify stream (as though they were listeners) and send the
> notify traffic as a separate message stream interleaved with the WAL
> traffic.

Agreed, having walsender/receiver work as NOTIFY proxies is kinda smart...

Joachim
[HACKERS] [v9.2] Add GUC sepgsql.client_label
This patch adds a new GUC, "sepgsql.client_label", that allows a client process to switch its privileges to another set, as long as the system security policy permits the transition. To support this feature, I ported two permissions from the "process" class of SELinux: "setcurrent" and "dyntransition". The first checks whether the client has the right to switch its privilege at all; the other checks a particular transition path from X to Y.

This feature might seem to break an assumption of sepgsql's security model. However, a single-directed domain transition from a bigger-privileged domain to a smaller-privileged one at the user's request is also supported in the operating system, and it is a useful feature for restricting an application's capabilities at the beginning of a session.

A few weeks ago, I got a requirement from Joshua Brindle. He is working on a web application that uses a CAC (Common Access Card) [*1] for its authentication, and wanted to integrate its security credential with the security label of selinux/sepgsql. One problem was that the system environment could not use labeled networking (IPsec), so switching the security label of processes on the web-server side was not an option. The other solution is to port the dynamic-transition feature into sepgsql, as an analogy of the operating system.

An expected scenario is below: The web server runs in the WEBSERV domain. It is allowed to connect to PostgreSQL, and also allowed to invoke a trusted procedure that takes as an argument the security credential within the CAC, but nothing else is allowed. The trusted procedure is allowed to reference a table mapping security credentials to the security labels to be assigned, and then it switches the security label of the client to CLIENT_n. CLIENT_n is allowed to access tables, functions and other objects according to the security policy, and is also allowed to reset "sepgsql.client_label" to revert to WEBSERV. However, the client cannot switch to any other domain without the security credential stored within the CAC card.

Joshua and I agreed this scenario is reasonable and secure, so we'd like to propose this new feature for the v9.2 timeline.

Thanks,

[*1] CAC - Common Access Card
     http://en.wikipedia.org/wiki/Common_Access_Card

--
KaiGai Kohei

Attachment: pgsql-v9.2-guc-sepgsql.client_label.v1.patch (binary data)
Re: [HACKERS] [PATCH] Allow breaking out of hung connection attempts
On 09.01.2012 15:49, Ryan Kelly wrote:
> On Mon, Jan 09, 2012 at 10:35:50AM +0200, Heikki Linnakangas wrote:
>> That assumes that it's safe to longjmp out of PQconnectdbParams at any
>> instant. It's not.
>
> I'm guessing because it could result in a resource leak?

Yes, and other unfinished business, too.

>> I think you'd need to use the asynchronous connection functions
>> PQconnectStartParams() and PQconnectPoll(), and select().
>
> New patch attached.

Thanks, some comments:

* Why do you need the timeout?

* If a SIGINT arrives before you set sigint_interrupt_enabled, it just sets cancel_pressed but doesn't jump out of the connection attempt. You need to explicitly check cancel_pressed after setting sigint_interrupt_enabled to close that race condition.

* You have to reinitialize the fd mask with FD_ZERO/SET before each call to select(). select() modifies the mask.

* In case of PGRES_POLLING_WRITING, you have to wait until the socket becomes writable, not readable.

Attached is a new version that fixes those.

There's one caveat in the libpq docs about PQconnectStart/PQconnectPoll:

    The connect_timeout connection parameter is ignored when using
    PQconnectPoll; it is the application's responsibility to decide
    whether an excessive amount of time has elapsed. Otherwise,
    PQconnectStart followed by a PQconnectPoll loop is equivalent to
    PQconnectdb.

So after this patch, connect_timeout will be ignored in \connect. That probably needs to be fixed. You could incorporate a timeout fairly easily into the select() calls, but unfortunately there's no easy way to get the connect_timeout value. You could parse the connection string the user gave with PQconninfoParse(), but the effective timeout setting could come from a configuration file, too. Not sure what to do about that. If there was a PQconnectTimeout(conn) function, similar to PQuser(conn), PQhost(conn) et al, you could use that.
Maybe we should add that, or even better, a generic function that could be used to return not just connect_timeout, but all the connection options in effect in a connection. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 69fac83..135d022 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -1515,8 +1515,8 @@ param_is_newly_set(const char *old_val, const char *new_val) static bool do_connect(char *dbname, char *user, char *host, char *port) { - PGconn *o_conn = pset.db, - *n_conn; + PGconn *o_conn = pset.db; + PGconn *n_conn = NULL; char *password = NULL; if (!dbname) @@ -1570,14 +1570,67 @@ do_connect(char *dbname, char *user, char *host, char *port) keywords[7] = NULL; values[7] = NULL; - n_conn = PQconnectdbParams(keywords, values, true); + /* attempt connection asynchronously */ + n_conn = PQconnectStartParams(keywords, values, true); + + if (sigsetjmp(sigint_interrupt_jmp, 1) != 0) + { + /* interrupted during connection attempt */ + PQfinish(n_conn); + n_conn = NULL; + } + else + { + while (true) + { +int poll_res; +int rc; +fd_set read_mask, + write_mask; + +poll_res = PQconnectPoll(n_conn); +if (poll_res == PGRES_POLLING_OK || + poll_res == PGRES_POLLING_FAILED) +{ + break; +} + +FD_ZERO(&read_mask); +FD_ZERO(&write_mask); + +if (poll_res == PGRES_POLLING_READING) + FD_SET(PQsocket(n_conn), &read_mask); +if (poll_res == PGRES_POLLING_WRITING) + FD_SET(PQsocket(n_conn), &write_mask); + +sigint_interrupt_enabled = true; +if (cancel_pressed) +{ + /* interrupted during connection attempt */ + PQfinish(n_conn); + n_conn = NULL; + sigint_interrupt_enabled = false; + break; +} +rc = select(PQsocket(n_conn) + 1, + &read_mask, &write_mask, NULL, + NULL); +sigint_interrupt_enabled = false; + +if (rc < 0 && errno != EINTR) + break; + } + } free(keywords); free(values); /* We can immediately discard the password -- no longer needed */ if (password) + { free(password); + 
password = NULL; + } if (PQstatus(n_conn) == CONNECTION_OK) break; @@ -1586,7 +1639,7 @@ do_connect(char *dbname, char *user, char *host, char *port) * Connection attempt failed; either retry the connection attempt with * a new password, or give up. */ - if (!password && PQconnectionNeedsPassword(n_conn) && pset.getPassword != TRI_NO) + if (PQconnectionNeedsPassword(n_conn) && pset.getPassword != TRI_NO) { PQfinish(n_conn); password = prompt_for_password(user); @@ -1600,7 +1653,8 @@ do_connect(char *dbname, char *user, char *host, char *port) */ if (pset.cur_cmd_interactive) { - psql_error("%s", PQerrorMessage(n_conn)); + if (n_conn) +psql_error("%s", PQerrorMessage(n_conn)); /* pset.db is left unmodified */ if (o_conn) @@ -1608,7 +1662,9 @@ do_connect(char *dbname, char *user, char
Re: [HACKERS] Page Checksums
On 10/01/12 09:07, Simon Riggs wrote:
>> You can repeat that argument ad infinitum. Even if the CRC covers all the
>> pages in the OS buffer cache, it still doesn't cover the pages in the
>> shared_buffers, CPU caches, in-transit from one memory bank to another etc.
>> You have to draw the line somewhere, and it seems reasonable to draw it
>> where the data moves between long-term storage, ie. disk, and RAM.
>
> We protect each change with a CRC when we write WAL, so doing the same
> thing doesn't sound entirely unreasonable, especially if your database
> fits in RAM and we aren't likely to be doing I/O anytime soon. The
> long term storage argument may no longer apply in a world with very
> large memory.

I'm not so sure about that. Our experience is that storage and memory don't grow as fast as demand. Maybe we are in a minority, but at Jane Street memory size < database size is sadly true for most of the important databases.

Concretely, the two most important databases are 715 GB and 473 GB in size (the second used to be much closer to the first in size, but we recently archived a lot of data). In both databases a small set of tables uses the majority of the disk space. Those tables are also the most heavily used. Typically the size of one of those tables is 1-3x the size of memory, and the cumulative size of all indexes on the table is normally roughly the same as the table itself.

Cheers,
Bene
Re: [HACKERS] LWLOCK_STATS
On Tue, Jan 10, 2012 at 12:24 AM, Jim Nasby wrote:
> IIRC, pg_bench is *extremely* write-heavy. There's probably not that many
> systems that operate that way. I suspect that most OLTP systems read more
> than they write, and some probably have as much as a 10-1 ratio.

IMHO the main PostgreSQL design objective is handling a flexible, general-purpose, 100% write workload, which is why Hot Standby and LISTEN/NOTIFY are so important as mechanisms for offloading read traffic to other places, so we can scale the total solution beyond one node without giving up the power of SQL.

So benchmarking write-heavy workloads and separately benchmarking read-only workloads is more representative.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Generate call graphs in run-time
Is this only to print out the stack upon errors? It looks like the stack is in the variable error_context_stack. Is it always available, containing all the parent functions, even when there is no error? Can I reach it from within pgstat.c?

2012/1/10 Jim Nasby:
> On Jan 9, 2012, at 2:08 PM, Joel Jacobson wrote:
>> Generates call graphs of function calls within a transaction in run-time.
>
> Related to this... we had Command Prompt write a function for us that
> would spit out the complete call-graph of the current call stack whenever
> it was called. Alvaro didn't need to add any hooks to the backend to
> accomplish this, so it would seem that the call stack is already available.
> That might simplify what you're doing.
>
> I don't see this posted on pgFoundry yet, so I've attached it.
> --
> Jim C. Nasby, Database Architect j...@nasby.net
> 512.569.9461 (cell) http://jim.nasby.net

--
Joel Jacobson
Trustly
+46703603801
https://trustly.com
Re: [HACKERS] Page Checksums
On Tue, Jan 10, 2012 at 8:04 AM, Heikki Linnakangas wrote:
> On 10.01.2012 02:12, Jim Nasby wrote:
>> Filesystem CRCs very likely will not happen to data that's in the cache.
>> For some users, that's a huge amount of data to leave un-protected.
>
> You can repeat that argument ad infinitum. Even if the CRC covers all the
> pages in the OS buffer cache, it still doesn't cover the pages in the
> shared_buffers, CPU caches, in-transit from one memory bank to another etc.
> You have to draw the line somewhere, and it seems reasonable to draw it
> where the data moves between long-term storage, ie. disk, and RAM.

We protect each change with a CRC when we write WAL, so doing the same thing doesn't sound entirely unreasonable, especially if your database fits in RAM and we aren't likely to be doing I/O anytime soon. The long-term storage argument may no longer apply in a world with very large memory.

The question is, when exactly would we check the checksum? When we lock the block, or when we pin it? We certainly can't do it on every access to the block, since we don't even track where that happens in the code. I think we could add an option to check the checksum immediately after we pin a block for the first time, but it would be very expensive, and it sounds like we're re-inventing hardware or OS features again. Work on a 50% performance drain, as an estimate. That is a level of protection no other DBMS offers, so it is either an advantage or a warning.

Jim, if you want this, please do the research and work out what the probability of losing shared-buffer data in your ECC RAM really is, so we are doing it for quantifiable reasons (via the old Google memory academic paper) and to verify that the cost/benefit means you would actually use it if we built it. Research into requirements is at least as important and time-consuming as research on possible designs.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Sending notifications from the master to the standby
On Tue, Jan 10, 2012 at 5:00 AM, Tom Lane wrote:
> Joachim Wieland writes:
>> [ send NOTIFYs to slaves by means of: ]

Good idea.

> I wonder whether it'd be practical to not involve WAL per se in this
> at all, but to transmit NOTIFY messages by having walsender processes
> follow the notify stream (as though they were listeners) and send the
> notify traffic as a separate message stream interleaved with the WAL
> traffic. We already have, as of a few days ago, the concept of
> additional traffic in the walsender stream besides the WAL data itself,
> so adding notify traffic as another message type should be
> straightforward.

Also a good idea.

> It might be a bit tricky to get walreceivers to inject
> the data into the slave-side ring buffer at the right time, ie, not
> until after the commit a given message describes has been replayed;
> but I don't immediately see a reason to think that's infeasible.

When a transaction commits, it would use a full-size commit record and set a (new) flag in xl_xact_commit.xinfo to show the commit is paired with notify traffic. Get the messages in walreceiver.c XLogWalRcvProcessMsg() and put them in a shared hash table. Messages would need to contain the xid of the notifying transaction and other info needed for LISTEN. When we hit xact.c xact_redo_commit() on the standby, we'd check the hash table for messages if the notify flag is set and execute the normal notify code as if the NOTIFY had run locally on the standby. We can sweep the hash table clean of any old messages each time we run ProcArrayApplyRecoveryInfo().

Add a new message type to walprotocol.h. Message code 'L' appears to be available. I suggest we add something to the initial handshake from the standby to say "please send me notify traffic", which we can link to a parameter that defines the size of standby_notify_buffer. We don't want all standbys to receive such traffic unless they really want it, and pg_basebackup probably doesn't want it either.
If you wanted to get really fancy, you could send only some of the traffic to each standby based on a hash or round-robin algorithm, so we can spread the listeners across multiple standbys.

I'll be your reviewer, if you want.

> Going in this direction would mean that slave-side LISTEN only works
> when using walsender/walreceiver, and not with old-style log shipping.
> But personally I don't see a problem with that. If you're trying to
> LISTEN you probably want pretty up-to-date data anyway.

Which fits the expected use case also.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Page Checksums
On 10.01.2012 02:12, Jim Nasby wrote:
> Filesystem CRCs very likely will not happen to data that's in the cache.
> For some users, that's a huge amount of data to leave un-protected.

You can repeat that argument ad infinitum. Even if the CRC covers all the pages in the OS buffer cache, it still doesn't cover the pages in shared_buffers, CPU caches, in transit from one memory bank to another, etc. You have to draw the line somewhere, and it seems reasonable to draw it where the data moves between long-term storage, ie. disk, and RAM.

> Filesystem bugs do happen... though presumably most of those would be
> caught by the filesystem's CRC check... but you never know!

Yeah. At some point we just have to have faith in the underlying system. It's reasonable to provide protection, or make recovery easier, from bugs or hardware faults that happen fairly often in the real world, but a can't-trust-no-one attitude is not very helpful.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com