[HACKERS] pg_dump bug in 9.4beta2 and HEAD
I'm seeing an assertion failure with pg_dump -c --if-exists, which does not seem ready to handle BLOBs:

pg_dump: pg_backup_archiver.c:472: RestoreArchive: Assertion `mark != ((void *)0)' failed.

To reproduce:

$ createdb test
$ pg_dump -c --if-exists test      (works, dumps empty database)
$ psql test -c "select lo_create(1);"
$ pg_dump -c --if-exists test      (fails with the above assertion)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allowing parallel pg_restore from pipe
On Wed, Apr 24, 2013 at 4:05 PM, Stefan Kaltenbrunner <ste...@kaltenbrunner.cc> wrote:
>> What might make sense is something like pg_dump_restore which would have no intermediate storage at all, just pump the data etc from one source to another in parallel. But I pity the poor guy who has to write it :-)
>
> hmm pretty sure that Joachim's initial patch for parallel dump actually had a PoC for something very similar to that...

That's right, I implemented that as a separate output format and named it "migrator" I think, which wouldn't write each stream to a file as the directory output format does, but instead pumps it back into a restore client. Actually I think the logic was even reversed: it was a parallel restore that got the data from internally calling pg_dump functionality instead of from reading files... The neat thing about this approach was that the order was optimized and correct, i.e. the largest tables start first and dependencies get resolved in the right order. I could revisit that patch for 9.4 if enough people are interested.

Joachim
Re: [HACKERS] Optimizing pglz compressor
On Tue, Mar 5, 2013 at 8:32 AM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> With these tweaks, I was able to make pglz-based delta encoding perform roughly as well as Amit's patch.

Out of curiosity, do we know how pglz compares with other algorithms, e.g. lz4?
[HACKERS] Materialized view assertion failure in HEAD
I'm getting an assertion failure in HEAD with materialized views, see below for a backtrace. To reproduce, just run make installcheck, dump the regression database and then restore it; the server crashes during restore.

(gdb) bt
#0  0x7f283a366425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f283a369b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00888429 in ExceptionalCondition (conditionName=0xa0c840 "!(!matviewRel->rd_rel->relhasoids)", errorType=0xa0c78a "FailedAssertion", fileName=0xa0c780 "matview.c", lineNumber=135) at assert.c:54
#3  0x005dbfc4 in ExecRefreshMatView (stmt=0x1b70a28, queryString=0x1b6ff98 "REFRESH MATERIALIZED VIEW tvmm;\n\n\n", params=0x0, completionTag=0x7fff47af0a60 "") at matview.c:135
#4  0x007758a5 in standard_ProcessUtility (parsetree=0x1b70a28, queryString=0x1b6ff98 "REFRESH MATERIALIZED VIEW tvmm;\n\n\n", params=0x0, dest=0x1b70da0, completionTag=0x7fff47af0a60 "", context=PROCESS_UTILITY_TOPLEVEL) at utility.c:1173
#5  0x00773e3f in ProcessUtility (parsetree=0x1b70a28, queryString=0x1b6ff98 "REFRESH MATERIALIZED VIEW tvmm;\n\n\n", params=0x0, dest=0x1b70da0, completionTag=0x7fff47af0a60 "", context=PROCESS_UTILITY_TOPLEVEL) at utility.c:341
#6  0x00772d7e in PortalRunUtility (portal=0x1aeb6d8, utilityStmt=0x1b70a28, isTopLevel=1 '\001', dest=0x1b70da0, completionTag=0x7fff47af0a60 "") at pquery.c:1185
#7  0x00772f56 in PortalRunMulti (portal=0x1aeb6d8, isTopLevel=1 '\001', dest=0x1b70da0, altdest=0x1b70da0, completionTag=0x7fff47af0a60 "") at pquery.c:1317
#8  0x00772481 in PortalRun (portal=0x1aeb6d8, count=9223372036854775807, isTopLevel=1 '\001', dest=0x1b70da0, altdest=0x1b70da0, completionTag=0x7fff47af0a60 "") at pquery.c:814
#9  0x0076c155 in exec_simple_query (query_string=0x1b6ff98 "REFRESH MATERIALIZED VIEW tvmm;\n\n\n") at postgres.c:1048
#10 0x00770517 in PostgresMain (argc=2, argv=0x1ab4bb0, username=0x1ab4a88 "joe") at postgres.c:3969
#11 0x0070ccec in BackendRun (port=0x1ae5ac0) at postmaster.c:3989
#12 0x0070c401 in BackendStartup (port=0x1ae5ac0) at postmaster.c:3673
#13 0x00708ce6 in ServerLoop () at postmaster.c:1575
#14 0x00708420 in PostmasterMain (argc=3, argv=0x1ab2420) at postmaster.c:1244
#15 0x006704f7 in main (argc=3, argv=0x1ab2420) at main.c:197
Re: [HACKERS] Materialized view assertion failure in HEAD
On Tue, Mar 5, 2013 at 9:11 AM, Kevin Grittner <kgri...@ymail.com> wrote:
> Will investigate. You don't have default_with_oids = on, do you?

No, this was a default install with a default postgresql.conf.
Re: [HACKERS] posix_fadvise missing in the walsender
On Wed, Feb 20, 2013 at 4:54 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> I agree with Merlin and Joachim - if we have the call in one place, we should have it in both.
>
> We might want to assess whether we even want to have it in one place. I've seen cases where the existing call hurts performance, because of WAL file recycling.

That's interesting, I hadn't thought about WAL recycling. I now agree that this whole thing is even more complicated: you might also have an archive_command set, like cp for instance, that reads the WAL file back in, possibly right after we called posix_fadvise on it.

It appears to me that the right strategy depends on a few factors:

a) what ratio of your active dataset fits into RAM?
b) how many WAL files do you have?
c) how long does it take for them to get recycled?
d) is archive_command set / are wal_senders active?

Recommendations for the two extremes would be: if your dataset fits mostly into RAM and you have only a few WAL files that get recycled quickly, then you don't want to evict the WAL file from the buffer cache. On the other hand, if your dataset doesn't fit into RAM and you have many WAL files that take a while to get recycled, then you should consider hinting to the OS.

If you're in that second category (I am) and you're also using archive_command, you could just piggyback the posix_fadvise call onto your archive_command, assuming that the walsender is already done with the file at that moment. And I'm also pretty certain that the setup Robert used for the write scalability tests fell into the first category. So given the above, I think it's possible to come up with benchmarks that prove whatever you want to prove :-)

Joachim
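The piggybacking idea could be sketched as a tiny C program used in place of a plain cp in archive_command: copy the segment, then hint the kernel that the source file won't be needed again. This is purely an illustrative sketch (the function name and error handling are made up, and it is not part of any patch discussed here):

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Copy src to dst, then ask the kernel to drop src from the page cache. */
static int
copy_and_drop(const char *src, const char *dst)
{
    char    buf[8192];
    ssize_t n;
    int     in = open(src, O_RDONLY);
    int     out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (in < 0 || out < 0)
        return 1;
    while ((n = read(in, buf, sizeof(buf))) > 0)
        if (write(out, buf, (size_t) n) != n)
            return 1;
    fsync(out);                 /* make the copy durable before dropping cache */
    /* Hint: we won't read src again soon; the kernel may evict its pages. */
    posix_fadvise(in, 0, 0, POSIX_FADV_DONTNEED);
    close(in);
    close(out);
    return 0;
}
```

Whether this actually helps still depends on the factors above; if the walsender (or another archiver) rereads the file right afterwards, the hint is wasted.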
Re: [HACKERS] posix_fadvise missing in the walsender
On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> In access/transam/xlog.c we give the OS buffer caching a hint that we won't need a WAL file any time soon with posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
>
> I agree with Merlin and Joachim - if we have the call in one place, we should have it in both.

You could argue that if it's considered beneficial in the case with no walsender, then you should definitely have it if there are walsenders around: the walsenders reopen and read those files, which gives the OS reason to believe that other processes might do the same in the near future, and hence that it should not evict those pages too early.

Joachim
[HACKERS] posix_fadvise missing in the walsender
In access/transam/xlog.c we give the OS buffer cache a hint that we won't need a WAL file any time soon with

posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);

before closing the WAL file, but only if we don't have walsenders. That's reasonable, because a walsender would reopen that same file shortly after. However, the walsender doesn't call posix_fadvise once it's done with the file, and I'm proposing to add this call to walsender.c as well, for consistency. Since there can be multiple walsenders, only the slowest one should call this function. The slowest walsender can be found by inspecting shared memory and looking at the sentPtr of each walsender. Any comments?
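The "only the slowest walsender" check could look roughly like the following. This is a simplified sketch with made-up types: the real shared-memory state lives in the WalSnd structs and must be read under the appropriate spinlocks, and 9.x-era XLogRecPtr is a two-field struct rather than a plain integer. It only illustrates the minimum-over-sentPtr idea:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the real type */

typedef struct WalSenderSlot
{
    bool       active;
    XLogRecPtr sentPtr;         /* how far this walsender has sent */
} WalSenderSlot;

/*
 * Return true if mySentPtr is the minimum among all active walsenders,
 * i.e. the caller is the slowest one and may safely issue
 * POSIX_FADV_DONTNEED for segments it has finished with.
 */
static bool
am_slowest_walsender(const WalSenderSlot *slots, int n, XLogRecPtr mySentPtr)
{
    for (int i = 0; i < n; i++)
    {
        if (slots[i].active && slots[i].sentPtr < mySentPtr)
            return false;       /* someone else still needs the file */
    }
    return true;
}
```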
Re: [HACKERS] Logical decoding exported base snapshot
On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> One problem I see is that while exporting a snapshot solves the visibility issues of the table's contents, it does not protect against schema changes. I am not sure whether that's a problem. If somebody runs a CLUSTER or something like that, the table's contents will be preserved, including MVCC semantics. That's fine. The more problematic cases I see are TRUNCATE, DROP and ALTER TABLE.

This is why the pg_dump master process executes a LOCK TABLE ... IN ACCESS SHARE MODE for every table, so your commands would all block. In fact it's even more complicated, because the pg_dump worker processes also need to lock the table: they try to get a similar lock in NOWAIT mode right before dumping the table. If they don't get the lock, that means somebody else is waiting for an exclusive lock (the case you describe) and the backup will fail.

> Problem 2: To be able to do a SET TRANSACTION SNAPSHOT the source transaction needs to be alive. That's currently solved by exporting the snapshot in the walsender connection that did the INIT_LOGICAL_REPLICATION. The question is how long should we preserve that snapshot?

You lost me on this one after the first sentence... But in general the snapshot isn't such a magical thing: as long as the exporting transaction is open, it guarantees that there is a transaction out there holding off vacuum from removing data, and that the snapshot as-is has existed at some time in the past. Once it is applied to another transaction, you have two transactions that hold off vacuum, because they share the same xmin/xmax values. You could also just end the exporting transaction at that point.

One thought at the time was to somehow integrate exported snapshots with the prepared transactions feature: you could export a snapshot, prepare the exporting transaction, and have that snapshot frozen forever; it would even survive a server restart.
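The "two transactions sharing the same xmin hold off vacuum" argument can be modeled in a few lines of C. This is a deliberately simplified illustration (real transaction IDs are 32-bit and wrap around, and the real horizon computation lives in GetOldestXmin and reads PGPROC under a lock); it only shows why importing a snapshot keeps the vacuum horizon in place even after the exporting transaction ends:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId 0

typedef struct Proc
{
    TransactionId xmin;         /* oldest xid this backend still considers running */
} Proc;

/*
 * Global xmin: the oldest xmin of any running transaction.  Vacuum may
 * only remove tuples deleted before this horizon.
 */
static TransactionId
global_xmin(const Proc *procs, int n)
{
    TransactionId oldest = InvalidTransactionId;

    for (int i = 0; i < n; i++)
    {
        if (procs[i].xmin != InvalidTransactionId &&
            (oldest == InvalidTransactionId || procs[i].xmin < oldest))
            oldest = procs[i].xmin;
    }
    return oldest;
}
```

When a second transaction imports the snapshot, it adopts the same xmin, so the horizon does not move even if the exporting transaction then commits.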
Re: [HACKERS] parallel pg_dump
On Sat, Dec 8, 2012 at 3:05 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Sat, Dec 8, 2012 at 11:13:30AM -0500, Andrew Dunstan wrote:
>> I am working on it when I get a chance, but keep getting hammered. I'd love somebody else to review it too.
>
> FYI, I will be posting pg_upgrade performance numbers using Unix processes. I will try to get the Windows code working but will also need help.

Just let me know if there's anything I can help you guys with.

Joachim
Re: [HACKERS] [PATCH] pg_dump: Sort overloaded functions in deterministic order
On Wed, Oct 17, 2012 at 5:43 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> (I tested the new pg_dump with 8.2 and HEAD and also verified it passes pg_upgrade's make check. I didn't test with other server versions.)

I also tested against 8.3 and 8.4, since 8.4 is the version that introduced pg_get_function_identity_arguments. The included testcase fails on 8.3 and succeeds on 8.4 (pg_dump succeeds in both cases of course, but the output is only ordered deterministically on 8.4+).
Re: [HACKERS] Add FET to Default and Europe.txt
On Mon, Oct 8, 2012 at 4:00 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> We can't just refuse to deal with this ambiguity. We have to have some very-low-pain way to install settings that will please those large fractions of our user base. Moreover, if that very-low-pain way isn't the exact same way it's been done for the last half dozen releases, you'll already have managed to annoy those selfsame large fractions. You'd better have a good reason for changing it.

As previously noted, there are two topics being discussed here:

- how to allow users to configure their own timezone abbreviations, and
- how to keep the timezone abbreviations that we ship up to date with respect to zic changes.

Regarding configurability, we currently allow users (i.e. administrators) to change their abbreviations to whatever they like. The Default file we provide has been taken from the set that used to be compiled in back in the 8.x days (and we had this egregious GUC australian_timezones that decided whether CST/EST referred to US or Australian timezones - there was no such hack for any of the other ambiguities). These timezone abbreviations have developed over several decades, most of them without global communication in mind. They are full of inconsistencies and irregularities, and IMHO any way of automatically selecting a proper set for someone is doomed to solve the problem only partially. Another funny ambiguity is that the EST abbreviation in Australia can mean both Eastern Standard Time and Eastern Summer Time, with different offsets depending on what month of the year you're in...

The fact that we don't hear more complaints about the sets actually suggests that most people are happy with our Default set. In fact, Marc could just add his FET timezone to his own installations and use it from there. His request to add it to our Default set, however, is perfectly valid and should be discussed.

One thing that could be improved about configurability: if you're not an administrator of the database, i.e. if you don't have write access to where the timezones are defined, you're pretty much out of luck defining your own abbreviations. This is the case in a shared hosting environment, for example.

As for updating the timezone abbreviation files, I guess what we need is a parser for the zic database, one for our own timezone files, and a script that compares the two datasets and reports any conflicts, so that we can clean them up after manual inspection of the differences. I will see if I can come up with some scripts in this direction.
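The core of such a comparison script is simple: for every abbreviation we ship, check whether zic knows it with a different offset. A minimal sketch of that check, with made-up sample data and a hypothetical TzAbbrev type (a real script would first have to parse both datasets):

```c
#include <assert.h>
#include <string.h>

typedef struct TzAbbrev
{
    const char *abbrev;
    int         utc_offset_secs;
} TzAbbrev;

/*
 * Count entries in 'ours' whose abbreviation also exists in 'zic' but
 * with a different offset -- these are the conflicts needing manual review.
 */
static int
count_conflicts(const TzAbbrev *ours, int n_ours,
                const TzAbbrev *zic, int n_zic)
{
    int conflicts = 0;

    for (int i = 0; i < n_ours; i++)
        for (int j = 0; j < n_zic; j++)
            if (strcmp(ours[i].abbrev, zic[j].abbrev) == 0 &&
                ours[i].utc_offset_secs != zic[j].utc_offset_secs)
            {
                conflicts++;
                break;          /* one conflict per abbreviation is enough */
            }
    return conflicts;
}
```

Ambiguous abbreviations such as EST would show up as conflicts and still need a human decision, which is exactly the manual-inspection step proposed above.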
[HACKERS] Review for pg_dump: Sort overloaded functions in deterministic order
Patch looks good; all concerns that had been raised previously have been addressed in this version of the patch. The only thing that IMO needs to change is a stylistic issue:

if (fout->remoteVersion >= 80200)
{
    [...]
    (fout->remoteVersion >= 80400) ? "pg_catalog.pg_get_function_identity_arguments(oid)" : "NULL::text",
    [...]
}

Please just create a whole new if (fout->remoteVersion >= 80400) { [...] } branch here. Other than that, the feature works as advertised and does not negatively affect runtime or memory consumption (the new field is only added to functions / aggregates). Please send a new version of the patch with the above-mentioned item changed; the patch also needs rebasing (it's off by a couple of lines).
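The suggested restructuring might look like the following sketch. The query strings here are abbreviated stand-ins, not the real pg_dump SQL; the point is one dedicated branch per server version instead of a ?: buried inside a single query string:

```c
#include <assert.h>
#include <string.h>

/* Pick the function-dumping query based on server version. */
static const char *
func_query_for_version(int remoteVersion)
{
    if (remoteVersion >= 80400)
        /* 8.4+ has pg_get_function_identity_arguments() */
        return "SELECT oid, pg_catalog.pg_get_function_identity_arguments(oid) FROM pg_proc";
    else if (remoteVersion >= 80200)
        return "SELECT oid, NULL::text FROM pg_proc";
    else
        return "SELECT oid FROM pg_proc";
}
```

This mirrors how pg_dump generally handles version differences: each version cutoff gets its own branch, so a reader can see at a glance what is sent to which server.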
Re: [HACKERS] could not open relation with OID errors after promoting the standby to master
On Tue, May 22, 2012 at 9:50 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Hmm. I think that if you do it this way, the minimum recovery point won't be respected, which could leave you with a corrupted database. Now, if all the WAL files that you need are present in pg_xlog anyway, then they ought to get replayed anyway, but I think that if you are using restore_command (as opposed to streaming replication) we restore WAL segments under a different file name, which might cause this problem.

Maybe I should add some more details, so you get a better idea of what I did. The idea was to promote the standby to be the new master. Streaming replication was active, but at some point I had to take the master down. IIRC from the log, after the master went down the standby continued recovering from a bunch of archived log files (via restore_command); I suspected that either the standby was lagging behind a bit or that the master archived them during shutdown. When the standby didn't have anything left to recover from (saying both "xlog file foo doesn't exist" and "cannot connect to master"), I deleted recovery.conf on the standby and restarted it. I wouldn't have assumed any corruption was possible, given that I did clean shutdowns on both sides...
Re: [HACKERS] could not open relation with OID errors after promoting the standby to master
On Wed, May 16, 2012 at 11:38 PM, Alvaro Herrera <alvhe...@commandprompt.com> wrote:
> Well, that is not surprising in itself -- InitTempTableNamespace calls RemoveTempRelations to clean up from a possibly crashed previous backend with the same ID. So that part of the backtrace looks normal to me (unless there is something weird going on, which might very well be the case).

Right, I guess the stack trace is okay, but some state was obviously wrong. I was able to clean it up now by some catalog hacking, but I'm definitely going to dump and reload soon.

I found out that it was certain backend IDs that couldn't create temporary tables: when I did a CREATE TEMP TABLE in these few backend IDs (about 4-5, all with low ID numbers, which is why I hit them quite often), it would give me this "could not open relation with OID x" error. I also couldn't drop the temp schema in these backends:

# drop schema pg_temp_4;
ERROR: cache lookup failed for relation 1990987636
# select oid, * from pg_namespace;
(got oid 4664506 for pg_temp_4)
# select * from pg_class where oid = 1990987636;
(no rows returned)
# delete from pg_namespace where oid = 4664506;
DELETE 1
# create temp table myfoo(a int);
CREATE TABLE

Later on I also found some leftover pg_type entries from temporary tables that no longer exist. I'm quite certain I shouldn't see these anymore... And I also found a few entries in pg_class with relistemp='t' whose OIDs are considerably older than anything recent. This suggests that something weird might be going on when you have temp tables in flight and fail over; at least that's the only explanation I have for how this could have happened.
[HACKERS] could not open relation with OID errors after promoting the standby to master
I switched servers yesterday night and the previous slave is now the master. This is 9.0.6 (originally) / 9.0.7 (now) on Linux. Now I'm seeing a bunch of

ERROR: could not open relation with OID 1990987633
STATEMENT: create temp table seen_files (fileid integer)

Interestingly enough, 90% of these happen with CREATE TEMP TABLE statements, but I also see them with regular read-only SELECT statements; I'd say those are at most 10% of them. Some reports of this error message suggested running REINDEX, so I've run REINDEX TABLE and also VACUUM FULL on all system catalogs (just for good measure), but that didn't help much. Restarting the cluster didn't help either. If it matters, I did not promote the master with a trigger file, but restarted it after deleting recovery.conf. Everything else appears to be running fine, but since it's a) annoying and b) not very comforting, I might dump and restore tomorrow night.

I added a sleep() after this error message so that I could attach a debugger and grab a stack trace:

(gdb) bt
#0  0x003eb509a170 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x003eb5099fc4 in sleep () from /lib64/libc.so.6
#2  0x0046375c in relation_open (relationId=1990987633, lockmode=<value optimized out>) at heapam.c:906
#3  0x004b26e6 in heap_drop_with_catalog (relid=1990987633) at heap.c:1567
#4  0x004ace01 in doDeletion (object=0x62dfbc8, depRel=0x2b9ea756f6a8) at dependency.c:1046
#5  deleteOneObject (object=0x62dfbc8, depRel=0x2b9ea756f6a8) at dependency.c:1004
#6  0x004aeb00 in deleteWhatDependsOn (object=<value optimized out>, showNotices=0 '\000') at dependency.c:401
#7  0x004b6e90 in RemoveTempRelations () at namespace.c:3234
#8  InitTempTableNamespace () at namespace.c:3066
#9  0x004b70b5 in RangeVarGetCreationNamespace (newRelation=<value optimized out>) at namespace.c:351
#10 0x00536e13 in DefineRelation (stmt=0x638da00, relkind=114 'r', ownerId=0) at tablecmds.c:409
#11 0x0063c15f in standard_ProcessUtility (parsetree=<value optimized out>, queryString=0x6298460 "create temp table temp_defs_table_20068 (symhash char(32), symbolid integer)", params=0x0, isTopLevel=<value optimized out>, dest=<value optimized out>, completionTag=<value optimized out>) at utility.c:512
#12 0x00637d81 in PortalRunUtility (portal=0x61ec860, utilityStmt=0x62992d0, isTopLevel=1 '\001', dest=0x6299670, completionTag=0x7a9c1c30 "") at pquery.c:1191
#13 0x006384de in PortalRunMulti (portal=0x61ec860, isTopLevel=1 '\001', dest=0x6299670, altdest=0x6299670, completionTag=0x7a9c1c30 "") at pquery.c:1296
#14 0x00639732 in PortalRun (portal=0x61ec860, count=9223372036854775807, isTopLevel=1 '\001', dest=0x6299670, altdest=0x6299670, completionTag=0x7a9c1c30 "") at pquery.c:822
#15 0x006359e5 in exec_simple_query (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:1058
#16 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:3936
#17 0x005f95d6 in BackendRun () at postmaster.c:3560
#18 BackendStartup () at postmaster.c:3247
#19 ServerLoop () at postmaster.c:1431
#20 0x005fb934 in PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1092
#21 0x0058de36 in main (argc=3, argv=0x61dd1d0) at main.c:188
(gdb)

Any idea? It looks suspicious that it calls into RemoveTempRelations() from InitTempTableNamespace(), thereby removing the table it is about to create?
Re: [HACKERS] parallel pg_dump
On Tue, Apr 3, 2012 at 9:26 AM, Andrew Dunstan <and...@dunslane.net> wrote:
> First, either the creation of the destination directory needs to be delayed until all the sanity checks have passed and we're sure we're actually going to write something there, or it needs to be removed if we error exit before anything gets written there.

pg_dump also creates empty files, which is the analogous case here. Just try to dump a nonexistent database, for example (this also shows that delaying won't help...).

> Maybe pg_dump -F d should be prepared to accept an empty directory as well as a non-existent directory, just as initdb can.

That sounds like a good compromise. I'll implement that.

> Second, all the PrintStatus traces are annoying and need to be removed, or perhaps better only output in debugging mode (using ahlog() instead of just printf())

Sure, PrintStatus is just there for now to see what's going on. My plan was to remove it entirely in the final patch.

Joachim
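The "empty or non-existent directory" acceptance could be sketched like this. This is illustrative only (the committed pg_dump code may well differ in names and error handling); it shows the two cases a directory-format dump could accept, the same way initdb does:

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return true if 'path' does not exist, or exists and is an empty
 * directory -- the two cases a directory-format dump could accept.
 */
static bool
dir_is_usable(const char *path)
{
    DIR           *d = opendir(path);
    struct dirent *ent;
    bool           empty = true;

    if (d == NULL)
        return errno == ENOENT;     /* non-existent is fine; other errors are not */

    while ((ent = readdir(d)) != NULL)
    {
        if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0)
        {
            empty = false;          /* found a real entry: refuse to clobber it */
            break;
        }
    }
    closedir(d);
    return empty;
}
```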
Re: [HACKERS] patch for parallel pg_dump
On Tue, Apr 3, 2012 at 11:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
> OK, but it seems like a pretty fragile assumption that none of the workers will ever manage to emit any other error messages. We don't rely on this kind of assumption in the backend, which is a lot better-structured and less spaghetti-like than the pg_dump code.

Yeah, but even if they don't use exit_horribly(), the user would still see the output; stdout/stderr aren't closed and everyone can still write to them. As a test, I printed out some messages upon seeing a specific dump ID in a worker:

if (strcmp(command, "DUMP 3540") == 0)
{
    write_msg(NULL, "Info 1\n");
    printf("Info 2\n");
    exit_horribly(modulename, "that's why\n");
}

$ ./pg_dump -j 7 ...
pg_dump: Info 1
Info 2
pg_dump: [parallel archiver] that's why

if (strcmp(command, "DUMP 3540") == 0)
{
    write_msg(NULL, "Info 1\n");
    printf("Info 2\n");
    fprintf(stderr, "exiting on my own\n");
    exit(1);
}

$ ./pg_dump -j 7 ...
pg_dump: Info 1
Info 2
exiting on my own
pg_dump: [parallel archiver] A worker process died unexpectedly
Re: [HACKERS] patch for parallel pg_dump
On Tue, Apr 3, 2012 at 9:34 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sun, Apr 1, 2012 at 12:35 PM, Joachim Wieland <j...@mcknight.de> wrote:
>> Unfortunately this is not really the case. What is being moved out of pg_backup_archiver.c and into parallel.c is either the shutdown logic that has been applied only a few days ago, or is necessary to change the parallel restore logic from one-thread-per-dump-object to the message passing framework, where a worker starts in the beginning and then receives a new dump object from the master every time it's idle.
>
> Hmm. It looks to me like the part-two patch still contains a bunch of code rearrangement. For example, the current code for pg_backup_archiver.c contains this:
>
> typedef struct ParallelState
> {
>     int numWorkers;
>     ParallelStateEntry *pse;
> } ParallelState;
>
> In the revised patch, that's removed, and parallel.c instead contains this:
>
> typedef struct _parallel_state
> {
>     int numWorkers;
>     ParallelSlot *parallelSlot;
> } ParallelState;

Yes, this is what I referred to as the part of the shutdown logic that we only applied a few days ago. I basically backported what I had in parallel.c into pg_backup_archiver.c, which is where all the parallel logic lives at the moment. Moving it out of pg_backup_archiver.c and into a new parallel.c file means that I'd either have to move the declaration to a header or create accessor functions and declare them in a header. I actually tried it, and both solutions created more lines than they would save later on, especially counting the lines that will remove this temporary arrangement again...

The current parallel restore engine already has a ParallelSlot structure, but uses it in a slightly different way. That's why I created the one in the shutdown logic as ParallelStateEntry for now. This will be gone with the final patch; at the end there will only be a ParallelSlot left that serves both purposes. That's why you see this renaming (and the removal of the current ParallelSlot structure).

struct _parallel_state won't be used anywhere, except for forward declarations in headers. I just used it because that seemed to be the naming scheme; other structures are named similarly:

$ grep "struct _" pg_backup_archiver.c
typedef struct _restore_args
typedef struct _parallel_slot
typedef struct _outputContext

I'm fine with any name, just let me know what you prefer.

> On a similar note, what's the point of changing struct Archive to have int numWorkers instead of int number_of_jobs, and furthermore shuffling the declaration around to a different part of the struct?

number_of_jobs was in struct RestoreOptions before, a structure that is not used when doing a dump. I moved it to Archive, as it is used by both dump and restore, and since all other code talks about workers I changed the name to numWorkers.

> On another note, I am not sure that I like the messaging protocol you've designed. It seems to me that this has little chance of being reliable:
>
> + void (*on_exit_msg_func)(const char *modulename, const char *fmt, va_list ap) = vwrite_msg;
>
> I believe the idea here is that you're going to capture the dying gasp of the worker thread and send it to the master instead of printing it out. But that doesn't seem very reliable. There's code all over the place (and, in particular, in pg_dump.c) that assumes that we may write messages at any time, and in particular that we may emit multiple messages just before croaking.

I guess you're talking about the code in pg_dump that reads in the database schema and the details of all the objects in the schema. This code runs before forking off workers and is always executed in the master; pg_dump only forks once all the catalog data has been read, so only the actual TABLE DATA and BLOBs are dumped from the workers. I claim that in at least 90% of cases the functions involved here use exit_horribly() and output a clear message about why they're dying. If they just die instead, the master will say "worker died unexpectedly".

As you said a few mails back, any function exiting at this stage should rather call exit_horribly() to properly clean up after itself. The advantage of using the message passing system for the last error message is that you get exactly one message, and it's very likely that it accurately describes what happened to a worker to make it stop.

> Also, I like the idea of making it possible to use assertions in front-end code. But I think that if we're going to do that, we should make it work in general, not just for things that include pg_backup_archiver.h.

I completely agree; assertions helped a lot in dealing with concurrent code. How do you want to tackle this for now? Want me to create a separate header pg_assert.h as part of my patch? Or is it okay to factor it out later and include it from the general header then?
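A standalone front-end assertion header along the lines of the pg_assert.h idea could look like this. This is only a hypothetical sketch of what such a header might contain, not code from the patch; unlike the backend's Assert, it relies on nothing but stdio and abort():

```c
#include <stdio.h>
#include <stdlib.h>

/* Enable checks for this demonstration; real builds would get this
 * from the configure step (--enable-cassert). */
#define USE_ASSERT_CHECKING 1

#ifdef USE_ASSERT_CHECKING
#define Assert(condition) \
    do { \
        if (!(condition)) \
        { \
            fprintf(stderr, "assertion failed: %s, file %s, line %d\n", \
                    #condition, __FILE__, __LINE__); \
            abort(); \
        } \
    } while (0)
#else
#define Assert(condition) ((void) 0)
#endif
```

Because the macro compiles to ((void) 0) when assertions are disabled, it can be sprinkled through front-end code without any runtime cost in production builds.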
Re: [HACKERS] patch for parallel pg_dump
On Wed, Mar 28, 2012 at 5:19 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> First hurdle: It doesn't build under Windows/mingw-w64:
>
> parallel.c:40:12: error: static declaration of 'pgpipe' follows non-static declaration

Strange, I'm not seeing this, but I'm building with VC2005. What happens is that you're pulling in the pgpipe.h header. I have moved these functions as static functions into pg_dump, since you voted for removing them from the other location (because, as it turned out, nobody else is currently using them).

Joachim
Re: [HACKERS] patch for parallel pg_dump
On Thu, Mar 29, 2012 at 2:46 AM, Andrew Dunstan <and...@dunslane.net> wrote:
> But your patch hasn't got rid of them, and so it's declared twice. There is no pgpipe.h, BTW, it's declared in port.h. If VC2005 doesn't complain about the double declaration then that's a bug in the compiler, IMNSHO. Doesn't it even issue a warning? I no longer use VC2005 for anything, BTW, I use VC2008 or later.

I agree, the compiler should have found it, but no, I don't even get a warning. I just verified it: when I add a #error right after the prototypes in port.h, it hits the #error on every file, so it definitely sees both prototypes and doesn't complain... cl.exe is running with /W3.

> Anyway, ISTM the best thing is just for us to get rid of pgpipe without further ado. I'll try to get a patch together for that.

Thanks.
Re: [HACKERS] patch for parallel pg_dump
On Wed, Mar 28, 2012 at 1:46 PM, Robert Haas robertmh...@gmail.com wrote: I'm wondering if we really need this much complexity around shutting down workers. I'm not sure I understand why we need both a hard and a soft method of shutting them down. At least on non-Windows systems, it seems like it would be entirely sufficient to just send a SIGTERM when you want them to die. They don't even need to catch it; they can just die. At least on my Linux test system, even if all pg_dump processes are gone, the server happily continues sending data. When I strace an individual backend process, I see a lot of Broken pipe writes, but that doesn't stop it from just writing out the whole table to a closed file descriptor. This is a 9.0-latest server.

--- SIGPIPE (Broken pipe) @ 0 (0) ---
read(13, \220\370\0\0\240\240r\266\3\0\4\0\264\1\320\1\0 \4 \0\0\0\0\270\237\220\0h\237\230\0..., 8192) = 8192
read(13, \220\370\0\0\350\300r\266\3\0\4\0\264\1\320\1\0 \4 \0\0\0\0\260\237\230\0h\237\220\0..., 8192) = 8192
sendto(7, d\0\0\0Acpp\t15.0\t124524\taut..., 8192, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
read(13, \220\370\0\\341r\266\3\0\5\0\260\1\340\1\0 \4 \0\0\0\0\270\237\220\0p\237\220\0..., 8192) = 8192
sendto(7, d\0\0\0Dcpp\t15.0\t1245672000\taut..., 8192, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
read(13, <unfinished ...>

I guess that https://commitfest.postgresql.org/action/patch_view?id=663 would take care of that in the future, but without it (i.e. anything <= 9.2) it's quite annoying if you want to Ctrl-C a pg_dump and then have to manually hunt down and kill all the backend processes. I tested the above with immediately returning from DisconnectDatabase() in pg_backup_db.c on Linux. The important thing is that it calls PQcancel() on its connection before terminating. On Windows, several pages indicate that you can only cleanly terminate a thread from within the thread, e.g.
the last paragraph on http://msdn.microsoft.com/en-us/library/windows/desktop/ms686724%28v=vs.85%29.aspx The patch is basically doing what this page is recommending. I'll try your proposal about terminating in the child when the connection is closed, that sounds reasonable and I don't see an immediate problem with that. The existing coding of on_exit_nicely is intentional, and copied from similar logic for on_shmem_exit in the backend. Is there really a compelling reason to reverse the firing order of exit hooks? No, reversing the order was not intended. I rewrote it to a for loop because the current implementation modifies a global variable, and so on Windows only one thread would call the exit hook. I'll add all your other suggestions to the next version of my patch. Thanks! Joachim
Re: [HACKERS] patch for parallel pg_dump
On Fri, Mar 23, 2012 at 11:11 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: Are you going to provide a rebased version? Rebased version attached, this patch also includes Robert's earlier suggestions. parallel_pg_dump_5.diff.gz
Re: [HACKERS] patch for parallel pg_dump
On Fri, Mar 23, 2012 at 11:11 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: Are you going to provide a rebased version? Yes, working on that.
[HACKERS] double free in current HEAD's pg_dump
There's a double free in the current HEAD's pg_dump. Fix attached.

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2b0a5ff..57a6ccb 100644
*** a/src/bin/pg_dump/pg_dump.c
--- b/src/bin/pg_dump/pg_dump.c
*** dumpBlobs(Archive *fout, void *arg)
*** 2372,2379 ****
  		PQclear(res);
  	} while (ntups > 0);
  
- 	PQclear(res);
- 
  	return 1;
  }
--- 2372,2377 ----
Re: [HACKERS] patch for parallel pg_dump
On Fri, Mar 16, 2012 at 12:06 AM, Robert Haas robertmh...@gmail.com wrote: Good. The only exit handler I've seen so far is pgdump_cleanup_at_exit. If there's no other one, is it okay to remove all of this stacking functionality (see on_exit_nicely_index / MAX_ON_EXIT_NICELY) from dumputils.c and just define two global variables, one for the function and one for the arg that this function would operate on (or a struct of both)? No. That code is included by other things - like pg_dumpall - that don't know there's such a thing as an Archive. But I don't see that as a big problem; just on_exit_nicely() whatever you want. We could also add on_exit_nicely_reset(), if needed, to clear the existing handlers. Yes, on_exit_nicely_reset() would be what I'd need to remove all callbacks from the parent after the fork in the child process. I still can't find any other hooks except for pgdump_cleanup_at_exit from pg_dump.c. I guess what you're saying is that we provide dumputils.c to other programs, and even though none of them currently sets any exit callback, you want to keep the functionality so that they can set multiple exit hooks in the future should the need for them arise.
Re: [HACKERS] patch for parallel pg_dump
On Wed, Mar 14, 2012 at 2:02 PM, Robert Haas robertmh...@gmail.com wrote: I think we should get rid of die_horribly(), and instead arrange to always clean up AH via an on_exit_nicely hook. Good. The only exit handler I've seen so far is pgdump_cleanup_at_exit. If there's no other one, is it okay to remove all of this stacking functionality (see on_exit_nicely_index / MAX_ON_EXIT_NICELY) from dumputils.c and just define two global variables, one for the function and one for the arg that this function would operate on (or a struct of both)? We'd then have the current function and AHX (or only AH->connection from it) in the non-parallel case, and as soon as we enter the parallel dump, we can exchange it for another function operating on ParallelState*. This avoids having to deal with thread-local storage on Windows, because ParallelState* is just large enough to hold all the required data and a specific thread can easily find its own slot with its threadId. Sure, but since all the function does is write to it or access it, what good does that do me? It encapsulates the variable so that it can only be used for one specific use case. Seems pointless to me. Not so much to me if the alternative is to make ParallelState* a global variable, but anyway, with the concept proposed above, ParallelState* would be the arg that the parallel exit handler would operate on, so it would indeed be global but hidden behind a different name and a void* pointer. (I will address all the other points you brought up in my next patch)
Re: [HACKERS] patch for parallel pg_dump
On Wed, Mar 14, 2012 at 4:39 PM, Andrew Dunstan aduns...@postgresql.org wrote: I've just started looking at the patch, and I'm curious to know why it didn't follow the pattern of parallel pg_restore, which created a new worker for each table rather than passing messages to looping worker threads as this appears to do. That might have avoided a lot of the need for this message passing infrastructure, if it could have been done. But maybe I just need to review the patch and the discussions some more. The main reason for this design has now been overcome by the flexibility of the synchronized snapshot feature, which makes it possible to get the snapshot of a transaction even if that other transaction has been running for quite some time already. In other previously proposed implementations of this feature, workers had to connect at the same time and then could not close their transactions without losing the snapshot. The other drawback of the fork-per-TocEntry approach is the somewhat limited bandwidth of information from the worker back to the master: it's basically just the return code. That's fine if there is no error, but if there is, then the master can't tell any further details (e.g. could not get lock on table foo, or could not write to file bar: no space left on device). This restriction does not only apply to error messages. For example, what I'd also like to have in pg_dump would be checksums on a per-TocEntry basis. The individual workers would calculate the checksums when writing the file and then send them back to the master for integration into the TOC. I don't see how such a feature could be implemented in a straightforward way without a message passing infrastructure.
Re: [HACKERS] patch for parallel pg_dump
On Tue, Mar 13, 2012 at 1:53 PM, Robert Haas robertmh...@gmail.com wrote: What I mean is that the function ArchiveEntry() is defined in pg_backup_archiver.c, and it takes an argument called relpages, and the string relpages does not appear anywhere else in that file. Uhm, that's kinda concerning, isn't it... fixed... [...pgpipe...] Dunno. Can we get an opinion on that from one of the Windows guys? Andrew, Magnus? Waiting for the verdict here... If a child terminates without leaving a message, the master will still detect it and just say a worker process died unexpectedly (this part was actually broken, but now it's fixed :-) ) All that may be true, but I still don't see why it's right for this to apply in the cases where the worker thread says die_horribly(), but not in the cases where the worker says exit_horribly(). Hm, I'm not calling the error handler from exit_horribly because it doesn't have the AH. It looks like the code assumes that die_horribly() is called whenever AH is available and if not, exit_horribly() should be called which eventually calls these preregistered exit-hooks via exit_nicely() to clean up the connection. I think we should somehow unify both functions, the code is not very consistent in this respect, it also calls exit_horribly() when it has AH available. See for example pg_backup_tar.c Or is there another distinction between them? The question how to clean it up basically brings us back to the question what to do about global variables in general and for error handlers in particular. Or we change fmtQualifiedId to take an int and then we always pass the archive version instead of the Archive* ? Hmm, I think that might make sense. Done. 
+enum escrow_action { GET, SET };
+static void
+parallel_error_handler_escrow_data(enum escrow_action act, ParallelState *pstate)
+{
+	static ParallelState *s_pstate = NULL;
+
+	if (act == SET)
+		s_pstate = pstate;
+	else
+		*pstate = *s_pstate;
+}

This seems like a mighty complicated way to implement a global variable. Well, we talked about that before, when you complained that you couldn't get rid of the global g_conn because of the exit handler. You're right that in fact it is an indirect global variable here, but it's clearly limited to the use of the error handler and you can be sure that nobody other than this function writes to it or accesses it without calling this function. Sure, but since all the function does is write to it or access it, what good does that do me? It encapsulates the variable so that it can only be used for one specific use case. Attaching a new version. parallel_pg_dump_4.diff.gz
Re: [HACKERS] patch for parallel pg_dump
On Thu, Feb 16, 2012 at 1:29 PM, Robert Haas robertmh...@gmail.com wrote: Can you provide an updated patch? Robert, updated patch is attached. parallel_pg_dump_2.diff.gz
Re: [HACKERS] patch for parallel pg_dump
On Wed, Feb 8, 2012 at 1:21 PM, Robert Haas robertmh...@gmail.com wrote: In my patch I dealt with exactly the same problem for the error handler by creating a separate function that has a static variable (a pointer to the ParallelState). The value is set and retrieved through the same function, so yes, it's kinda global but then again it can only be accessed from this function, which is only called from the error handler. How did you make this thread-safe? The ParallelState has a ParallelSlot for each worker process which contains that worker process's thread id. So a worker process just walks through the slots until it finds its own thread id and then goes from there. In particular, it only gets the file descriptors to and from the parent from this structure, to communicate the error it encountered, but it doesn't get the PGconn. This is because that error handler is only called when a worker calls die_horribly(AH, ...) in which case the connection is already known through AH. Termination via a signal just sets a variable that is checked in the I/O routines and there we also have AH to shut down the connection (actually to call die_horribly()). So we at least need to press on far enough to get to that point. Sure, let me know if I can help you with something.
Re: [HACKERS] patch for parallel pg_dump
On Tue, Feb 7, 2012 at 4:59 PM, Robert Haas robertmh...@gmail.com wrote: It turns out that (as you anticipated) there are some problems with my previous proposal. I assume you're talking to Tom, as my powers of anticipation are actually quite limited... :-) This is not quite enough to get rid of g_conn, but it's close: the major stumbling block at this point is probably exit_nicely(). The gyrations we're going through to make sure that AH->connection gets closed before exiting are fairly annoying; maybe we should invent something in dumputils.c along the line of the backend's on_shmem_exit(). Yeah, this becomes even more important with parallel jobs, where you want all worker processes to die once the parent exits. Otherwise some 10 already started processes would continue to dump your 10 largest tables for the next few hours with the master process long dead... All while you're about to start up the next master process... In my patch I dealt with exactly the same problem for the error handler by creating a separate function that has a static variable (a pointer to the ParallelState). The value is set and retrieved through the same function, so yes, it's kinda global but then again it can only be accessed from this function, which is only called from the error handler. I'm starting to think it might make sense to press on with this refactoring just a bit further and eliminate the distinction between Archive and ArchiveHandle. How about doing more refactoring after applying the patch? You'd then see what is really needed and then we'd also have an actual use case for more than one connection (You might have already guessed that this proposal is heavily influenced by my self-interest of avoiding too much work to make my patch match your refactoring)...
Re: [HACKERS] patch for parallel pg_dump
On Fri, Feb 3, 2012 at 7:52 AM, Robert Haas robertmh...@gmail.com wrote: If you're OK with that much I'll go do it. Sure, go ahead!
Re: [HACKERS] patch for parallel pg_dump
On Wed, Feb 1, 2012 at 12:24 PM, Robert Haas robertmh...@gmail.com wrote: I think we're more-or-less proposing to rename Archive to Connection, aren't we? And then ArchiveHandle can store all the things that aren't related to a specific connection. How about something like that: (Hopefully you'll come up with better names...)

StateHandle
{
    Connection
    Schema
    Archive
    Methods
    union
    {
        DumpOptions
        RestoreOptions
    }
}

Dumping would mean to do: Connection -> Schema -> Archive using DumpOptions through the specified Methods. Restore: Archive -> Schema -> Connection using RestoreOptions through the specified Methods. Dumping from one database and restoring into another one would be two StateHandles with different Connections, Archive == NULL, Schema pointing to the same Schema, Methods most likely also pointing to the same function pointer table and each with different Options in the union of course. Granted, you could say that above I've only grouped the elements of the ArchiveHandle, but I don't really see that breaking it up into several objects makes it any better or easier. There could be some accessor functions to hide the details of the whole object to the different consumers.
Re: [HACKERS] patch for parallel pg_dump
On Tue, Jan 31, 2012 at 9:05 AM, Robert Haas robertmh...@gmail.com wrote: And just for added fun and excitement, they all have inconsistent naming conventions and inadequate documentation. I think if we need more refactoring in order to support multiple database connections, we should go do that refactoring. The current situation is not serving anyone well. I guess I'd find it cleaner to have just one connection per Archive (or ArchiveHandle). If you need two connections, why not just have two Archive objects, as they would have different characteristics anyway, one for dumping data, one to restore.
Re: [HACKERS] patch for parallel pg_dump
On Mon, Jan 30, 2012 at 12:20 PM, Robert Haas robertmh...@gmail.com wrote: But the immediate problem is that pg_dump.c is heavily reliant on global variables, which isn't going to fly if we want this code to use threads on Windows (or anywhere else). It's also bad style. Technically, since most of pg_dump.c dumps the catalog and since this isn't done in parallel but only in the master process, most functions need not be changed for the parallel restore. Only those that are called from the worker threads need to be changed; this has been done in e.g. dumpBlobs(), the change that you quoted upthread. But it seems possible that we might someday want to dump from one database and restore into another database at the same time, so maybe we ought to play it safe and use different variables for those things. Actually I've tried that but in the end concluded that it's best to have at most one database connection in an ArchiveHandle if you don't want to do a lot more refactoring. Besides the normal connection parameters like host, port, ... there's also std_strings, encoding, savedPassword, currUser/currSchema, lo_buf, remoteVersion ... where it wouldn't be obvious which connection they belonged to. Speaking about refactoring, I'm happy to also throw in the idea of making the dump and restore more symmetrical than they are now. I kinda disliked RestoreOptions* being a member of the ArchiveHandle without having something similar for the dump. Ideally I'd say there should be a DumpOptions and a RestoreOptions field (with a struct Connection being part of them containing all the different connection parameters). They could be a union if you wanted to allow only one connection, or not if you want more than one.
Re: [HACKERS] patch for parallel pg_dump
On Fri, Jan 27, 2012 at 4:57 PM, Robert Haas robertmh...@gmail.com wrote: But even if you do know that subclassing is intended, that doesn't prove that the particular Archive object is always going to be an ArchiveHandle under the hood. If it is, why not just pass it as an ArchiveHandle to begin with? I know that you took back some of your comments, but I'm with you here. Archive is allocated as an ArchiveHandle and then cast back to Archive*, so you always know that an Archive is an ArchiveHandle. I'm all for getting rid of Archive and just using ArchiveHandle throughout pg_dump, which would get rid of these useless casts. I admit that I might have made it a bit worse by adding a few more of these casts, but the fundamental issue was already there and there is precedence for casting between them in both directions :-) Joachim
Re: [HACKERS] Sending notifications from the master to the standby
On Tue, Jan 10, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote: So this design is non-optimal both for existing uses and for the proposed new uses, which means nobody will like it. You could ameliorate #1 by adding a GUC that determines whether NOTIFY actually writes WAL, but that's pretty ugly. In any case ISTM that problem #2 means this design is basically broken. I chose to do it this way because it seemed like the most natural way to do it (which of course doesn't mean it's the best) :-). I agree that there should be a way to avoid the replication of the NOTIFYs. Regarding your second point though, remember that on the master we write notifications to the queue in pre-commit. And we also don't interleave notifications of different transactions. So once the commit record makes it to the standby, all the notifications are already there, just as on the master. In a burst of notifications, both solutions should more or less behave the same way but yes, the one involving the WAL file would be slower as it goes to the file system and back. I wonder whether it'd be practical to not involve WAL per se in this at all, but to transmit NOTIFY messages by having walsender processes follow the notify stream (as though they were listeners) and send the notify traffic as a separate message stream interleaved with the WAL traffic. Agreed, having walsender/receiver work as NOTIFY proxies is kinda smart... Joachim
Re: [HACKERS] Sending notifications from the master to the standby
On Tue, Jan 10, 2012 at 11:55 AM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: [ Tom sketches a design ] Seems a bit overcomplicated. I was just thinking of having walreceiver note the WAL endpoint at the instant of receipt of a notify message, and not release the notify message to the slave ring buffer until WAL replay has advanced that far. How about this: We mark a notify message specially if it is the last message sent by a transaction, and also add a flag to commit/abort records indicating whether or not the transaction has sent notifys. Now if such a last message is being put into the regular ring buffer on the standby and the xid is known to have committed or aborted, signal the backends. Also signal from a commit/abort record if the flag is set. If the notify messages make it to the standby first, we just put messages of a not-yet-committed transaction into the queue, just as on the master. Listeners will get signaled when the commit record arrives. If the commit record arrives first, we signal, but the listeners won't find anything (at least not the latest notifications). When the last notify of that transaction finally arrives, the transaction is known to have committed and the listeners will get signaled. What could still happen is that the standby receives notifys, the commit message and more notifys. Listeners would still eventually get all the messages, but potentially not all of them at once. Is this a problem? If so, then we could add a special "stop reading" record into the queue before we write the notifys, that we subsequently change into a "continue reading" record once all notifications are in the queue. Readers would treat a "stop reading" record just like a not-yet-committed transaction and ignore a "continue reading" record. Suggest we add something to initial handshake from standby to say please send me notify traffic, +1 on that.
From what you said I imagined this walsender listener as a regular listener that listens on the union of all sets of channels that anybody is listening on on the standby, with the LISTEN transaction on the standby returning from commit once the listener is known to have been set up on the master. Joachim
Re: [HACKERS] pg_restore --no-post-data and --post-data-only
On Tue, Nov 15, 2011 at 6:14 PM, Andrew Dunstan and...@dunslane.net wrote: Updated version with pg_restore included is attached. The patch applies with some fuzz by now but compiles without errors or warnings. The feature just works; it is not adding a lot of new code, basically it parses the given options and then skips over steps depending on the selected section. I verified the equivalence of -a and -s to the respective sections in the different archive formats and no surprise here either, they were equivalent except for the header (which has a timestamp). If you ask pg_restore to restore a section out of an archive which doesn't have this section, there is no error and the command just succeeds. This is what I expected and I think it's the right thing to do, but maybe others think that there should be a warning. In pg_restore, pre-data cannot be run in parallel, it only runs serially; data and post-data can run in parallel, though. This is also what I had expected, but it might be worth adding a note about this to the documentation. What I didn't like about the implementation was the two set_section() functions; I'd prefer them to move to a file that is shared between pg_dump and pg_restore and become one function...

Minor issues:

{section, required_argument, NULL, 5} in pg_dump.c is not in the alphabetical order of the options.

./pg_restore --section=foobar
pg_restore: unknown section name foobar)

Note the trailing ')', it's coming from a _(...) confusion.

Some of the lines in the patch have trailing spaces, and in the documentation part tabs and spaces are mixed.

int skip used as bool skip in dumpDumpableObject().

Joachim
Re: [HACKERS] synchronized snapshots
Hi Marko, On Wed, Sep 28, 2011 at 2:29 AM, Marko Tiikkaja marko.tiikk...@2ndquadrant.com wrote: In a sequence such as this: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; INSERT INTO foo VALUES (-1); SELECT pg_export_snapshot(); the row added to foo is not visible in the exported snapshot. If that's the desired behaviour, I think it should be mentioned in the documentation. Yes, that's the desired behaviour; the patch adds this paragraph to the documentation already: Also note that even after the synchronization both clients still run their own independent transactions. As a consequence, even though synchronized with respect to reading pre-existing data, both transactions won't be able to see each other's uncommitted data. I'll take a look at the other issues and update the patch either tonight or tomorrow. Thank you, Joachim
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 3:47 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 15.08.2011 04:31, Joachim Wieland wrote: The one thing that it does not implement is leaving the transaction in an aborted state if the BEGIN TRANSACTION command failed for an invalid snapshot identifier. So what if the snapshot is invalid, the SNAPSHOT clause silently ignored? That sounds really bad. No, the command would fail, but since it fails, it doesn't change the transaction state. What was proposed originally was to start a transaction but throw an error that leaves the transaction in the aborted state. But then the command had some effect because it started a transaction block, even though it failed. I can certainly see that this would be useful but I am not sure if it justifies introducing this inconsistency. We would have a BEGIN TRANSACTION command that left the session in a different state depending on why it failed... I don't understand what inconsistency you're talking about. What else can cause BEGIN TRANSACTION to fail? Is there currently any failure mode that doesn't leave the transaction in aborted state? Granted, it might only fail for parse errors so far, but that would include for example sending BEGIN DEFERRABLE to a pre-9.1 server. It wouldn't start a transaction and leave it in an aborted state, but it would just fail. I am wondering if pg_export_snapshot() is still the right name, since the snapshot is no longer exported to the user. It is exported to a file but that's an implementation detail. It's still exporting the snapshot to other sessions, that name still seems appropriate to me. ok. Joachim
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 6:41 AM, Florian Weimer fwei...@bfk.de wrote: * Simon Riggs: I don't see the need to change the BEGIN command, which is SQL Standard. We don't normally do that. Some language bindings treat BEGIN specially, so it might be difficult to use this feature. It's true, the command might require explicit support from language bindings. However I used some perl test scripts, where you can also send a START TRANSACTION command in a $dbh->do(...). The intended use case of this feature is still pg_dump btw... Joachim
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 6:09 PM, Jim Nasby j...@nasby.net wrote: I suspect that all the other cases of BEGIN failing would be syntax errors, so you would immediately know in testing that something was wrong. A missing file is definitely not a syntax error, so we can't really depend on user testing to ensure this is handled correctly. IMO, that makes it critical that that error puts us in an aborted transaction. Why can we not just require the user to verify if his BEGIN query failed or succeeded? Is that really too much to ask for? Also see what Robert wrote about proxies in between that keep track of the transaction state. Consider they see a BEGIN query that fails. How would they know if the session is now in an aborted transaction or not in a transaction at all? Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Mon, Feb 28, 2011 at 6:38 PM, Robert Haas robertmh...@gmail.com wrote: Remember that it's not only about saving shared memory, it's also about making sure that the snapshot reflects a state of the database that has actually existed at some point in the past. Furthermore, we can easily invalidate a snapshot that we have published earlier by deleting its checksum in shared memory as soon as the original transaction commits/aborts. And for these two a checksum seems to be a good fit. Saving memory then comes as a benefit and makes all those happy who don't want to argue about how many slots to reserve in shared memory or don't want to have another GUC for what will probably be a low-usage feature. But you can do all of this with files too, can't you? Just remove or truncate the file when the snapshot is no longer valid. Sure we can, but it looked like the consensus of the first discussion was that the through-the-client approach was more flexible. But then again nobody is actively arguing for that anymore.
Re: [HACKERS] Snapshot synchronization, again...
On Sun, Feb 27, 2011 at 3:04 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Why exactly, Heikki do you think the hash is more troublesome? It just feels wrong to rely on cryptography just to save some shared memory. Remember that it's not only about saving shared memory, it's also about making sure that the snapshot reflects a state of the database that has actually existed at some point in the past. Furthermore, we can easily invalidate a snapshot that we have published earlier by deleting its checksum in shared memory as soon as the original transaction commits/aborts. And for these two a checksum seems to be a good fit. Saving memory then comes as a benefit and makes all those happy who don't want to argue about how many slots to reserve in shared memory or don't want to have another GUC for what will probably be a low-usage feature. I realize that there are conflicting opinions on this, but from user point-of-view the hash is just a variant of the idea of passing the snapshot through shared memory, just implemented in an indirect way. The user will never see the hash, why should he bother? The user point of view is that he receives data and can obtain the same snapshot if he passed that data back. This user experience is no different from any other way of passing the snapshot through the client. And from the previous discussions this seemed to be what most people wanted. And how could we validate/invalidate snapshots without the checksum (assuming the through-the-client approach instead of storing the whole snapshot in shared memory)? Either you accept anything that passes sanity checks, or you store the whole snapshot in shared memory (or a temp file). I'm not sure which is better, but they both seem better than the hash. True, both might work but I don't see a real technical advantage over the checksum approach for any of them, rather the opposite. 
Nobody has come up with a use case for the accept-anything option so far, so I don't see why we should commit ourselves to this feature at this point, given that we have a cheap and easy way of validating/invalidating snapshots. And I might be just paranoid but I also fear that someone could raise security issues for the fact that you would be able to request an arbitrary database state from the past and inspect changes of other people's transactions. We might want to allow that later though, and I realize that we have to allow it for a standby server that would take over a snapshot from the master anyway, but I don't want to add this complexity to this first patch. However, I want to be able to allow it in the future without touching the external API of the feature. And for the tempfile approach, I don't see that the creation and removal of the temp file is any less code complexity than flipping a number in shared memory. Also it seemed that people rather wanted to go with the through-the-client approach because it seems to be more flexible. Maybe you should just look at it as a conservative accept-anything approach that uses a checksum to allow only snapshots that we know have existed and have been published. Does the checksum still look so weird from this perspective? Joachim
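The export/validate/revoke cycle described in this subthread can be modeled outside PostgreSQL. Below is a minimal Python sketch (all names hypothetical; the real patch keeps the md5 digests in shared memory, one slot per backend): the client carries the full snapshot text, the server keeps only its digest, and deleting the digest revokes the snapshot.

```python
import hashlib

# Hypothetical stand-in for the shared-memory checksum slots.
published_checksums = set()

def export_snapshot(snapshot_text):
    """Publish a snapshot: hand the full text to the client, keep only its digest."""
    digest = hashlib.md5(snapshot_text.encode()).hexdigest()
    published_checksums.add(digest)
    return snapshot_text  # the client receives the snapshot itself, never the digest

def import_snapshot(snapshot_text):
    """Accept the snapshot only if its digest is still published, i.e. it is a
    snapshot that has really existed and whose exporter is still around."""
    digest = hashlib.md5(snapshot_text.encode()).hexdigest()
    return digest in published_checksums

def revoke_snapshot(snapshot_text):
    """Called when the exporting transaction commits or aborts."""
    published_checksums.discard(hashlib.md5(snapshot_text.encode()).hexdigest())

snap = "xmin:100 xmax:105 xip:101 xip:103"
export_snapshot(snap)
assert import_snapshot(snap)                  # valid while the exporter is alive
assert not import_snapshot("xmin:1 xmax:2")   # never published, so rejected
revoke_snapshot(snap)
assert not import_snapshot(snap)              # exporter finished, snapshot revoked
```

This is the "conservative accept-anything" view from the mail: arbitrary data can be presented, but only data matching a published checksum is accepted.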
Re: [HACKERS] Snapshot synchronization, again...
On Tue, Feb 22, 2011 at 3:34 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 22.02.2011 16:29, Robert Haas wrote: On Tue, Feb 22, 2011 at 8:58 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: No, the hash is stored in shared memory. The hash of the garbage has to match. Oh. Well that's really silly. At that point you might as well just store the snapshot and an integer identifier in shared memory, right? Yes, that's the point I was trying to make. I believe the idea of a hash was that it takes less memory than storing the whole snapshot (and more importantly, a fixed amount of memory per snapshot). But I'm not convinced either that dealing with a hash is any less troublesome. Both Tom and Robert voted quite explicitly against the store-in-shared-memory idea. Others don't want to allow people to request arbitrary snapshots, and again others wanted to pass the snapshot through the client so that in the future we could also request snapshots from standby servers. The hash idea seemed to at least make nobody unhappy. For easier review, here are a few links to the previous discussion: http://archives.postgresql.org/pgsql-hackers/2010-12/msg00361.php http://archives.postgresql.org/pgsql-hackers/2010-12/msg00383.php http://archives.postgresql.org/pgsql-hackers/2010-12/msg00481.php http://archives.postgresql.org/pgsql-hackers/2010-12/msg02454.php Why exactly, Heikki, do you think the hash is more troublesome? And how could we validate/invalidate snapshots without the checksum (assuming the through-the-client approach instead of storing the whole snapshot in shared memory)? Joachim
Re: [HACKERS] Snapshot synchronization, again...
Hi, On Mon, Feb 21, 2011 at 4:56 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: What's the reason for this restriction? if (databaseId != MyDatabaseId) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg(cannot import snapshot from a different database))); I just couldn't think of a use case for it and so didn't want to introduce a feature that we might then have to support in the future. Why are we using bytea as the output format instead of text? The output is just plain text anyway, as can be seen by setting bytea_output to 'escape'. Perhaps there would be a value to using bytea if we were to pglz_compress the result, but would there be a point in doing so? Normally exported info should be short enough that it would cause more overhead than it would save anyway. It is bytea because it should be thought of as just some data. It should be regarded more as a token than as text and should not be inspected/interpreted at all. If anybody decides to do so, fine, but then they should not rely on the stability of the message format for the future. Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Sat, Feb 19, 2011 at 9:17 PM, Peter Eisentraut pete...@gmx.net wrote: The only consideration against MD5 might be that it would make us look quite lame. We should probably provide builtin SHA1 and SHA2 functions for this and other reasons. In this particular case however the checksum is never exposed to the user but stays within shared memory. So nobody would notice that we are still using lame old md5sum :-) Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: 1. why are you using the expansible char array stuff instead of using the StringInfo facility? 2. is md5 the most appropriate digest for this? If you need a cryptographically secure hash, do we need something stronger? If not, why not just use hash_any? We don't need a cryptographically secure hash. There is no special reason for why it is like it is, I just didn't think of the better alternatives that you are proposing. Should I send an updated patch? Anything else? Thanks for the review, Joachim
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Tue, Feb 8, 2011 at 8:31 PM, Itagaki Takahiro itagaki.takah...@gmail.com wrote: On Tue, Feb 8, 2011 at 13:34, Robert Haas robertmh...@gmail.com wrote: So how close are we to having a committable version of this? Should we push this out to 9.2? I think so. The feature is pretty attractive, but more work is required: * Re-base on synchronized snapshots patch * Consider to use pipe also on Windows. * Research libpq + fork() issue. We have a warning in docs: http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html | On Unix, forking a process with open libpq connections can lead to unpredictable results Just for the record, once the sync snapshot patch is committed, there is no need to do fancy libpq + fork() combinations anyway. Unfortunately, so far no committer has commented on the synchronized snapshot patch at all. I am not fighting for getting parallel pg_dump done in 9.1, as I don't really have a personal use case for the patch. However it would be the irony of the year if we shipped 9.1 with a synchronized snapshot patch but no parallel dump :-) Joachim
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
Hi Jaime, thanks for your review! On Sun, Feb 6, 2011 at 2:12 PM, Jaime Casanova ja...@2ndquadrant.com wrote: code review: something i found, and is a very simple one, is this warning (there's a similar issue in _StartMasterParallel with the buf variable) pg_backup_directory.c: In function ‘_EndMasterParallel’: pg_backup_directory.c:856: warning: ‘status’ may be used uninitialized in this function Cool. My compiler didn't tell me about this. i guess the huge amount of info is showing the patch is just for debugging and will be removed before commit, right? That's right. functional review: it works good most of the time, just a few points: - if i interrupt the process the connections stay, i guess it could catch the signal and finish the connections Hm, well, recovering gracefully from errors could be improved. In your example you would signal the children implicitly because the parent process dies and the pipes to the children would get broken as well. Of course the parent could more actively terminate the children but it might not be the best option to just kill them, as then there will be a lot of unexpected EOF connections in the log. So if an error condition comes up in the parent (as in your example, because you canceled the process), then ideally the parent should signal the children with a non-lethal signal and the children should catch this "please terminate" signal and exit cleanly but as soon as possible. If the error case comes up at the child however, then we'd need to make sure that the user sees the error message from the child. This should work well as-is but currently it could happen that the parent exits before all of the children have exited. I'll investigate this a bit... 
- if i have an exclusive lock on a table and a worker starts dumping it, it fails because it can't take the lock but it just say it was ok and would prefer an error I'm getting a clear pg_dump: [Archivierer] could not lock table public.c: ERROR: could not obtain lock on relation c but I'll look into this as well. Regarding your other post: - there is no docs True... - pg_dump and pg_restore are inconsistent: pg_dump requires the directory to be provided with the -f option: pg_dump -Fd -f dir_dump pg_restore pass the directory as an argument for -Fd: pg_restore -Fd dir_dump Well, this is there with pg_dump and pg_restore currently as well. -F is the switch for the format and it just takes d as the format. The dir_dump is an option without any switch. See the output for the --help switches: Usage: pg_dump [OPTION]... [DBNAME] Usage: pg_restore [OPTION]... [FILE] So in either case you don't need to give a switch for what you have. If you run pg_dump you don't give the switch for the database but you need to give it for the output (-f) and with pg_restore you don't give a switch for the file that you're restoring but you'd need to give -d for restoring to a database. Joachim
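The clean-shutdown sequence sketched in the reply above (parent sends a non-lethal "please terminate" signal, children catch it and exit cleanly, parent waits until all children have exited) can be modeled with a plain fork/signal/pipe sketch. This assumes a Unix platform and is illustrative only, not the pg_dump code:

```python
import os
import signal
import time

# Parent/child pair: the parent asks the child to stop with SIGTERM; the
# child catches it and reports a clean exit through a pipe before exiting,
# instead of dying mid-work and leaving "unexpected EOF" noise behind.
r, w = os.pipe()
pid = os.fork()
if pid == 0:                        # child ("worker")
    os.close(r)
    def on_term(signum, frame):
        os.write(w, b"clean-exit")  # finish up, then leave
        os._exit(0)
    signal.signal(signal.SIGTERM, on_term)
    while True:                     # pretend to be busy dumping
        time.sleep(0.05)
else:                               # parent ("master")
    os.close(w)
    time.sleep(0.2)                 # parent decides to shut down
    os.kill(pid, signal.SIGTERM)    # the non-lethal "please terminate" signal
    msg = os.read(r, 64)
    os.waitpid(pid, 0)              # do not exit before the child has exited
    assert msg == b"clean-exit"
```

The `waitpid` at the end is the point raised in the mail: the parent must not exit before all of the children have.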
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Thu, Feb 3, 2011 at 11:46 PM, Itagaki Takahiro itagaki.takah...@gmail.com wrote: I think we have 2 important technical issues here: * The consistency is not perfect. Each transaction is started with small delays in step 1, but we cannot guarantee no other transaction between them. This is exactly where the patch for synchronized snapshots comes into play. See https://commitfest.postgresql.org/action/patch_view?id=480 * Can we inherit connections to child processes with fork() ? Moreover, we also need to pass running transactions to children. I wonder libpq is designed for such usage. As far as I know you can inherit sockets to a child program, as long as you make sure that after the fork only one of them, parent or child, uses the socket; the other one should close it. But this wouldn't be an issue with the above mentioned patch anyway. It might be better to remove Windows-specific codes from the first try. I doubt Windows message queue is the best API in such console-based application. I hope we could use the same implementation for all platforms for inter-process/thread communication. Windows doesn't support pipes, but offers message queues to exchange messages. Parallel pg_dump only exchanges messages in the form of DUMP 39209 or RESTORE OK 48 23 93, it doesn't exchange any large chunks of binary data, just these small textual messages. The messages also stay within the same process, they are just sent between the different threads. The Windows part worked just fine when I tested it last time. Do you have any other technology in mind that you think is better suited? Joachim
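The control traffic described above is small enough that any byte-stream transport works. A hedged sketch of the master/worker exchange over ordinary Unix pipes, with message strings modeled on the ones quoted in the mail (this is illustrative, not the actual pg_dump implementation, which uses pipes on Unix and message queues between threads on Windows):

```python
import os

# Two one-directional pipes: master -> worker for commands,
# worker -> master for results. The protocol is short text messages
# like "DUMP 39209", never bulk table data.
cmd_r, cmd_w = os.pipe()
res_r, res_w = os.pipe()

pid = os.fork()
if pid == 0:                               # worker
    os.close(cmd_w)
    os.close(res_r)
    cmd = os.read(cmd_r, 64).decode()      # e.g. "DUMP 39209"
    verb, oid = cmd.split()
    # ... the worker would dump the object with this OID here ...
    os.write(res_w, "{} OK {}".format(verb, oid).encode())
    os._exit(0)
else:                                      # master
    os.close(cmd_r)
    os.close(res_w)
    os.write(cmd_w, b"DUMP 39209")         # hand out a work item
    reply = os.read(res_r, 64).decode()    # collect the status report
    os.waitpid(pid, 0)
    assert reply == "DUMP OK 39209"
```

Because only small textual messages cross the channel, the transport choice (pipe vs. message queue) is an implementation detail, which is the point made in the mail.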
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas robertmh...@gmail.com wrote: The parallel pg_dump portion of this patch (i.e. the still-uncommitted part) no longer applies. Please rebase. Here is a rebased version with some minor changes as well. I haven't tested it on Windows now but will do so as soon as the Unix part has been reviewed. Joachim parallel_pg_dump.patch.gz
Re: [HACKERS] Snapshot synchronization, again...
Here is a new version of the patch incorporating most of Noah's suggestions. The patch now also adds documentation. Since I couldn't really find a suitable section to document the two new functions, I added a new one for now. Any better ideas where it should go? On Thu, Jan 20, 2011 at 1:37 AM, Noah Misch n...@leadboat.com wrote: Just to clarify, I was envisioning something like: typedef struct { bool valid; char value[16]; } snapshotChksum; I don't object to the way you've done it, but someone else might not like the extra marshalling between text and binary. Your call. I didn't like it in the beginning but later on I adopted your proposal. I have also changed the locking to be more natural. Even though we don't really need it, I am now grabbing shared ProcArrayLock for any reads of shared memory and exclusive lock for writes. Of course no additional lock is taken if the feature is not used. You're right. Then consider VALUES (pg_import_snapshot('...'), (SELECT count(*) from t)) at READ COMMITTED. It works roughly as I'd guess; the subquery uses the imported snapshot. If I flip the two expressions and do VALUES ((SELECT count(*) from t), pg_import_snapshot('...')), the subquery uses the normal snapshot. That makes sense, but I can't really see a use case for it. A warning, like your code has today, seems appropriate. Yeah, that would do what you wanted to illustrate but it truly cannot be considered the standard use case :-) Is it valid to scribble directly on snapshots like this? I figured that previously executed code still has references pointing to the snapshots and so we cannot just push a new snapshot on top but really need to change the memory where they are pointing to. Okay. Had no special reason to believe otherwise, just didn't know. This is one part where I'd like someone more experienced than me to look into it. Thanks; that was handy. 
One thing I noticed is that the second SELECT * FROM kidseen yields zero rows instead of yielding ERROR: relation kidseen does not exist. I changed things around per the attached test-drop.pl, such that the table is already gone in the ordinary snapshot, but still visible to the imported snapshot. Note how the pg_class row is visible, but an actual attempt to query the table fails. Must some kind of syscache invalidation follow the snapshot alteration to make this behave normally? As discussed with Noah off-list this is just not an MVCC safe operation. You could hit on this in a regular SERIALIZABLE transaction as well: Somebody else can delete a table and you won't be able to access it anymore. This is also the reason why pg_dump in the beginning gets a shared lock on every table that it will dump later on. Thanks for the review Noah, Joachim diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index ed2039c..4169594 100644 *** a/doc/src/sgml/func.sgml --- b/doc/src/sgml/func.sgml *** FOR EACH ROW EXECUTE PROCEDURE suppress_ *** 14761,14764 --- 14761,14860 xref linkend=SQL-CREATETRIGGER. /para /sect1 + + sect1 id=functions-snapshotsync +titleSnapshot Synchronization Functions/title + +indexterm + primarypg_export_snapshot/primary +/indexterm +indexterm + primarypg_import_snapshot/primary +/indexterm + +para + productnamePostgreSQL/ allows different sessions to synchronize their + snapshots. A database snapshot determines which data is visible to + the client that is using this snapshot. Synchronized snapshots are necessary if + two clients need to see the same content in the database. If these two clients + just connected to the database and opened their transactions, then they could + never be sure that there was no data modification right between both + connections. To solve this, productnamePostgreSQL/ offers the two functions + functionpg_export_snapshot/ and functionpg_import_snapshot/. 
+/para +para + The idea is that one client retrieves the information about the snapshot that it + is currently using and then the second client passes this information back to + the server. The database server will then modify the second client's snapshot + to match the snapshot of the first and from then on both can see the identical + data in the database. +/para +para + Note that a snapshot can only be imported as long as the transaction that + exported it originally is held open. Also note that even after the + synchronization both clients still run their own independent transactions. + As a consequence, even though synchronized with respect to reading pre-existing + data, both transactions won't be able to see each other's uncommitted data. +/para +table id=functions-snapshot-synchronization + titleSnapshot Synchronization Functions/title + tgroup cols=3 + thead + rowentryName/entry entryReturn
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: It's part of the overall idea to make sure files are not inadvertently exchanged between different backups and that a file is not truncated. In the future I'd also like to add a checksum to the TOC so that a backup can be checked for integrity. This will cost performance but with the parallel backup it can be distributed to several processors. Ok. I'm going to leave out the filesize. I can see some value in that, and the CRC, but I don't want to add stuff that's not used at this point. Okay. The header is there to identify a file, it contains the header that every other pgdump file contains, including the internal version number and the unique backup id. The tar format doesn't support compression so going from one to the other would only work for an uncompressed archive and special care must be taken to get the order of the tar file right. Hmm, tar format doesn't support compression, but looks like the file format issue has been thought of already: there's still code there to add .gz suffix for compressed files. How about adopting that convention in the directory format too? That would make an uncompressed directory format compatible with the tar format. So what you could do is dump in the tar format, untar and restore in the directory format. I see that this sounds nice but still I am not sure why someone would dump to the tar format in the first place. But you still cannot go back from the directory archive to the tar archive because the standard command line tar will not respect the order of the objects that pg_restore expects in a tar format, right? That seems pretty attractive anyway, because you can then dump to a directory, and manually gzip the data files later. The command line gzip will probably add its own header to the file that pg_restore would need to strip off... 
This is a valid use case for people who are concerned with a fast dump; usually they would dump uncompressed and later compress the archive. However once we have parallel pg_dump, this advantage vanishes. Now that we have an API for compression in compress_io.c, it probably wouldn't be very hard to implement the missing compression support to tar format either. True, but the question of the advantage of the tar format remains :-) A tar archive has the advantage that you can postprocess the dump data with other tools but for this we could also add an option that gives you only the data part of a dump file (and uncompresses it at the same time if compressed). Once we have that however, the question is what anybody would then still want to use the tar format for... I don't know how popular it'll be in practice, but it seems very nice to me if you can do things like parallel pg_dump in directory format first, and then tar it up to a file for archival. Yes, but you cannot pg_restore the archive then if it was created with standard tar, right? Joachim
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Here are the latest patches all of them also rebased to current HEAD. Will update the commitfest app as well. What's the idea of storing the file sizes in the toc file? It looks like it's not used for anything. It's part of the overall idea to make sure files are not inadvertently exchanged between different backups and that a file is not truncated. In the future I'd also like to add a checksum to the TOC so that a backup can be checked for integrity. This will cost performance but with the parallel backup it can be distributed to several processors. It would be nice to have this format match the tar format. At the moment, there's a couple of cosmetic differences: * TOC file is called TOC, instead of toc.dat * blobs TOC file is called BLOBS.TOC instead of blobs.toc * each blob is stored as blobs/oid.dat, instead of blob_oid.dat That can be done easily... The only significant difference is that in the directory archive format, each data file has a header in the beginning. What are the benefits of the data file header? Would it be better to leave it out, so that the format would be identical to the tar format? You could then just tar up the directory to get a tar archive, or vice versa. The header is there to identify a file, it contains the header that every other pgdump file contains, including the internal version number and the unique backup id. The tar format doesn't support compression so going from one to the other would only work for an uncompressed archive and special care must be taken to get the order of the tar file right. If you want to drop the header altogether, fine with me but if it's just for the tar - directory conversion, then I am failing to see what the use case of that would be. 
A tar archive has the advantage that you can postprocess the dump data with other tools but for this we could also add an option that gives you only the data part of a dump file (and uncompresses it at the same time if compressed). Once we have that however, the question is what anybody would then still want to use the tar format for... Joachim
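The purpose of the per-file header debated above (identifying a file as belonging to a particular backup, so files are not inadvertently exchanged between backups) can be illustrated with a toy reader/writer. The magic string, version byte and 16-byte backup id below are made up for the sketch; they are not the real pg_dump header layout:

```python
import os
import struct
import tempfile

MAGIC = b"PGDMP"   # hypothetical magic tag, not the actual on-disk format
VERSION = 1

def write_member(path, backup_id, payload):
    """Prefix each data file with magic, version and the backup's unique id."""
    with open(path, "wb") as f:
        f.write(MAGIC + struct.pack("!B16s", VERSION, backup_id))
        f.write(payload)

def read_member(path, expected_backup_id):
    """Refuse files with a damaged header or from a different backup."""
    with open(path, "rb") as f:
        header = f.read(len(MAGIC) + 17)
        if not header.startswith(MAGIC):
            raise ValueError("not a dump file")
        version, backup_id = struct.unpack("!B16s", header[len(MAGIC):])
        if backup_id != expected_backup_id:
            raise ValueError("file belongs to a different backup")
        return f.read()

backup_id = os.urandom(16)
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "3309.dat")
    write_member(p, backup_id, b"table data")
    assert read_member(p, backup_id) == b"table data"
    try:
        read_member(p, os.urandom(16))     # member of some other backup
        raise AssertionError("should have been rejected")
    except ValueError:
        pass
```

The trade-off discussed in the thread is visible here: the header makes mixed-up files detectable, but it also makes the data files differ from plain tar members.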
Re: [HACKERS] Snapshot synchronization, again...
Noah, thank you for this excellent review. I have adopted most (almost all) of your suggestions (except for the documentation which is still missing). Please check the updated attached patch for details. On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch n...@leadboat.com wrote: is a valid md5 message digest. To avoid always accepting a snapshot with that digest, we would need to separately store a flag indicating whether a given syncSnapshotChksums member is currently valid. Maybe that hole is acceptable, though. In the end I decided to store md5 checksums as printable characters in shmem. We now need a few more bytes for each checksum but as soon as a byte is NUL we know that it is not a valid checksum. + + if (!XactReadOnly) + elog(WARNING, A snapshot exporting function should be readonly.); There are legitimate use cases for copying the snapshot of a read-write transaction. Consider a task like calculating some summaries for insert into a table. To speed this up, you might notionally partition the source data and have each of several workers calculate each partition. I'd vote for removing this warning entirely. Warning removed, adding this fact to the documentation only is fine with me. InvalidBackendId is -1. Is InvalidOid (0) in fact also invalid as a backend ID? Yes, there is a check in utils/init/postinit.c that makes sure that 0 is never a valid backendId. But anyway, I have now made this more explicit by adding a dedicated parse routine for ints. + while ((xid = parseXactFromText(s, xip:)) != InvalidTransactionId) + { + if (xid == GetTopTransactionIdIfAny()) + continue; + snapshot->xip[i++] = xid; This works on a virgin GetSnapshotData() snapshot, which always has enough space (sized according to max_connections). A snapshot from CopySnapshot(), however, has just enough space for the original usage. I get a SIGSEGV at this line with the attached test script. 
If I change things around so xip starts non-NULL, I get messages like these as the code scribbles past the end of the allocation: This has been fixed. xip/subxip are now allocated separately if necessary. Though unlikely, the other backend may not even exist by now. Indeed, that could have happened and RecentGlobalXmin advanced since we compared the checksum. Not sure what the right locking is here, but something is needed. Good catch. What I have done now is a second check at the end of ImportSnapshot(). At that point we have set up the new snapshot and adjusted MyProc-xmin. If we succeed, then we are fine. If not we throw an error and invalidate the whole transaction. + * Instead we must check to not go forward (we might have opened a cursor + * in this transaction and still have its snapshot registered) + */ I'm thinking this case should yield an ERROR. Otherwise, our net result would be to silently adopt a snapshot that is neither our old self nor the target. Maybe there's a use case for that, but none comes to my mind. This can happen when you do: DECLARE foo CURSOR FOR SELECT * FROM bar; import snapshot... FETCH 1 FROM foo; I guess the use case for pg_import_snapshot from READ COMMITTED would be something like DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;. First off, would that work (stuff use the imported snapshot)? If it does work, we should take the precedent of SET LOCAL and issue no warning. If it doesn't work, then we should document that the snapshot does take effect until the next command and probably keep this warning or something similar. No, this will also give you a new snapshot for every query, no matter if it is executed within or outside of a DO-Block. 
+ Datum + pg_import_snapshot(PG_FUNCTION_ARGS) + { + bytea *snapshotData = PG_GETARG_BYTEA_P(0); + bool ret = true; + + if (ActiveSnapshotSet()) + ret = ImportSnapshot(snapshotData, GetActiveSnapshot()); + + ret = ImportSnapshot(snapshotData, GetTransactionSnapshot()); Is it valid to scribble directly on snapshots like this? I figured that previously executed code still has references pointing to the snapshots and so we cannot just push a new snapshot on top but really need to change the memory where they are pointing to. I am also adding a test script that shows the difference of READ COMMITTED and SERIALIZABLE in an importing transaction in a DO-Block. It's based on the script you sent. So thanks again and I'm looking forward to your next review... :-) Joachim diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 2dbac56..c96942a 100644 *** a/src/backend/storage/ipc/ipci.c --- b/src/backend/storage/ipc/ipci.c *** CreateSharedMemoryAndSemaphores(bool mak *** 124,129 --- 124,130 size = add_size(size, BTreeShmemSize()); size = add_size(size, SyncScanShmemSize());
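The parsing loop Noah flagged above (walk the textual snapshot, skip the importer's own xid, collect the rest into xip) can be sketched with a dynamically sized list, which is also the gist of the fix: size xip/subxip to fit the input instead of reusing a buffer sized for the original snapshot. The field syntax below mirrors the xmin:/xmax:/xip: form visible in the quoted code; the function name is illustrative:

```python
def parse_snapshot(text, own_xid=None):
    """Parse a textual snapshot, skipping the importing backend's own xid.

    Collecting xip entries into a growable list corresponds to allocating
    the xip array for the input, rather than writing past the end of a
    fixed-size buffer (the CopySnapshot() SIGSEGV discussed above).
    """
    snap = {"xmin": None, "xmax": None, "xip": []}
    for field in text.split():
        key, _, value = field.partition(":")
        if key in ("xmin", "xmax"):
            snap[key] = int(value)
        elif key == "xip" and int(value) != own_xid:
            snap["xip"].append(int(value))
    return snap

snap = parse_snapshot("xmin:100 xmax:110 xip:101 xip:104 xip:108", own_xid=104)
assert snap["xmin"] == 100 and snap["xmax"] == 110
assert snap["xip"] == [101, 108]    # own xid 104 is taken out of the array
```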
Re: [HACKERS] Snapshot synchronization, again...
On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland j...@mcknight.de wrote: What I am proposing now is the following: We return snapshot information as a chunk of data to the client. At the same time however, we set a checksum in shared memory to protect against modification of the snapshot. A publishing backend can revoke its snapshot by deleting the checksum and a backend that is asked to install a snapshot can verify that the snapshot is correct and current by calculating the checksum and comparing it with the one in shared memory. So here's the patch implementing this idea. I named the user visible functions pg_export_snapshot() and pg_import_snapshot(). A user starts a transaction and calls pg_export_snapshot() to get a chunk of bytea data. In a different connection he opens a transaction in isolation level serializable and passes that chunk of data into pg_import_snapshot(). Now subsequent queries of the second connection see the snapshot that was current when pg_export_snapshot() has been executed. In case both transactions are in isolation level serializable then both see the same data from this moment on (this is the case of pg_dump). There are most probably a few loose ends and someone who is more familiar to this area than me needs to look into it but at least everybody should be happy now with the overall approach. 
These are the implementation details and restrictions of the patch:

The exporting transaction
- should be read-only (to discourage people from expecting that writes of the exporting transaction can be seen by the importing transaction)
- must not be a subtransaction (we don't add subxips of our own transaction to the snapshot, so importing the snapshot later would result in missing subxips)
- adds its own xid (if any) to the xip array

The importing transaction
- will not import a snapshot of the same backend (even though it would probably work)
- will not import a snapshot of a different database in the cluster
- should be isolation level serializable
- must not be a subtransaction (can we completely roll back on subxact rollback?)
- leaves curcid as is; otherwise effects of previous commands would get lost and magically appear later when curcid increases
- applies xmin, xmax, xip, subxip values of the exported snapshot to GetActiveSnapshot() and GetTransactionSnapshot()
- takes itself out of the xip array
- updates MyProc->xmin, but only moves it backwards, never forwards

The snapshot is invalidated on transaction commit/rollback as well as when doing PREPARE TRANSACTION. Comments welcome.
Joachim

diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 95beba8..c24150f 100644
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
*** CreateSharedMemoryAndSemaphores(bool mak
*** 124,129 ****
--- 124,130 ----
      size = add_size(size, BTreeShmemSize());
      size = add_size(size, SyncScanShmemSize());
      size = add_size(size, AsyncShmemSize());
+     size = add_size(size, SyncSnapshotShmemSize());
  #ifdef EXEC_BACKEND
      size = add_size(size, ShmemBackendArraySize());
  #endif
*** CreateSharedMemoryAndSemaphores(bool mak
*** 228,233 ****
--- 229,235 ----
      BTreeShmemInit();
      SyncScanShmemInit();
      AsyncShmemInit();
+     SyncSnapshotInit();
  #ifdef EXEC_BACKEND

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 980996e..e851bcd 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*** 50,60 ****
--- 50,62 ----
  #include "access/transam.h"
  #include "access/xact.h"
  #include "access/twophase.h"
+ #include "libpq/md5.h"
  #include "miscadmin.h"
  #include "storage/procarray.h"
  #include "storage/spin.h"
  #include "storage/standby.h"
  #include "utils/builtins.h"
+ #include "utils/bytea.h"
  #include "utils/snapmgr.h"
*** static int KnownAssignedXidsGetAndSetXmi
*** 159,164 ****
--- 161,170 ----
  static TransactionId KnownAssignedXidsGetOldestXmin(void);
  static void KnownAssignedXidsDisplay(int trace_level);
+ typedef char snapshotChksum[16];
+ static snapshotChksum *syncSnapshotChksums;
+ static Snapshot exportedSnapshot;
  /*
   * Report shared-memory space needed by CreateSharedProcArray.
   */
*** KnownAssignedXidsDisplay(int trace_level
*** 3065,3067 ****
--- 3071,3335 ----
      pfree(buf.data);
  }
+
+ /*
+  * Report space needed for our shared memory area, which is basically an
+  * md5 checksum per connection.
+  */
+ Size
+ SyncSnapshotShmemSize(void)
+ {
+     return PROCARRAY_MAXPROCS * sizeof(snapshotChksum);
+ }
+
+ void
+ SyncSnapshotInit(void)
+ {
+     Size    size;
+     bool    found;
+     int     i;
+
+     size = SyncSnapshotShmemSize();
+
+     syncSnapshotChksums = (snapshotChksum *) ShmemInitStruct("SyncSnapshotChksums",
+                                                              size, &found);
+     if (!found)
+         for (i = 0; i
[attachment truncated]
Re: [HACKERS] Snapshot synchronization, again...
On Fri, Dec 31, 2010 at 8:28 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: A backend can have any number of snapshots registered, and those don't allow GlobalXmin to advance. Consider an open cursor, for example. Even if the rest of the transaction is read committed, the snapshot registered by the open cursor still holds back GlobalXmin. My (handwavy) idea is that whenever the transaction calls pg_publish_snapshot(), said snapshot is registered, which makes it safe to use even if the transaction continues to operate and obtain newer snapshots. Cool, even better that this is taken care of already :-) Thanks, Joachim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Snapshot synchronization, again...
The snapshot synchronization discussion from the parallel pg_dump patch somehow ended without a clear way to go forward. Let me sum up what has been brought up and propose a short- and long-term solution.

Summary: Passing snapshot sync information can be done either:

a) by returning complete snapshot information from the backend to the client so that the client can pass it along to a different backend, or

b) by returning only a unique identifier to the client and storing the actual snapshot data somewhere on the server side.

Advantage of a: No memory is used in the backend and no memory needs to get cleaned up. It is also theoretically possible that we could forward that data to a hot standby server and do e.g. a dump partially on the master server and partially on the hot standby server, or among several hot standby servers.

Disadvantage of a: The snapshot must be validated to make sure that its information is still current, and it might be difficult to cover all cases of this validation. A client can access not only exactly a published snapshot, but just about any snapshot that fits and passes the validation checks (this is more a disadvantage than an advantage because it allows one to see a database state that never existed in reality).

Advantage of b: No validation is necessary; as soon as the transaction that publishes the snapshot loses that snapshot, it will also revoke the snapshot information (either by removing a temp file or deleting it from shared memory).

Disadvantage of b: It doesn't allow a snapshot to be installed on a different server. It requires an open serializable transaction to hold the snapshot.

What I am proposing now is the following: We return snapshot information as a chunk of data to the client. At the same time however, we set a checksum in shared memory to protect against modification of the snapshot.
A publishing backend can revoke its snapshot by deleting the checksum, and a backend that is asked to install a snapshot can verify that the snapshot is correct and current by calculating the checksum and comparing it with the one in shared memory.

This only costs us a few bytes (one checksum times max_connections) in shared memory, and apart from resetting the checksum it has no cleanup and verification issues. Note that we are also free to change the internal format of the chunk of data we return whenever we like, so we are free to enhance this feature in the future, transparently to the client.

Thoughts?

Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Thu, Dec 30, 2010 at 9:40 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: Disadvantage of b: It doesn't allow a snapshot to be installed on a different server. It requires a serializable open transaction to hold the snapshot. Why does it require a serializable transaction? You could simply register the snapshot in any transaction. (Of course, the net effect would be pretty similar to a serializable transaction).

I am not assuming that the publishing transaction blocks until its snapshot is picked up. A read committed transaction would get a new snapshot for each new query, so the published snapshot is no longer represented by an actual backend until it is picked up by one. Since nobody is holding back xmin/GlobalXmin, eventually vacuum would remove tuples that the published-but-not-yet-picked-up snapshot should still be able to see, no?

Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Thu, Dec 30, 2010 at 9:49 AM, Florian Pflug f...@phlo.org wrote: On Dec30, 2010, at 13:31 , Joachim Wieland wrote: We return snapshot information as a chunk of data to the client. At the same time however, we set a checksum in shared memory to protect against modification of the snapshot. A publishing backend can revoke its snapshot by deleting the checksum and a backend that is asked to install a snapshot can verify that the snapshot is correct and current by calculating the checksum and comparing it with the one in shared memory. We'd still have to stream these checksums to the standbys though, or would they be exempt from the checksum checks?

I am not talking about having synchronized snapshots among standby servers at all. I am only proposing a client API that will work for this future idea as well.

I still wonder whether these checks are worth the complexity. I believe we'd only allow snapshot modifications for read-only queries anyway, so what point is there in preventing clients from setting broken snapshots? What's the use case for it?

Unless somebody comes up with a reasonable use case for it, let's aim for the robust version.

Joachim
Re: [HACKERS] page compression
On Tue, Dec 28, 2010 at 10:10 AM, Andy Colson a...@squeakycode.net wrote: I know it's been discussed before, and one big problem is license and patent problems. Would this project be a problem: http://oldhome.schmorp.de/marc/liblzf.html

It looks like even liblzf is not going to be accepted. I proposed linking pg_dump against liblzf only if it is available, and somehow failed; see: http://archives.postgresql.org/pgsql-hackers/2010-11/msg00824.php

Remember that PostgreSQL has toast tables to compress large values and store them externally, so it still has to be proven that page compression has the same benefit for PostgreSQL as for other databases. Ironically, we also use an LZ compression algorithm for toast compression (defined in pg_lzcompress.c). I am still failing to understand why linking against liblzf would bring us deeper into the compression patents minefield than we already are by hardwiring and shipping this other algorithm in pg_lzcompress.c.

Joachim
Re: [HACKERS] directory archive format for pg_dump
On Thu, Dec 16, 2010 at 12:48 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: As soon as we have parallel pg_dump, the next big thing is going to be parallel dump of the same table using multiple processes. Perhaps we should prepare for that in the directory archive format, by allowing the data of a single table to be split into multiple files. That way parallel pg_dump is simple, you just split the table in chunks of roughly the same size, say 10GB each, and launch a process for each chunk, writing to a separate file.

How exactly would you "just split the table in chunks of roughly the same size"? Which queries should pg_dump send to the backend? If it just sends a bunch of WHERE queries, the server would still scan the same data several times, since each pg_dump client would trigger a seqscan over the full table. Ideally pg_dump should be able to query for all data in only one relation segment, so that each segment is scanned by only one backend process. However, this requires backend support, and we would be sending queries that we'd not want clients other than pg_dump to send... If you were thinking about WHERE queries to get equally sized partitions, how would we deal with unindexed and/or non-numerical data in a large table?

Joachim
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Tue, Dec 7, 2010 at 3:42 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Ok, I've committed this patch now. I can confirm that I could continue replaying the logfiles on the standby host with this patch. Thanks a lot, Joachim
Re: [HACKERS] directory archive format for pg_dump
On Thu, Dec 2, 2010 at 2:52 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Ok, committed, with some small cleanup since the last patch I posted. Could you update the directory-format patch on top of the committed version, please? Thanks for committing the first part. Here is the updated and rebased directory-format patch. Joachim

[attachment: pg_dump-directory-rebased.diff.gz (GNU Zip compressed data)]
Re: [HACKERS] WIP patch for parallel pg_dump
On Sun, Dec 5, 2010 at 9:27 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Dec 5, 2010 at 9:04 PM, Andrew Dunstan and...@dunslane.net wrote: Why not just say give me the snapshot currently held by process ? And please, not temp files if possible. As far as I'm aware, the full snapshot doesn't normally exist in shared memory, hence the need for publication of some sort. We could dedicate a shared memory region for publication but then you have to decide how many slots to allocate, and any number you pick will be too many for some people and not enough for others, not to mention that shared memory is a fairly precious resource.

So here is a patch that I have been playing with in the past; I did it a while back, and thanks go to Koichi Suzuki for his helpful comments. I have not published it earlier because I haven't worked on it recently, and from the discussion that I brought up in March I got the feeling that people are fine with having a first version of parallel dump without synchronized snapshots. I am not really sure that what the patch does is sufficient, nor that it does it in the right way, but I hope that it can serve as a basis to collect ideas (and doubts).

My idea is pretty similar to Robert's about publishing snapshots and subscribing to them; the patch even uses these words. Basically the idea is that a transaction in isolation level serializable can publish a snapshot, and as long as this transaction is alive, its snapshot can be adopted by other transactions. Requiring the publishing transaction to be serializable guarantees that the copy of the snapshot in shared memory is always current. When the transaction ends, the copy of the snapshot is also invalidated and cannot be adopted anymore. So instead of doing explicit checks, the patch aims at always having a reference transaction around that guarantees the validity of the snapshot information in shared memory.
The patch currently creates a new area in shared memory to store snapshot information, but we can certainly discuss this... I had a GUC in mind that can control the number of available slots, similar to max_prepared_transactions. Snapshot information can become quite large, especially with a high number of max_connections.

Known limitations: the patch is completely lacking awareness of prepared transactions and doesn't check if both backends belong to the same user.

Joachim

diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 95beba8..c24150f 100644
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
*** CreateSharedMemoryAndSemaphores(bool mak
*** 124,129 ****
--- 124,130 ----
      size = add_size(size, BTreeShmemSize());
      size = add_size(size, SyncScanShmemSize());
      size = add_size(size, AsyncShmemSize());
+     size = add_size(size, SyncSnapshotShmemSize());
  #ifdef EXEC_BACKEND
      size = add_size(size, ShmemBackendArraySize());
  #endif
*** CreateSharedMemoryAndSemaphores(bool mak
*** 228,233 ****
--- 229,235 ----
      BTreeShmemInit();
      SyncScanShmemInit();
      AsyncShmemInit();
+     SyncSnapshotInit();
  #ifdef EXEC_BACKEND

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 6e7a6db..00522fb 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*** typedef struct ProcArrayStruct
*** 91,96 ****
--- 91,111 ----
  static ProcArrayStruct *procArray;

+ /* this should be a GUC later... */
+ #define MAX_SYNC_SNAPSHOT_SETS 4
+ typedef struct
+ {
+     SnapshotData ssd;
+     char         name[NAMEDATALEN];
+     BackendId    backendId;
+     Oid          databaseId;
+ } NamedSnapshotData;
+
+ typedef NamedSnapshotData *NamedSnapshot;
+
+ static NamedSnapshot syncSnapshots;

  /*
   * Bookkeeping for tracking emulated transactions in recovery
   */
*** static int KnownAssignedXidsGetAndSetXmi
*** 159,164 ****
--- 174,182 ----
  static TransactionId KnownAssignedXidsGetOldestXmin(void);
  static void KnownAssignedXidsDisplay(int trace_level);
+ static bool DeleteSyncSnapshot(const char *name);
+ static bool snapshotPublished = false; /* true if we have published at least one snapshot */
*** ProcArrayRemove(PGPROC *proc, Transactio
*** 350,355 ****
--- 368,379 ----
  void
  ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
  {
+     if (snapshotPublished)
+     {
+         DeleteSyncSnapshot(NULL);
+         snapshotPublished = false;
+     }
+
      if (TransactionIdIsValid(latestXid))
      {
          /*
*** KnownAssignedXidsDisplay(int trace_level
*** 3104,3106 ****
--- 3132,3374 ----
      pfree(buf.data);
  }
+
+ /*
+  * Report space needed for our shared memory area.
+  *
+  * Memory is structured as follows:
+  *
+  * NamedSnapshotData[0]
+  * NamedSnapshotData[1]
+  * NamedSnapshotData[2]
+  * Xids for NamedSnapshotData[0]
+  *
[attachment truncated]
Re: [HACKERS] WIP patch for parallel pg_dump
On Thu, Dec 2, 2010 at 6:19 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I don't see the point of the sort-by-relpages code. The order the objects are dumped should be irrelevant, as long as you obey the restrictions dictated by dependencies. Or is it only needed for the multiple-target-dirs feature? Frankly I don't see the point of that, so it would be good to cull it out at least in this first stage.

A guy called Dimitri Fontaine actually proposed the several-directories feature here, and other people liked the idea. http://archives.postgresql.org/pgsql-hackers/2008-02/msg01061.php :-) The code doesn't change much with or without it, and if people are no longer in favour of it, I have no problem with taking it out. As Dimitri has already pointed out, the relpages sorting thing is there to start with the largest table(s) first.

Joachim
Re: [HACKERS] WIP patch for parallel pg_dump
On Thu, Dec 2, 2010 at 12:56 PM, Josh Berkus j...@agliodbs.com wrote: Now, if only I could think of some way to write a parallel dump to a set of pipes, I'd be in heaven. What exactly are you trying to accomplish with the pipes? Joachim
Re: [HACKERS] WIP patch for parallel pg_dump
On Thu, Dec 2, 2010 at 9:33 PM, Tom Lane t...@sss.pgh.pa.us wrote: In particular, this issue *has* been discussed before, and there was a consensus that preserving dump consistency was a requirement. I don't think that Joachim gets to bypass that decision just by submitting a patch that ignores it. I am not trying to bypass anything here :) Regarding the locking issue I probably haven't done sufficient research, at least I managed to miss the emails that mentioned it. Anyway, that seems to be solved now fortunately, I'm going to implement your idea over the weekend. Regarding snapshot cloning and dump consistency, I brought this up already several months ago and asked if the feature is considered useful even without snapshot cloning. And actually it was you who motivated me to work on it even without having snapshot consistency... http://archives.postgresql.org/pgsql-hackers/2010-03/msg01181.php In my patch pg_dump emits a warning when called with -j, if you feel better with an extra option --i-know-that-i-have-no-synchronized-snapshots, fine with me :-) In the end we provide a tool with limitations, it might not serve all use cases but there are use cases that would benefit a lot. I personally think this is better than to provide no tool at all... Joachim
Re: [HACKERS] directory archive format for pg_dump
On Wed, Dec 1, 2010 at 9:05 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Forgot attachment. This is also available in the above git repo. I have quickly checked your modifications, on the one hand I like the reduction of functions, I would have said that we have AH around all the time and so we could just allocate once and stuff it all into ctx-cs and reuse the buffers for every object, but re-allocating them for every (dumpable) object should be fine as well. Regarding the function pointers that you removed, you are now putting back in what Dimitri wanted me to take out, namely switch/case instructions for the algorithms and then #ifdefs for every algorithm. It's not too many now since we have taken out LZF. Well, I can live with both ways. There is one thing however that I am not in favor of, which is the removal of the sizeHint parameter for the read functions. The reason for this parameter is not very clear now without LZF but I have tried to put in a few comments to explain the situation (which you have taken out as well :-) ). The point is that zlib is a stream based compression algorithm, you just stuff data in and from time to time you get data out and in the end you explicitly flush the compressor. The read function can just return as many bytes as it wants and we can just hand it all over to zlib. Other compression algorithms however are block based and first write a block header that contains the information on the next data block, including uncompressed and compressed sizes. Now with the sizeHint parameter I used, the compressor could tell the read function that it just wants to read the fixed size header (6 bytes IIRC). In the header it would look up the compressed size for the next block and would then ask the read function to get exactly this amount of data, decompress it and go on with the next block, and so forth... 
Of course you can possibly do that memory management inside the compressor with an extra buffer holding what you got in excess but it's a pain. If you removed that part on purpose on the grounds that there is no block based compression algorithm in core and probably never will be, then that's okay :-) Joachim
Re: [HACKERS] Suggested easy TODO: pg_dump --from-list
On Wed, Nov 24, 2010 at 1:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: Nope ... those strings are just helpful comments, they aren't really guaranteed to be unique identifiers. In any case, it seems unlikely that a user could expect to get the more complicated cases exactly right other than by consulting pg_dump | pg_restore -l output. Which makes the use-case kind of dubious to me. In which case would the catalogId, i.e. (tableoid, oid) not be unique? Or do you rather mean that it does not necessarily refer to the same object if that object got somehow recreated or that it could be different on different installations of the same database? Thanks, Joachim
Re: [HACKERS] Suggested easy TODO: pg_dump --from-list
On Wed, Nov 24, 2010 at 9:38 AM, Andrew Dunstan and...@dunslane.net wrote: It would be unique, but a pain in the neck for users to get. Robert's idea will have more traction with users.

Whatever approach we use, we need to think about the use case where 1% of the objects should be dumped, but we should also make sure that you can more or less easily dump 99% of the objects. Robert's use case is the 1% use case. Especially for the 99% case however, pg_dump needs to create a full list of all available objects, in whatever format, as a proposal, so that you could just edit this list and delete what you don't want instead of writing such a list from scratch.

Joachim
Re: [HACKERS] Suggested easy TODO: pg_dump --from-list
On Tue, Nov 23, 2010 at 10:24 PM, Andrew Dunstan and...@dunslane.net wrote: Well, very little about pg_dump is very [E], IMNSHO. The question in my mind here is what format the list file will take. For example, how would we specify a function? Would we need to specify all the argument types (or at least the IN arguments)? It's not as easy as a list with pg_restore, which is just a list of TOC ids, and all the rest is just a comment in the list file. I certainly don't think we should put this on the list without at least having the idea fleshed out some more. I think the list should be generated by pg_dump itself in a first run, by building a complete TOC and then dumping a pg_restore -l like list format (without dumpIds) where the user just deletes the objects that he doesn't want to get dumped. The list wouldn't contain dumpIds, but catalogIds and those should be sufficiently unique and easy to parse and compare. Joachim
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 19.11.2010 23:46, Joachim Wieland wrote: FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, pArray->maxKnownAssignedXids: 6890 Hmm, that's a lot of entries in KnownAssignedXids. Can you recompile with WAL_DEBUG, and run the recovery again with wal_debug=on? That will print all the replayed WAL records, which is a lot of data, but it might give a hint what's going on.

Sure, but this gives me only one more line:

[...]
LOG: redo starts at 1F8/FC00E978
LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid 385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid 3829898/23
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 4587) exited with exit code 1
LOG: terminating any other active server processes

Joachim
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Sun, Nov 21, 2010 at 11:48 PM, Fujii Masao masao.fu...@gmail.com wrote: If you suspect a bug in Hot Standby, please set trace_recovery_messages = DEBUG2 in postgresql.conf and repeat the action. Always useful to know:
* max_connections
* current number of sessions
* whether we have two phase commits happening

The trace_recovery_messages parameter does not give more output. max_connections is set to 100. There have been no sessions on the standby itself, but a few on the primary database; I don't know how many, but probably not more than 10. The sessions there were doing quite some load however, among them Slony synchronization, since the hot standby master database was actually a Slony replica. max_prepared_transactions has not been changed from its default value of 0.

Joachim
[HACKERS] Hot Standby: too many KnownAssignedXids
Hi, I am seeing the following here on 9.0.1 on Linux x86-64:

LOG: redo starts at 1F8/FC00E978
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23

and this is the complete history: postgres was running as HS in foreground, Ctrl-C'ed it for a restart.

LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating walreceiver process due to administrator command
FATAL: terminating connection due to administrator command
LOG: shutting down
LOG: database system is shut down

Started it up again:

$ postgres -D /db/
LOG: database system was shut down in recovery at 2010-11-19 14:36:30 EST
LOG: entering standby mode
cp: cannot stat `/archive/000101F90001': No such file or directory
cp: cannot stat `/archive/000101F800FC': No such file or directory
LOG: redo starts at 1F8/FC00E978
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 30052) exited with exit code 1
LOG: terminating any other active server processes

(copied the log files over...)

./postgres -D /db/
LOG: database system was interrupted while in recovery at log time 2010-11-19 14:36:12 EST
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: restored log file 000101F90001 from archive
LOG: restored log file 000101F800FC from archive
LOG: redo starts at 1F8/FC00E978
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 31581) exited with exit code 1
LOG: terminating any other active server processes

Changing the line in the source code to give some more output gives me:

FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, pArray->maxKnownAssignedXids: 6890

I still have the server, if you want me to debug anything or send a patch against 9.0.1 that gives more output, just let me know.
Joachim
Re: [HACKERS] directory archive format for pg_dump
Hi Dimitri, thanks for reviewing my patch!

On Fri, Nov 19, 2010 at 2:44 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: I think I'd like to see a separate patch for the new compression support. Sorry about that, I realize that's extra work…

I guess it wouldn't be a very big deal, but I also doubt that it makes the review that much easier. Basically the compression refactoring patch would just touch pg_backup_custom.c (because that is where the libz compression is currently buried) and the two new compress_io.(c|h) files. Everything else is pretty much the directory stuff and sits on top of these changes.

And it could be about personal preferences, but the way you added the liblzf support strikes me as odd, with all those #ifdefs everywhere. Is it possible to have a specific file for each supported compression format, then some routing code in src/bin/pg_dump/compress_io.c?

Sure we could. But I wanted to wait with any fancy function pointer stuff until we have decided if we want to include the liblzf support at all. The #ifdefs might be a bit ugly, but in case we do not include liblzf support, it's the easiest way to take it out again. As written in my introduction, this patch is not really about liblzf; liblzf is just a proof of concept for factoring out the compression part, and I have included it so that people can use it and see how much speed improvement they get.

The routing code already exists but then the file is full of #ifdef sections to define the right supporting function when I think having compress_io_zlib and compress_io_lzf files would be better. Sure! I completely agree...

Then there's the bulk of the new dump format feature in the other part of the patch, namely src/bin/pg_dump/pg_backup_directory.c.
You have to update the copyright in the file header there, at least :) Well, not sure if we can just change the copyright notice, because in the end the structure was copied from one of the other files which all have the copyright notice in them, so my work is based on those other files... I'm hesitant as far as marking the patch Waiting on author to get it split. Joachim, what do you think? I will see if I can split it. Joachim
Re: [HACKERS] directory archive format for pg_dump
On Fri, Nov 19, 2010 at 11:53 PM, Tom Lane t...@sss.pgh.pa.us wrote: Dimitri Fontaine dimi...@2ndquadrant.fr writes: I think I'd like to see a separate patch for the new compression support. Sorry about that, I realize that's extra work… That part of the patch is likely to get rejected outright anyway, so I *strongly* recommend splitting it out. We have generally resisted adding random compression algorithms to pg_dump because of license and patent considerations, and I see no reason to suppose this one is going to pass muster. I was already anticipating that possibility and my initial patch description is along these lines. However, liblzf is BSD licensed so on the license side we should be fine. Regarding patents, your last comment was that you'd like to see if it's really worth it and so I have included support for lzf for anybody to go ahead and find that out. Will send an updated split-up patch this weekend (which would actually be four patches already...). Joachim
Re: [HACKERS] directory archive format for pg_dump
Hi Jose, 2010/11/19 José Arthur Benetasso Villanova jose.art...@gmail.com: The dir format generated in my database 60 files, with different sizes, and it looks very confusing. Is it possible to use the same trick as pigz and pbzip2, creating a concatenated file of streams? What pigz is parallelizing is the actual computation of the compressed data. The directory archive format however is a preparation for a parallel pg_dump, dumping several tables (especially large tables of course) in parallel via multiple database connections and multiple pg_dump frontends. The idea of multiplexing their output into one file has been rejected on the grounds that it would probably slow down the whole process. Nevertheless pigz could be implemented as an alternative compression algorithm and that way the custom and the directory archive format could use it, but here as well, license and patent questions might be in the way, even though it is based on libz. The md5.c and kwlookup.c reuse using a link doesn't look nice either. This way you need to compile twice, among other things, but I think that it's temporary, right? No, it isn't. md5.c is used in the same way by e.g. libpq and there are other examples for links in core, check out src/bin/psql for example. Joachim
Re: [HACKERS] host name support in pg_hba.conf
On Tue, Oct 5, 2010 at 3:41 PM, Peter Eisentraut pete...@gmx.net wrote: The reason why I think this is semi-important and not just cosmetic is that (for some reason that is not entirely clear to me) clients connecting to localhost end up appearing to the server as ::1 on a lot of machines I use which are not otherwise keen on IPv6, and it is a common mistake to forget to keep the pg_hba.conf entries for 127.0.0.1 and ::1 in sync. This is exactly what I am seeing here. However, contrary to your case the patch makes it even worse on my side. With the patch compiled in and a pg_hba.conf entry of "localhost", I cannot connect anymore with -h localhost; I get "no pg_hba.conf entry for host ::1". This is a mostly standard Ubuntu setup. Joachim
Re: [HACKERS] Patch for PKST timezone
On Wed, May 12, 2010 at 12:39 AM, Tom Lane t...@sss.pgh.pa.us wrote: Joachim Wieland j...@mcknight.de writes: Good we have found that inconsistency now, so let's add PKST. OK, done. BTW, I notice that PKT was labeled (not in zic), which is not the case, per this discussion. I seem to recall having noticed some others that seemed to be mislabeled the same way. What process did you use to compare this list to the zic files, and do we need to revisit it? I have used a modified version of zic.c that outputs the data while generating the binary timezone files. Generating the timezone offset files from that then included some scripts and some manual work. It seems that we should have an automated process for that, at least for checking against our current set. I'll see if I can come up with that. PKST for example was valid only for a single year in the past but in the newer timezone data it is now valid forever. Ideally we can run a script that tells us about such changes whenever we bring in new timezone data. Joachim
Re: [HACKERS] Patch for PKST timezone
On Tue, May 11, 2010 at 10:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I don't think we want to include all timezone names in the default config, timezone abbreviations aren't always unique for example. But we should include PKST because we already include PKT; it would be nasty for an application to work during winter, and stop working when the daylight saving time begins. I agree with everything Heikki says. As the original author of the timezone files I guess that I am to blame for not having included PKST in the first place. However I have the excuse that it was like that already back when the timezone information was still hardcoded, we had pkt but not pkst: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/datetime.c?rev=1.168;content-type=text%2Fx-cvsweb-markup Good we have found that inconsistency now, so let's add PKST. I also agree with not adding more than necessary to the default config set. Given that so few people complain about it we seem to have a good working default set already. Joachim
Re: [HACKERS] a faster compression algorithm for pg_dump
On Fri, Apr 9, 2010 at 5:51 AM, Greg Stark gsst...@mit.edu wrote: Linking against as an option isn't nearly as bad since the user compiling it can choose whether to include the restricted feature or not. That's what we do with readline. However it's not nearly as attractive when it restricts what file formats Postgres supports -- it means someone might generate backup dump files that they later discover they don't have a legal right to read and restore :( If we only linked against it, we'd leave it up to the user to weigh the risk as long as we are not aware of any such violation. Our top priority is to make sure that the project would not be harmed if one day such a patent showed up. If I understood you correctly, this is not an issue even if we included lzf, and even less so if we only link against it. The rest is about user education, and since we'd use lzf only in pg_dump and not for toasting, we could show a message in pg_dump if lzf is chosen to make the user aware of the possible issues. If we still cannot do this, then what I am asking is: what does the project need to be able to at least link against such a compression algorithm? Is it a list of 10, 20, 50 or more other projects using it, or is it a lawyer saying "there is no patent"? But then, how can we be sure that the lawyer is right? Or couldn't we include it even if we had both, because again, we couldn't be sure... ? Joachim
[HACKERS] a faster compression algorithm for pg_dump
I'd like to revive the discussion about offering another compression algorithm than zlib for at least pg_dump. There has been a previous discussion here: http://archives.postgresql.org/pgsql-performance/2009-08/msg00053.php and it ended without any real result. The results so far were:

- There exist BSD-licensed compression algorithms
- Nobody knows a patent that is in our way
- Nobody can confirm that no patent is in our way

I do see a very real demand for replacing zlib, which compresses quite well but is slow as hell. For pg_dump what people want is cheap compression; they usually prefer an algorithm that compresses less optimally but is really fast. One question that I do not yet see answered is: do we risk violating a patent even if we just link against a compression library, for example liblzf, without shipping the actual code? I have checked what other projects do, especially about liblzf which would be my favorite choice (BSD license, available for quite some time...) and there are other projects that actually ship the lzf code (I haven't found a project that just links to it). The most prominent projects are:

- KOffice (implements a derived version in koffice-2.1.2/libs/store/KoXmlReader.cpp)
- Virtual Box (ships it in vbox-ose-1.3.8/src/libs/liblzf-1.51)
- TuxOnIce (formerly known as suspend2 - linux kernel patch, ships it in the patch)

We have pg_lzcompress.c which implements the compression routines for the tuple toaster. Are we sure that we don't violate any patents with this algorithm? Joachim
Re: [HACKERS] message clarifications
On Sat, Apr 3, 2010 at 9:02 PM, Robert Haas robertmh...@gmail.com wrote: On Apr 3, 2010, at 11:13 AM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Eisentraut pete...@gmx.net writes: The following messages from the postgres catalog either appear to be internal errors that should be marked differently, or they are in my estimation unintelligible to users and should be rephrased. #: commands/async.c:1424 msgid "pg_notify queue is %.0f%% full" This one is probably my responsibility (the others all look like Simon's code). What do you not like about it, exactly? Perhaps it should be "NOTIFY queue is x% full"? I think maybe the question is why the user should care, or what they are expected to do about it? The user/administrator should make sure that all backends work through the list of pending notifications. He does this by making sure that there are no long-running or idle-in-transaction backends. Actually there is more information given via errdetail and errhint:

    ereport(WARNING,
            (errmsg("pg_notify queue is %.0f%% full", fillDegree * 100),
             (minPid != InvalidPid ?
              errdetail("PID %d is among the slowest backends.", minPid) : 0),
             (minPid != InvalidPid ?
              errhint("Cleanup can only proceed if this backend ends its current transaction.") : 0)));

Peter, if you consider the additional information given here, do you still see an issue? Joachim
Re: [HACKERS] five-key syscaches
On Mon, Mar 29, 2010 at 12:32 AM, Robert Haas robertmh...@gmail.com wrote: Per previous discussion, PFA a patch to change the maximum number of keys for a syscache from 4 to 5. http://archives.postgresql.org/pgsql-hackers/2010-02/msg01105.php This is intended for application to 9.1, and is supporting infrastructure for knngist. It looks like there should be a 5 rather than a 4 for nkeys of SearchSysCacheList():

+#define SearchSysCacheList5(cacheId, key1, key2, key3, key4, key5) \
+	SearchSysCacheList(cacheId, 4, key1, key2, key3, key4, key5)

Joachim
[HACKERS] Parallel pg_dump for 9.1
People have been talking about a parallel version of pg_dump a few times already. I have been working on some proof-of-concept code for this feature every now and then and I am planning to contribute this for 9.1. There are two main issues with a parallel version of pg_dump: the first one is that it requires a consistent snapshot among multiple pg_dump clients, and the second is that currently the output goes to a single file and it is unclear what to do about multiple processes writing into a single file.

- There are ideas on how to solve the issue with the consistent snapshot but in the end you can always solve it by stopping your application(s). I actually assume that whenever people are interested in a very fast dump, it is because they are doing some maintenance task (like migrating to a different server) that involves pg_dump. In these cases, they would stop their system anyway. Even if we had consistent snapshots in a future version, would we forbid people to run parallel dumps against old server versions? What I suggest is to just display a big warning if run against a server without consistent snapshot support (which currently is every version).

- Regarding the output of pg_dump I am proposing two solutions. The first one is to introduce a new archive type "directory" where each table and each blob is a file in a directory, similar to the experimental "files" archive type. Also the idea has come up that you should be able to specify multiple directories in order to make use of several physical disk drives. Thinking this further, in order to manage all the mess that you can create with this, every file of the same backup needs to have a unique identifier and pg_restore should have a check parameter that tells you if your backup directory is in a sane and complete state (think about moving a file from one backup directory to another one or trying to restore from two directories which are from different backup sets...).
The second solution to the single-file problem is to generate no output at all, i.e. whatever you export from your source database you import directly into your target database, which in the end turns out to be a parallel form of pg_dump | psql. In fact, technically this is rather a parallel pg_restore than a pg_dump as you need to respect the dependencies between objects. The good news is that with the parallel pg_restore of the custom archive format we have everything in place already for this dependency checking. The addition is a new archive type that dumps (just-in-time) whatever the dependency algorithm decides to restore next. This is probably the fastest way that we can copy or upgrade a database when pg_migrator cannot be used (for example when you migrate to a different hardware architecture). As said, I have some working code for the features described (Unix only); if anybody would like to give it a try already now, just let me know, I'd be happy to get some early test reports and you could check for the speedup to expect. But before I continue, I'd like to have a discussion about what people actually want and what is the best way to go forward here. I am currently not planning to make parallel dumps work with the custom format even though this would be possible if we changed the format to a certain degree. Comments? Joachim
Re: [HACKERS] 9.0 release notes done
On Sat, Mar 20, 2010 at 5:02 AM, Bruce Momjian br...@momjian.us wrote: Interestingly the 9.0 release notes contain 201 items, while the 8.4 release notes contained 314 items. Is the following pg_dump change covered by the release notes? I couldn't find it. It was the last committed patch from the 2010-01 commitfest... http://archives.postgresql.org/pgsql-committers/2010-02/msg00233.php https://commitfest.postgresql.org/action/patch_view?id=247 Joachim
Re: [HACKERS] Re: Hot Standby query cancellation and Streaming Replication integration
On Sun, Feb 28, 2010 at 2:54 PM, Greg Stark gsst...@mit.edu wrote: Really? I think we get lots of surprised wows from the field from the idea that a long-running read-only query can cause your database to bloat. I think the only reason that's obvious to us is that we've been grappling with that problem for so long. It seems to me that the scenario that you are looking at is one where people run different queries with and without HS, i.e. that they will run longer read-only queries than now once they have HS. I don't think that is the case. If it isn't, you cannot really speak of master bloat. Instead, I assume that most people who will grab 9.0 and use HS+SR already have a database with a certain query profile. Now with HS+SR they will try to put the most costly and longest read-only queries on the standby but in the end will run the same number of queries with the same overall complexity. Now let's take a look at both scenarios from the administrators' point of view:

1) With the current implementation they will see better performance on the master and more aggressive vacuum (!), since they have fewer long-running queries now on the master and autovacuum can kick in and clean up with less delay than before. On the other hand their queries on the standby might fail and they will start thinking that this HS+SR feature is not as convincing as they thought it was... The next step for them is to take the documentation and study it for a few days to learn all about vacuum, the different delays, transaction ids and age parameters, and experiment for a few weeks until no more queries fail - for a while... But they can never be sure... In the end they might also modify the parameters in the wrong direction or overshoot because of lack of time to experiment and lose another important property without noticing (like being as close as possible to the master).
2) On the other hand if we could ship 9.0 with the xmin-propagation feature, people would still see a better performance and have a hot standby system but this time without query cancellations. Again: the read-only queries that will be processed by the HS in the future are being processed by the master today anyway, so why should it get worse? The first impression will be that it just works nicely out of the box, is easy to set up and has no negative effect (query cancellation) that has not already shown up before (vacuum lag). I guess that most people will just run fine with this setup and never get to know about the internals. Of course we should still offer an expert mode where you can turn all kinds of knobs and where you can avoid the vacuum dependency but it would be nice if this could be the expert mode only. Tuning this is highly installation specific and you need to have a deep understanding of how PostgreSQL and HS work internally and what you actually want to achieve... Agreed. Though I think it'll be bad in that case even if we have a plan B. It'll mean no file-based log shipping replicas and no guarantee that what you run on the standby can't affect the master -- which is a pretty nice guarantee. It'll also mean it'll be much more fragile against network interruptions. Regarding the network interruptions... in reality if you have network interruptions of several minutes between your primary and your standby, you have worse problems anyway... If the standby does not renew its xmin for n seconds, log a message and just go on... Joachim
Re: [HACKERS] Re: Hot Standby query cancellation and Streaming Replication integration
On Sun, Feb 28, 2010 at 8:47 PM, Josh Berkus j...@agliodbs.com wrote: 1) Automated retry of cancelled queries on the slave. I have no idea how hard this would be to implement, but it makes the difference between writing lots of exception-handling code for slave connections (unacceptable) to just slow response times on the slave (acceptable). We're not only canceling queries, we are effectively canceling transactions. It seems quite impossible to repeat all queries from a transaction that has started in the past. One query might build on the result of a previous query, and since the data has changed in the meantime, the client might now want to execute a different query when it gets a different result from that previous query... And even if it was possible, how often would you retry? You still have no guarantee that your query succeeds the second time. I'd claim that if a query failed once, chances are even higher that it fails again than that it succeeds the second time. Moreover, if you continue to repeat the query and queries come in at a certain rate, you need to process more and more queries on the slave, which will not really help other queries to finish in time, nor will it be beneficial for the throughput of the system as a whole... I fully agree with what you say about user expectations: we need to assume that many programs are not prepared for failures of simple read-only queries because in the past they have always worked... Another thing to keep in mind in these discussions is the inexpensiveness of servers today. This means that, if slaves have poor performance, that's OK; one can always spin up more slaves. But if each slave imposes a large burden on the master, then that limits your scalability. The burden of the xmin-publication feature is not the number of slaves, it's just the longest-running queries on whatever slave they are. So your argument applies to both cases...
To minimize the burden on the master, get additional slaves so that you can run your most expensive queries in a shorter time :-) Joachim
[HACKERS] Listen/Notify payload and interfaces
This one is for the maintainers of the various PostgreSQL interfaces: with the new listen/notify implementation we can send a payload along with the notification. This has been in the protocol for years already and there is no change needed for libpq. However, we need to adapt the various interfaces to allow a payload to be sent and received. A notification is sent via SQL: http://developer.postgresql.org/pgdocs/postgres/sql-notify.html and received via the libpq call PQnotifies: http://developer.postgresql.org/pgdocs/postgres/libpq-notify.html Joachim
Re: [HACKERS] Listen / Notify - what to do when the queue is full
On Tue, Feb 16, 2010 at 11:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Joachim Wieland j...@mcknight.de writes: [ listen/notify patch ] I found a number of implementation problems having to do with wraparound behavior and error recovery. I think they're all fixed, but any remaining bugs are probably my fault not yours. First, thanks for the rework you have done and thanks for applying this. While I can see a lot of improvements over my version, I think the logic in asyncQueueProcessPageEntries() needs to be reordered:

+ static bool
+ asyncQueueProcessPageEntries(QueuePosition *current,
+                              QueuePosition stop,
+                              char *page_buffer)
[...]
+     do
+     {
[...]
+         /*
+          * Advance *current over this message, possibly to the next page.
+          * As noted in the comments for asyncQueueReadAllNotifications, we
+          * must do this before possibly failing while processing the message.
+          */
+         reachedEndOfPage = asyncQueueAdvance(current, qe->length);
[...]
+         if (TransactionIdDidCommit(qe->xid))
[...]
+         else if (TransactionIdDidAbort(qe->xid))
[...]
+         else
+         {
+             /*
+              * The transaction has neither committed nor aborted so far,
+              * so we can't process its message yet.  Break out of the loop.
+              */
+             reachedStop = true;
+             break;

In the beginning you are advancing *current but later on you could find out that the transaction is still running. As the position in the queue has already advanced, you would miss one notification here because you'd restart directly behind this notification in the queue... Joachim
Re: [HACKERS] LISTEN/NOTIFY and notification timing guarantees
On Tue, Feb 16, 2010 at 6:20 AM, Tom Lane t...@sss.pgh.pa.us wrote: Another possibility is to force a ProcessIncomingNotifies scan to occur before we reach ReadyForQuery if we sent any notifies in the just-finished transaction --- but that won't help if there are uncommitted messages in front of ours. What about dealing with self-notifies in memory? i.e. copy them into a subcontext of TopMemoryContext in precommit and commit as usual. Then as a first step in ProcessIncomingNotifies() deliver whatever is in memory and then delete the context. While reading the queue, ignore all self-notifies there. If we abort for some reason, delete the context in AtAbort_Notify(). Would that work? Joachim
Re: [HACKERS] LISTEN/NOTIFY and notification timing guarantees
On Tue, Feb 16, 2010 at 1:31 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Tom Lane wrote: We could adopt the historical policy of sending self-notifies pre-commit, but that doesn't seem tremendously appetizing from the standpoint of transactional integrity. But one traditional aspect of transactional integrity is that a transaction always sees *its own* uncommitted work. True but notifications aren't sent until the transaction commits anyway. At the time when an application receives its self-notifies, it has already committed the transaction so there is no uncommitted work anymore. Wouldn't the historical policy of PostgreSQL self-notifies be consistent with that? No. The policy is also to not see the committed work if for some reason the transaction had to roll back during commit. In this case we'd also expect getting no notification from this transaction at all and this is what is violated here. Joachim
Re: [HACKERS] LISTEN/NOTIFY and notification timing guarantees
On Mon, Feb 15, 2010 at 3:31 AM, Tom Lane t...@sss.pgh.pa.us wrote: I'm not sure how probable it is that applications might be coded in a way that relies on the properties lost according to point #2 or #3. Your observations are all correct as far as I can tell. One question regarding #2: Is a client application able to tell whether or not it has received all notifications from one batch? i.e. does PQnotifies() return NULL only when the backend has sent over the complete batch of notifications or could it also return NULL while a batch is still being transmitted but the client-side buffer just happens to be empty? We could fix #2 by not releasing AsyncQueueLock between pages when queuing messages. This has no obvious downsides as far as I can see; if anything it ought to save some cycles and contention. Currently transactions with a small number of notifications can deliver their notifications and then proceed with their commit while transactions with many notifications need to stay there longer, so the current behavior is fair in this respect. Changing the locking strategy makes the small volume transactions wait for the bigger ones. Also currently readers can already start reading while writers are still writing (until they hit the first uncommitted transaction of their database). I think preserving the property that self-notifies are delivered immediately upon commit might be more important than that. Fine with me, sounds reasonable :-) Joachim
Re: [HACKERS] Listen / Notify - what to do when the queue is full
On Sun, Feb 14, 2010 at 11:44 PM, Simon Riggs si...@2ndquadrant.com wrote: Next set of questions * Will this work during Hot Standby now? The barrier was that it wrote to a table and so we could not allow that. ISTM this new version can and should work with Hot Standby. Can you test that and if so, remove the explicit barrier code and change tests and docs to enable it? I have tested it already. The point where it currently fails is the following line:

    qe->xid = GetCurrentTransactionId();

We record the TransactionId (of the notifying transaction) in the notification in order to later check if this transaction has committed successfully or not. If you tell me how we can find this out in HS, we might be done... The reason why we are doing all this is because we fear that we cannot write the notifications to disk once we have committed to clog... So we write them to disk before committing to clog and therefore need to record the TransactionId. * We also discussed the idea of having a NOTIFY command that would work from Primary to Standby. All this would need is some code to WAL log the NOTIFY if not in Hot Standby and for some recovery code to send the NOTIFY to any listeners on the standby. I would suggest that would be an option on NOTIFY to WAL log the notification: e.g. NOTIFY me 'with_payload' FOR STANDBY ALSO; What should happen if you wanted to replay a NOTIFY WAL record on the standby but cannot write to the pg_notify/ directory? * Don't really like pg_listening() as a name. Perhaps pg_listening_to() or pg_listening_on() or pg_listening_for() or pg_listening_channels() or pg_listen_channels() pg_listen_channels() sounds best to me but I leave this decision to a native speaker. * I think it's confusing that pg_notify is both a data structure and a function. Suggest changing one of those to avoid issues in understanding. Use pg_notify might be confused by a DBA. You are talking about the libpq data structure PGnotify I suppose...
I don't see it overly confusing but I wouldn't object changing it. There was a previous discussion about the name, see the last paragraph of http://archives.postgresql.org/message-id/dc7b844e1002021510i4aaa879fy8bbdd003729d2...@mail.gmail.com Joachim
Re: [HACKERS] Listen / Notify - what to do when the queue is full
On Mon, Feb 15, 2010 at 1:48 PM, Simon Riggs si...@2ndquadrant.com wrote: On Mon, 2010-02-15 at 12:59 +0100, Joachim Wieland wrote: I have tested it already. The point where it currently fails is the following line: qe->xid = GetCurrentTransactionId(); That's a shame. So it will never work in Hot Standby mode unless you can think of a different way. We could probably fake this on the Hot Standby in the following way: we introduce a commit record for every notifying transaction and write it into the queue itself. So right before writing anything else, we write an entry which informs readers that the following records are not yet committed. Then we write the actual notifications and commit. In post-commit we go back to the commit record and flip its status. Reading backends would stop at the commit record and we'd signal them so that they can continue. This actually plays nicely with Tom's intent to not have interleaved notifications in the queue (it makes things a bit easier but would probably work either way)... However, we'd need to make sure that we clean up that commit record even if something weird happens (similar to TransactionIdDidAbort() returning true) in order to allow the readers to proceed. Comments? Joachim
Re: [HACKERS] Parameter name standby_mode
On Fri, Feb 12, 2010 at 8:59 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Agreed. I've changed it now so that if primary_conninfo is not set, it doesn't try to establish a streaming connection. If you want to get the connection information from environment variables, you can use primary_conninfo=''. Why not just remove the default: if no primary_conninfo variable is set explicitly in the configuration file, check the environment variables; if the environment variables are not set either, don't try to establish a connection? Joachim