[HACKERS] pg_dump bug in 9.4beta2 and HEAD
I'm seeing an assertion failure with pg_dump -c --if-exists, which does not seem ready to handle BLOBs:

pg_dump: pg_backup_archiver.c:472: RestoreArchive: Assertion `mark != ((void *)0)' failed.

To reproduce:

$ createdb test
$ pg_dump -c --if-exists test      (works, dumps empty database)
$ psql test -c "select lo_create(1);"
$ pg_dump -c --if-exists test      (fails with the above assertion)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allowing parallel pg_restore from pipe
On Wed, Apr 24, 2013 at 4:05 PM, Stefan Kaltenbrunner <ste...@kaltenbrunner.cc> wrote:
>> What might make sense is something like pg_dump_restore which would have no intermediate storage at all, just pump the data etc from one source to another in parallel. But I pity the poor guy who has to write it :-)
>
> hmm pretty sure that Joachim's initial patch for parallel dump actually had a PoC for something very similar to that...

That's right, I implemented that as a separate output format and named it "migrator" I think, which wouldn't write each stream to a file as the directory output format does, but instead pumps it back into a restore client. Actually I think the logic was even reversed: it was a parallel restore that got the data from internally calling pg_dump functionality instead of from reading files... The neat thing about this approach was that the order was optimized and correct, i.e. the largest tables start first and dependencies get resolved in the right order. I could revisit that patch for 9.4 if enough people are interested.

Joachim
Re: [HACKERS] Optimizing pglz compressor
On Tue, Mar 5, 2013 at 8:32 AM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> With these tweaks, I was able to make pglz-based delta encoding perform roughly as well as Amit's patch.

Out of curiosity, do we know how pglz compares with other algorithms, e.g. lz4?
[HACKERS] Materialized view assertion failure in HEAD
I'm getting an assertion failure in HEAD with materialized views, see below for a backtrace. To reproduce, just run make installcheck, dump the regression database and then restore it; the server crashes during restore.

(gdb) bt
#0  0x7f283a366425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f283a369b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00888429 in ExceptionalCondition (conditionName=0xa0c840 "!(!matviewRel->rd_rel->relhasoids)", errorType=0xa0c78a "FailedAssertion", fileName=0xa0c780 "matview.c", lineNumber=135) at assert.c:54
#3  0x005dbfc4 in ExecRefreshMatView (stmt=0x1b70a28, queryString=0x1b6ff98 "REFRESH MATERIALIZED VIEW tvmm;\n\n\n", params=0x0, completionTag=0x7fff47af0a60 "") at matview.c:135
#4  0x007758a5 in standard_ProcessUtility (parsetree=0x1b70a28, queryString=0x1b6ff98 "REFRESH MATERIALIZED VIEW tvmm;\n\n\n", params=0x0, dest=0x1b70da0, completionTag=0x7fff47af0a60 "", context=PROCESS_UTILITY_TOPLEVEL) at utility.c:1173
#5  0x00773e3f in ProcessUtility (parsetree=0x1b70a28, queryString=0x1b6ff98 "REFRESH MATERIALIZED VIEW tvmm;\n\n\n", params=0x0, dest=0x1b70da0, completionTag=0x7fff47af0a60 "", context=PROCESS_UTILITY_TOPLEVEL) at utility.c:341
#6  0x00772d7e in PortalRunUtility (portal=0x1aeb6d8, utilityStmt=0x1b70a28, isTopLevel=1 '\001', dest=0x1b70da0, completionTag=0x7fff47af0a60 "") at pquery.c:1185
#7  0x00772f56 in PortalRunMulti (portal=0x1aeb6d8, isTopLevel=1 '\001', dest=0x1b70da0, altdest=0x1b70da0, completionTag=0x7fff47af0a60 "") at pquery.c:1317
#8  0x00772481 in PortalRun (portal=0x1aeb6d8, count=9223372036854775807, isTopLevel=1 '\001', dest=0x1b70da0, altdest=0x1b70da0, completionTag=0x7fff47af0a60 "") at pquery.c:814
#9  0x0076c155 in exec_simple_query (query_string=0x1b6ff98 "REFRESH MATERIALIZED VIEW tvmm;\n\n\n") at postgres.c:1048
#10 0x00770517 in PostgresMain (argc=2, argv=0x1ab4bb0, username=0x1ab4a88 "joe") at postgres.c:3969
#11 0x0070ccec in BackendRun (port=0x1ae5ac0) at postmaster.c:3989
#12 0x0070c401 in BackendStartup (port=0x1ae5ac0) at postmaster.c:3673
#13 0x00708ce6 in ServerLoop () at postmaster.c:1575
#14 0x00708420 in PostmasterMain (argc=3, argv=0x1ab2420) at postmaster.c:1244
#15 0x006704f7 in main (argc=3, argv=0x1ab2420) at main.c:197
Re: [HACKERS] Materialized view assertion failure in HEAD
On Tue, Mar 5, 2013 at 9:11 AM, Kevin Grittner <kgri...@ymail.com> wrote:
> Will investigate. You don't have default_with_oids = on, do you?

No, this was a default install with a default postgresql.conf.
Re: [HACKERS] posix_fadvise missing in the walsender
On Wed, Feb 20, 2013 at 4:54 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> I agree with Merlin and Joachim - if we have the call in one place, we should have it in both.
>
> We might want to assess whether we even want to have it in one place. I've seen cases where the existing call hurts performance, because of WAL file recycling.

That's interesting, I hadn't thought about WAL recycling. I now agree that this whole thing is even more complicated: you might also have an archive_command set, like cp for instance, that reads the WAL file back in, possibly right after we called posix_fadvise on it.

It appears to me that the right strategy depends on a few factors:

a) what ratio of your active dataset fits into RAM?
b) how many WAL files do you have?
c) how long does it take for them to get recycled?
d) is archive_command set / are wal_senders active?

Recommendations for the two extremes would be: if your dataset fits mostly into RAM and you have only a few WAL files that get recycled quickly, then you don't want to evict the WAL file from the buffer cache. On the other hand, if your dataset doesn't fit into RAM and you have many WAL files that take a while to get recycled, then you should consider hinting to the OS.

If you're in that second category (I am) and you're also using archive_command, you could just piggyback the posix_fadvise call onto your archive_command, assuming that the walsender is already done with the file at that moment. And I'm also pretty certain that the setup Robert used for the write scalability tests fell into the first category. So given the above, I think it's possible to come up with benchmarks that prove whatever you want to prove :-)

Joachim
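The piggybacking idea could be sketched as a tiny C program used in place of a plain cp in archive_command: copy the segment, then hint the kernel that the source file won't be needed again. This is purely an illustrative sketch (the function name and error handling are made up, and it is not part of any patch discussed here):

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Copy src to dst, then ask the kernel to drop src from the page cache. */
static int
copy_and_drop(const char *src, const char *dst)
{
    char    buf[8192];
    ssize_t n;
    int     in = open(src, O_RDONLY);
    int     out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (in < 0 || out < 0)
        return 1;
    while ((n = read(in, buf, sizeof(buf))) > 0)
        if (write(out, buf, (size_t) n) != n)
            return 1;
    fsync(out);                 /* make the copy durable before dropping cache */
    /* Hint: we won't read src again soon; the kernel may evict its pages. */
    posix_fadvise(in, 0, 0, POSIX_FADV_DONTNEED);
    close(in);
    close(out);
    return 0;
}
```

Whether this actually helps still depends on the factors above; if the walsender (or another archiver) rereads the file right afterwards, the hint is wasted.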
Re: [HACKERS] posix_fadvise missing in the walsender
On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> In access/transam/xlog.c we give the OS buffer caching a hint that we won't need a WAL file any time soon with posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
>
> I agree with Merlin and Joachim - if we have the call in one place, we should have it in both.

You could argue that if it's considered beneficial in the case with no walsender, then you should definitely have it if there are walsenders around: the walsenders reopen and read those files, which gives the OS reason to believe that other processes might do the same in the near future, and hence that it should not evict those pages too early.

Joachim
[HACKERS] posix_fadvise missing in the walsender
In access/transam/xlog.c we give the OS buffer cache a hint that we won't need a WAL file any time soon with

posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);

before closing the WAL file, but only if we don't have walsenders. That's reasonable, because a walsender would reopen that same file shortly after. However, the walsender doesn't call posix_fadvise once it's done with the file, and I'm proposing to add this call to walsender.c as well, for consistency. Since there can be multiple walsenders, only the slowest one should call this function. The slowest walsender can be found by inspecting shared memory and looking at the sentPtr of each walsender. Any comments?
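The "only the slowest walsender" check could look roughly like the following. This is a simplified sketch with made-up types: the real shared-memory state lives in the WalSnd structs and must be read under the appropriate spinlocks, and 9.x-era XLogRecPtr is a two-field struct rather than a plain integer. It only illustrates the minimum-over-sentPtr idea:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the real type */

typedef struct WalSenderSlot
{
    bool       active;
    XLogRecPtr sentPtr;         /* how far this walsender has sent */
} WalSenderSlot;

/*
 * Return true if mySentPtr is the minimum among all active walsenders,
 * i.e. the caller is the slowest one and may safely issue
 * POSIX_FADV_DONTNEED for segments it has finished with.
 */
static bool
am_slowest_walsender(const WalSenderSlot *slots, int n, XLogRecPtr mySentPtr)
{
    for (int i = 0; i < n; i++)
    {
        if (slots[i].active && slots[i].sentPtr < mySentPtr)
            return false;       /* someone else still needs the file */
    }
    return true;
}
```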
Re: [HACKERS] Logical decoding exported base snapshot
On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> One problem I see is that while exporting a snapshot solves the visibility issues of the table's contents, it does not protect against schema changes. I am not sure whether that's a problem. If somebody runs a CLUSTER or something like that, the table's contents will be preserved, including MVCC semantics. That's fine. The more problematic cases I see are TRUNCATE, DROP and ALTER TABLE.

This is why the pg_dump master process executes a LOCK TABLE ... IN ACCESS SHARE MODE for every table, so your commands would all block. In fact it's even more complicated, because the pg_dump worker processes also need to lock the table: they try to get a similar lock in NOWAIT mode right before dumping the table. If they don't get the lock, that means somebody else is waiting for an exclusive lock (the case you describe) and the backup will fail.

> Problem 2: To be able to do a SET TRANSACTION SNAPSHOT the source transaction needs to be alive. That's currently solved by exporting the snapshot in the walsender connection that did the INIT_LOGICAL_REPLICATION. The question is how long should we preserve that snapshot?

You lost me on this one after the first sentence... But in general the snapshot isn't such a magical thing: as long as the exporting transaction is open, it guarantees that there is a transaction out there holding off vacuum from removing data, and that the snapshot as-is has existed at some time in the past. Once it is applied to another transaction, you have two transactions that hold off vacuum, because they share the same xmin/xmax values. You could also just end the exporting transaction at that point.

One thought at the time was to somehow integrate exported snapshots with the prepared transactions feature: you could export a snapshot, prepare the exporting transaction, and have that snapshot frozen forever; it would even survive a server restart.
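The "two transactions sharing the same xmin hold off vacuum" argument can be modeled in a few lines of C. This is a deliberately simplified illustration (real transaction IDs are 32-bit and wrap around, and the real horizon computation lives in GetOldestXmin and reads PGPROC under a lock); it only shows why importing a snapshot keeps the vacuum horizon in place even after the exporting transaction ends:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId 0

typedef struct Proc
{
    TransactionId xmin;         /* oldest xid this backend still considers running */
} Proc;

/*
 * Global xmin: the oldest xmin of any running transaction.  Vacuum may
 * only remove tuples deleted before this horizon.
 */
static TransactionId
global_xmin(const Proc *procs, int n)
{
    TransactionId oldest = InvalidTransactionId;

    for (int i = 0; i < n; i++)
    {
        if (procs[i].xmin != InvalidTransactionId &&
            (oldest == InvalidTransactionId || procs[i].xmin < oldest))
            oldest = procs[i].xmin;
    }
    return oldest;
}
```

When a second transaction imports the snapshot, it adopts the same xmin, so the horizon does not move even if the exporting transaction then commits.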
Re: [HACKERS] parallel pg_dump
On Sat, Dec 8, 2012 at 3:05 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Sat, Dec 8, 2012 at 11:13:30AM -0500, Andrew Dunstan wrote:
>> I am working on it when I get a chance, but keep getting hammered. I'd love somebody else to review it too.
>
> FYI, I will be posting pg_upgrade performance numbers using Unix processes. I will try to get the Windows code working but will also need help.

Just let me know if there's anything I can help you guys with.

Joachim
Re: [HACKERS] [PATCH] pg_dump: Sort overloaded functions in deterministic order
On Wed, Oct 17, 2012 at 5:43 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> (I tested the new pg_dump with 8.2 and HEAD and also verified it passes pg_upgrade's make check. I didn't test with other server versions.)

I also tested against 8.3 and 8.4, since 8.4 is the version that introduced pg_get_function_identity_arguments. The included testcase fails on 8.3 and succeeds on 8.4 (pg_dump succeeds in both cases of course, but the output is only ordered deterministically on 8.4+).
Re: [HACKERS] Add FET to Default and Europe.txt
On Mon, Oct 8, 2012 at 4:00 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> We can't just refuse to deal with this ambiguity. We have to have some very-low-pain way to install settings that will please those large fractions of our user base. Moreover, if that very-low-pain way isn't the exact same way it's been done for the last half dozen releases, you'll already have managed to annoy those selfsame large fractions. You'd better have a good reason for changing it.

As previously noted, there are two topics being discussed here:

- how to allow users to configure their own timezone abbreviations, and
- how to keep the timezone abbreviations that we ship up to date with respect to zic changes.

Regarding configurability, we currently allow users (i.e. administrators) to change their abbreviations to whatever they like. The Default file we provide has been taken from the set that used to be compiled in back in the 8.x days (and we had this egregious GUC australian_timezones that decided whether CST/EST referred to US or Australian timezones - there was no such hack for any of the other ambiguities). These timezone abbreviations have developed over several decades, most of them without global communication in mind. They are full of inconsistencies and irregularities, and IMHO any way of automatically selecting a proper set for someone is doomed to solve the problem only partially. Another funny ambiguity is that the EST abbreviation in Australia can mean both Eastern Standard Time and Eastern Summer Time, with different offsets depending on what month of the year you're in...

The fact that we don't hear more complaints about the sets actually suggests that most people are happy with our Default set. In fact, Marc could just add his FET timezone to his own installations and use it from there. His request to add it to our Default set, however, is perfectly valid and should be discussed.

One thing that could be improved about configurability: if you're not an administrator of the database, i.e. if you don't have write access to where the timezones are defined, you're pretty much out of luck defining your own abbreviations. This is the case in a shared hosting environment, for example.

As for updating the timezone abbreviation files, I guess what we need is a parser for the zic database, one for our own timezone files, and a script that compares the two datasets and reports any conflicts, so that we can clean them up after manual inspection of the differences. I will see if I can come up with some scripts in this direction.
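The core of such a comparison script is simple: for every abbreviation we ship, check whether zic knows it with a different offset. A minimal sketch of that check, with made-up sample data and a hypothetical TzAbbrev type (a real script would first have to parse both datasets):

```c
#include <assert.h>
#include <string.h>

typedef struct TzAbbrev
{
    const char *abbrev;
    int         utc_offset_secs;
} TzAbbrev;

/*
 * Count entries in 'ours' whose abbreviation also exists in 'zic' but
 * with a different offset -- these are the conflicts needing manual review.
 */
static int
count_conflicts(const TzAbbrev *ours, int n_ours,
                const TzAbbrev *zic, int n_zic)
{
    int conflicts = 0;

    for (int i = 0; i < n_ours; i++)
        for (int j = 0; j < n_zic; j++)
            if (strcmp(ours[i].abbrev, zic[j].abbrev) == 0 &&
                ours[i].utc_offset_secs != zic[j].utc_offset_secs)
            {
                conflicts++;
                break;          /* one conflict per abbreviation is enough */
            }
    return conflicts;
}
```

Ambiguous abbreviations such as EST would show up as conflicts and still need a human decision, which is exactly the manual-inspection step proposed above.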
[HACKERS] Review for pg_dump: Sort overloaded functions in deterministic order
Patch looks good; all concerns that had been raised previously have been addressed in this version of the patch. The only thing that IMO needs to change is a stylistic issue:

if (fout->remoteVersion >= 80200)
{
    [...]
    (fout->remoteVersion >= 80400) ? "pg_catalog.pg_get_function_identity_arguments(oid)" : "NULL::text",
    [...]
}

Please just create a whole new if (fout->remoteVersion >= 80400) { [...] } branch here. Other than that, the feature works as advertised and does not negatively affect runtime or memory consumption (the new field is only added to functions / aggregates). Please send a new version of the patch with the above-mentioned item changed; the patch also needs rebasing (it's off by a couple of lines).
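The suggested restructuring might look like the following sketch. The query strings here are abbreviated stand-ins, not the real pg_dump SQL; the point is one dedicated branch per server version instead of a ?: buried inside a single query string:

```c
#include <assert.h>
#include <string.h>

/* Pick the function-dumping query based on server version. */
static const char *
func_query_for_version(int remoteVersion)
{
    if (remoteVersion >= 80400)
        /* 8.4+ has pg_get_function_identity_arguments() */
        return "SELECT oid, pg_catalog.pg_get_function_identity_arguments(oid) FROM pg_proc";
    else if (remoteVersion >= 80200)
        return "SELECT oid, NULL::text FROM pg_proc";
    else
        return "SELECT oid FROM pg_proc";
}
```

This mirrors how pg_dump generally handles version differences: each version cutoff gets its own branch, so a reader can see at a glance what is sent to which server.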
Re: [HACKERS] could not open relation with OID errors after promoting the standby to master
On Tue, May 22, 2012 at 9:50 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Hmm. I think that if you do it this way, the minimum recovery point won't be respected, which could leave you with a corrupted database. Now, if all the WAL files that you need are present in pg_xlog anyway, then they ought to get replayed anyway, but I think that if you are using restore_command (as opposed to streaming replication) we restore WAL segments under a different file name, which might cause this problem.

Maybe I should add some more details, so you get a better idea of what I did. The idea was to promote the standby to be the new master. Streaming replication was active, but at some point I had to take the master down. IIRC from the log, after the master went down the standby continued recovering from a bunch of archived log files (via restore_command); I suspected that either the standby was lagging behind a bit or that the master archived them during shutdown. When the standby didn't have anything left to recover from (saying both "xlog file foo doesn't exist" and "cannot connect to master"), I deleted recovery.conf on the standby and restarted it. I wouldn't have assumed any corruption was possible, given that I did clean shutdowns on both sides...
Re: [HACKERS] could not open relation with OID errors after promoting the standby to master
On Wed, May 16, 2012 at 11:38 PM, Alvaro Herrera <alvhe...@commandprompt.com> wrote:
> Well, that is not surprising in itself -- InitTempTableNamespace calls RemoveTempRelations to clean up from a possibly crashed previous backend with the same ID. So that part of the backtrace looks normal to me (unless there is something weird going on, which might very well be the case).

Right, I guess the stack trace is okay, but some state was obviously wrong. I was able to clean it up now by some catalog hacking, but I'm definitely going to dump and reload soon.

I found out that it was certain backend IDs that couldn't create temporary tables: when I did a CREATE TEMP TABLE in these few backend IDs (about 4-5, all with low ID numbers, which is why I hit them quite often), it would give me this "could not open relation with OID x" error. I also couldn't drop the temp schema in these backends:

# drop schema pg_temp_4;
ERROR: cache lookup failed for relation 1990987636
# select oid, * from pg_namespace;
(got oid 4664506 for pg_temp_4)
# select * from pg_class where oid = 1990987636;
(no rows returned)
# delete from pg_namespace where oid = 4664506;
DELETE 1
# create temp table myfoo(a int);
CREATE TABLE

Later on I also found some leftover pg_type entries from temporary tables that no longer exist. I'm quite certain I shouldn't see these anymore... And I also found a few entries in pg_class with relistemp='t' whose OIDs are considerably older than anything recent. This suggests that something weird might be going on when you have temp tables in flight and fail over; at least that's the only explanation I have for how this could have happened.
[HACKERS] could not open relation with OID errors after promoting the standby to master
I switched servers yesterday night and the previous slave is now the master. This is 9.0.6 (originally) / 9.0.7 (now) on Linux. Now I'm seeing a bunch of

ERROR: could not open relation with OID 1990987633
STATEMENT: create temp table seen_files (fileid integer)

Interestingly enough, 90% of these happen with CREATE TEMP TABLE statements, but I also see them with regular read-only SELECT statements; I'd say those are at most 10% of them. Some reports of this error message suggested running REINDEX, so I've run REINDEX TABLE and also VACUUM FULL on all system catalogs (just for good measure), but that didn't help much. Restarting the cluster didn't help either. If it matters, I did not promote the master with a trigger file, but restarted it after deleting recovery.conf. Everything else appears to be running fine, but since it's a) annoying and b) not very comforting, I might dump and restore tomorrow night.

I added a sleep() after this error message so that I could attach a debugger and grab a stack trace:

(gdb) bt
#0  0x003eb509a170 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x003eb5099fc4 in sleep () from /lib64/libc.so.6
#2  0x0046375c in relation_open (relationId=1990987633, lockmode=<value optimized out>) at heapam.c:906
#3  0x004b26e6 in heap_drop_with_catalog (relid=1990987633) at heap.c:1567
#4  0x004ace01 in doDeletion (object=0x62dfbc8, depRel=0x2b9ea756f6a8) at dependency.c:1046
#5  deleteOneObject (object=0x62dfbc8, depRel=0x2b9ea756f6a8) at dependency.c:1004
#6  0x004aeb00 in deleteWhatDependsOn (object=<value optimized out>, showNotices=0 '\000') at dependency.c:401
#7  0x004b6e90 in RemoveTempRelations () at namespace.c:3234
#8  InitTempTableNamespace () at namespace.c:3066
#9  0x004b70b5 in RangeVarGetCreationNamespace (newRelation=<value optimized out>) at namespace.c:351
#10 0x00536e13 in DefineRelation (stmt=0x638da00, relkind=114 'r', ownerId=0) at tablecmds.c:409
#11 0x0063c15f in standard_ProcessUtility (parsetree=<value optimized out>, queryString=0x6298460 "create temp table temp_defs_table_20068 (symhash char(32), symbolid integer)", params=0x0, isTopLevel=<value optimized out>, dest=<value optimized out>, completionTag=<value optimized out>) at utility.c:512
#12 0x00637d81 in PortalRunUtility (portal=0x61ec860, utilityStmt=0x62992d0, isTopLevel=1 '\001', dest=0x6299670, completionTag=0x7a9c1c30 "") at pquery.c:1191
#13 0x006384de in PortalRunMulti (portal=0x61ec860, isTopLevel=1 '\001', dest=0x6299670, altdest=0x6299670, completionTag=0x7a9c1c30 "") at pquery.c:1296
#14 0x00639732 in PortalRun (portal=0x61ec860, count=9223372036854775807, isTopLevel=1 '\001', dest=0x6299670, altdest=0x6299670, completionTag=0x7a9c1c30 "") at pquery.c:822
#15 0x006359e5 in exec_simple_query (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:1058
#16 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:3936
#17 0x005f95d6 in BackendRun () at postmaster.c:3560
#18 BackendStartup () at postmaster.c:3247
#19 ServerLoop () at postmaster.c:1431
#20 0x005fb934 in PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1092
#21 0x0058de36 in main (argc=3, argv=0x61dd1d0) at main.c:188
(gdb)

Any idea? It looks suspicious that it calls into RemoveTempRelations() from InitTempTableNamespace(), thereby removing the table it is about to create?
Re: [HACKERS] parallel pg_dump
On Tue, Apr 3, 2012 at 9:26 AM, Andrew Dunstan <and...@dunslane.net> wrote:
> First, either the creation of the destination directory needs to be delayed until all the sanity checks have passed and we're sure we're actually going to write something there, or it needs to be removed if we error exit before anything gets written there.

pg_dump also creates empty files, which is the analogous case here. Just try to dump a nonexistent database, for example (this also shows that delaying won't help...).

> Maybe pg_dump -F d should be prepared to accept an empty directory as well as a non-existent directory, just as initdb can.

That sounds like a good compromise. I'll implement that.

> Second, all the PrintStatus traces are annoying and need to be removed, or perhaps better only output in debugging mode (using ahlog() instead of just printf())

Sure, PrintStatus is just there for now to see what's going on. My plan was to remove it entirely in the final patch.

Joachim
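The "empty or non-existent directory" acceptance could be sketched like this. This is illustrative only (the committed pg_dump code may well differ in names and error handling); it shows the two cases a directory-format dump could accept, the same way initdb does:

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return true if 'path' does not exist, or exists and is an empty
 * directory -- the two cases a directory-format dump could accept.
 */
static bool
dir_is_usable(const char *path)
{
    DIR           *d = opendir(path);
    struct dirent *ent;
    bool           empty = true;

    if (d == NULL)
        return errno == ENOENT;     /* non-existent is fine; other errors are not */

    while ((ent = readdir(d)) != NULL)
    {
        if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0)
        {
            empty = false;          /* found a real entry: refuse to clobber it */
            break;
        }
    }
    closedir(d);
    return empty;
}
```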
Re: [HACKERS] patch for parallel pg_dump
On Tue, Apr 3, 2012 at 11:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
> OK, but it seems like a pretty fragile assumption that none of the workers will ever manage to emit any other error messages. We don't rely on this kind of assumption in the backend, which is a lot better-structured and less spaghetti-like than the pg_dump code.

Yeah, but even if they don't use exit_horribly(), the user would still see the output; stdout/stderr aren't closed and everyone can still write to them. As a test, I printed out some messages upon seeing a specific dump ID in a worker:

if (strcmp(command, "DUMP 3540") == 0)
{
    write_msg(NULL, "Info 1\n");
    printf("Info 2\n");
    exit_horribly(modulename, "that's why\n");
}

$ ./pg_dump -j 7 ...
pg_dump: Info 1
Info 2
pg_dump: [parallel archiver] that's why

if (strcmp(command, "DUMP 3540") == 0)
{
    write_msg(NULL, "Info 1\n");
    printf("Info 2\n");
    fprintf(stderr, "exiting on my own\n");
    exit(1);
}

$ ./pg_dump -j 7 ...
pg_dump: Info 1
Info 2
exiting on my own
pg_dump: [parallel archiver] A worker process died unexpectedly
Re: [HACKERS] patch for parallel pg_dump
On Tue, Apr 3, 2012 at 9:34 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sun, Apr 1, 2012 at 12:35 PM, Joachim Wieland <j...@mcknight.de> wrote:
>> Unfortunately this is not really the case. What is being moved out of pg_backup_archiver.c and into parallel.c is either the shutdown logic that has been applied only a few days ago, or is necessary to change the parallel restore logic from one-thread-per-dump-object to the message passing framework, where a worker starts in the beginning and then receives a new dump object from the master every time it's idle.
>
> Hmm. It looks to me like the part-two patch still contains a bunch of code rearrangement. For example, the current code for pg_backup_archiver.c contains this:
>
> typedef struct ParallelState
> {
>     int numWorkers;
>     ParallelStateEntry *pse;
> } ParallelState;
>
> In the revised patch, that's removed, and parallel.c instead contains this:
>
> typedef struct _parallel_state
> {
>     int numWorkers;
>     ParallelSlot *parallelSlot;
> } ParallelState;

Yes, this is what I referred to as the part of the shutdown logic that we only applied a few days ago. I basically backported what I had in parallel.c into pg_backup_archiver.c, which is where all the parallel logic lives at the moment. Moving it out of pg_backup_archiver.c and into a new parallel.c file means that I'd either have to move the declaration to a header or create accessor functions and declare them in a header. I actually tried it, and both solutions created more lines than they would save later on, especially counting the lines that will remove this temporary arrangement again...

The current parallel restore engine already has a ParallelSlot structure, but uses it in a slightly different way. That's why I created the one in the shutdown logic as ParallelStateEntry for now. This will be gone with the final patch; at the end there will only be a ParallelSlot left that serves both purposes. That's why you see this renaming (and the removal of the current ParallelSlot structure).

struct _parallel_state won't be used anywhere, except for forward declarations in headers. I just used it because that seemed to be the naming scheme; other structures are named similarly:

$ grep "struct _" pg_backup_archiver.c
typedef struct _restore_args
typedef struct _parallel_slot
typedef struct _outputContext

I'm fine with any name, just let me know what you prefer.

> On a similar note, what's the point of changing struct Archive to have int numWorkers instead of int number_of_jobs, and furthermore shuffling the declaration around to a different part of the struct?

number_of_jobs was in struct RestoreOptions before, a structure that is not used when doing a dump. I moved it to Archive, as it is used by both dump and restore, and since all other code talks about workers I changed the name to numWorkers.

> On another note, I am not sure that I like the messaging protocol you've designed. It seems to me that this has little chance of being reliable:
>
> + void (*on_exit_msg_func)(const char *modulename, const char *fmt, va_list ap) = vwrite_msg;
>
> I believe the idea here is that you're going to capture the dying gasp of the worker thread and send it to the master instead of printing it out. But that doesn't seem very reliable. There's code all over the place (and, in particular, in pg_dump.c) that assumes that we may write messages at any time, and in particular that we may emit multiple messages just before croaking.

I guess you're talking about the code in pg_dump that reads in the database schema and the details of all the objects in the schema. This code runs before forking off workers and is always executed in the master; pg_dump only forks once all the catalog data has been read, so only the actual TABLE DATA and BLOBs are dumped from the workers. I claim that in at least 90% of cases the functions involved here use exit_horribly() and output a clear message about why they're dying. If they just die instead, the master will say "worker died unexpectedly".

As you said a few mails back, any function exiting at this stage should rather call exit_horribly() to properly clean up after itself. The advantage of using the message passing system for the last error message is that you get exactly one message, and it's very likely that it accurately describes what happened to a worker to make it stop.

> Also, I like the idea of making it possible to use assertions in front-end code. But I think that if we're going to do that, we should make it work in general, not just for things that include pg_backup_archiver.h.

I completely agree; assertions helped a lot in dealing with concurrent code. How do you want to tackle this for now? Want me to create a separate header pg_assert.h as part of my patch? Or is it okay to factor it out later and include it from the general header then?
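A standalone front-end assertion header along the lines of the pg_assert.h idea could look like this. This is only a hypothetical sketch of what such a header might contain, not code from the patch; unlike the backend's Assert, it relies on nothing but stdio and abort():

```c
#include <stdio.h>
#include <stdlib.h>

/* Enable checks for this demonstration; real builds would get this
 * from the configure step (--enable-cassert). */
#define USE_ASSERT_CHECKING 1

#ifdef USE_ASSERT_CHECKING
#define Assert(condition) \
    do { \
        if (!(condition)) \
        { \
            fprintf(stderr, "assertion failed: %s, file %s, line %d\n", \
                    #condition, __FILE__, __LINE__); \
            abort(); \
        } \
    } while (0)
#else
#define Assert(condition) ((void) 0)
#endif
```

Because the macro compiles to ((void) 0) when assertions are disabled, it can be sprinkled through front-end code without any runtime cost in production builds.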
Re: [HACKERS] patch for parallel pg_dump
On Wed, Mar 28, 2012 at 5:19 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> First hurdle: It doesn't build under Windows/mingw-w64:
>
> parallel.c:40:12: error: static declaration of 'pgpipe' follows non-static declaration

Strange, I'm not seeing this, but I'm building with VC2005. What happens is that you're pulling in the pgpipe.h header. I have moved these functions as static functions into pg_dump, since you voted for removing them from the other location (because, as it turned out, nobody else is currently using them).

Joachim
Re: [HACKERS] patch for parallel pg_dump
On Thu, Mar 29, 2012 at 2:46 AM, Andrew Dunstan <and...@dunslane.net> wrote:
> But your patch hasn't got rid of them, and so it's declared twice. There is no pgpipe.h, BTW, it's declared in port.h. If VC2005 doesn't complain about the double declaration then that's a bug in the compiler, IMNSHO. Doesn't it even issue a warning? I no longer use VC2005 for anything, BTW, I use VC2008 or later.

I agree, the compiler should have found it, but no, I don't even get a warning. I just verified it: when I add a #error right after the prototypes in port.h, it hits the #error on every file, so it definitely sees both prototypes and doesn't complain... cl.exe is running with /W3.

> Anyway, ISTM the best thing is just for us to get rid of pgpipe without further ado. I'll try to get a patch together for that.

Thanks.
Re: [HACKERS] patch for parallel pg_dump
On Wed, Mar 28, 2012 at 1:46 PM, Robert Haas robertmh...@gmail.com wrote: I'm wondering if we really need this much complexity around shutting down workers. I'm not sure I understand why we need both a hard and a soft method of shutting them down. At least on non-Windows systems, it seems like it would be entirely sufficient to just send a SIGTERM when you want them to die. They don't even need to catch it; they can just die. At least on my Linux test system, even if all pg_dump processes are gone, the server happily continues sending data. When I strace an individual backend process, I see a lot of Broken pipe writes, but that doesn't stop it from just writing out the whole table to a closed file descriptor. This is a 9.0-latest server.

--- SIGPIPE (Broken pipe) @ 0 (0) ---
read(13, \220\370\0\0\240\240r\266\3\0\4\0\264\1\320\1\0 \4 \0\0\0\0\270\237\220\0h\237\230\0..., 8192) = 8192
read(13, \220\370\0\0\350\300r\266\3\0\4\0\264\1\320\1\0 \4 \0\0\0\0\260\237\230\0h\237\220\0..., 8192) = 8192
sendto(7, d\0\0\0Acpp\t15.0\t124524\taut..., 8192, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
read(13, \220\370\0\\341r\266\3\0\5\0\260\1\340\1\0 \4 \0\0\0\0\270\237\220\0p\237\220\0..., 8192) = 8192
sendto(7, d\0\0\0Dcpp\t15.0\t1245672000\taut..., 8192, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
read(13, <unfinished ...>

I guess that https://commitfest.postgresql.org/action/patch_view?id=663 would take care of that in the future, but without it (i.e. anything <= 9.2) it's quite annoying if you want to Ctrl-C a pg_dump and then have to manually hunt down and kill all the backend processes. I tested the above with immediately returning from DisconnectDatabase() in pg_backup_db.c on Linux. The important thing is that it calls PQcancel() on its connection before terminating. On Windows, several pages indicate that you can only cleanly terminate a thread from within the thread, e.g.
the last paragraph on http://msdn.microsoft.com/en-us/library/windows/desktop/ms686724%28v=vs.85%29.aspx The patch is basically doing what this page is recommending. I'll try your proposal about terminating in the child when the connection is closed, that sounds reasonable and I don't see an immediate problem with that. The existing coding of on_exit_nicely is intentional, and copied from similar logic for on_shmem_exit in the backend. Is there really a compelling reason to reverse the firing order of exit hooks? No, reversing the order was not intended. I rewrote it to a for loop because the current implementation modifies a global variable, and so on Windows only one thread would call the exit hook. I'll add all your other suggestions to the next version of my patch. Thanks! Joachim
Re: [HACKERS] patch for parallel pg_dump
On Fri, Mar 23, 2012 at 11:11 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: Are you going to provide a rebased version? Rebased version attached, this patch also includes Robert's earlier suggestions. parallel_pg_dump_5.diff.gz
Re: [HACKERS] patch for parallel pg_dump
On Fri, Mar 23, 2012 at 11:11 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: Are you going to provide a rebased version? Yes, working on that.
[HACKERS] double free in current HEAD's pg_dump
There's a double free in the current HEAD's pg_dump. Fix attached.

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2b0a5ff..57a6ccb 100644
*** a/src/bin/pg_dump/pg_dump.c
--- b/src/bin/pg_dump/pg_dump.c
*** dumpBlobs(Archive *fout, void *arg)
*** 2372,2379 ****
  		PQclear(res);
  	} while (ntups > 0);
  
- 	PQclear(res);
- 
  	return 1;
  }
--- 2372,2377 ----
Re: [HACKERS] patch for parallel pg_dump
On Fri, Mar 16, 2012 at 12:06 AM, Robert Haas robertmh...@gmail.com wrote: Good. The only exit handler I've seen so far is pgdump_cleanup_at_exit. If there's no other one, is it okay to remove all of this stacking functionality (see on_exit_nicely_index / MAX_ON_EXIT_NICELY) from dumputils.c and just define two global variables, one for the function and one for the arg that this function would operate on (or a struct of both)? No. That code is included by other things - like pg_dumpall - that don't know there's such a thing as an Archive. But I don't see that as a big problem; just on_exit_nicely() whatever you want. We could also add on_exit_nicely_reset(), if needed, to clear the existing handlers. Yes, on_exit_nicely_reset() would be what I'd need to remove all callbacks from the parent after the fork in the child process. I still can't find any other hooks except for pgdump_cleanup_at_exit from pg_dump.c. I guess what you're saying is that we provide dumputils.c to other programs, and even though none of them currently sets any exit callback, you want to keep the functionality so that they can set multiple exit hooks in the future should the need for them arise.
Re: [HACKERS] patch for parallel pg_dump
On Wed, Mar 14, 2012 at 2:02 PM, Robert Haas robertmh...@gmail.com wrote: I think we should get rid of die_horribly(), and instead arrange to always clean up AH via an on_exit_nicely hook. Good. The only exit handler I've seen so far is pgdump_cleanup_at_exit. If there's no other one, is it okay to remove all of this stacking functionality (see on_exit_nicely_index / MAX_ON_EXIT_NICELY) from dumputils.c and just define two global variables, one for the function and one for the arg that this function would operate on (or a struct of both)? We'd then have the current function and AHX (or only AH->connection from it) in the non-parallel case, and as soon as we enter the parallel dump, we can exchange it for another function operating on ParallelState*. This avoids having to deal with thread-local storage on Windows, because ParallelState* is just large enough to hold all the required data and a specific thread can easily find its own slot with its threadId. Sure, but since all the function does is write to it or access it, what good does that do me? It encapsulates the variable so that it can only be used for one specific use case. Seems pointless to me. Not so much to me if the alternative is to make ParallelState* a global variable, but anyway, with the concept proposed above, ParallelState* would be the arg that the parallel exit handler would operate on, so it would indeed be global but hidden behind a different name and a void* pointer. (I will address all the other points you brought up in my next patch)
Re: [HACKERS] patch for parallel pg_dump
On Wed, Mar 14, 2012 at 4:39 PM, Andrew Dunstan aduns...@postgresql.org wrote: I've just started looking at the patch, and I'm curious to know why it didn't follow the pattern of parallel pg_restore, which created a new worker for each table rather than passing messages to looping worker threads as this appears to do. That might have avoided a lot of the need for this message passing infrastructure, if it could have been done. But maybe I just need to review the patch and the discussions some more. The main reason for this design has now been overcome by the flexibility of the synchronized snapshot feature, which makes it possible to get the snapshot of a transaction even if that other transaction has been running for quite some time already. In other previously proposed implementations of this feature, workers had to connect at the same time and then could not close their transactions without losing the snapshot. The other drawback of the fork-per-TocEntry approach is the somewhat limited bandwidth of information from the worker back to the master: it's basically just the return code. That's fine if there is no error, but if there is, then the master can't tell any further details (e.g. could not get lock on table foo, or could not write to file bar: no space left on device). This restriction does not only apply to error messages. For example, what I'd also like to have in pg_dump would be checksums on a per-TocEntry basis. The individual workers would calculate the checksums when writing the file and then send them back to the master for integration into the TOC. I don't see how such a feature could be implemented in a straightforward way without a message passing infrastructure.
Re: [HACKERS] patch for parallel pg_dump
On Tue, Mar 13, 2012 at 1:53 PM, Robert Haas robertmh...@gmail.com wrote: What I mean is that the function ArchiveEntry() is defined in pg_backup_archiver.c, and it takes an argument called relpages, and the string relpages does not appear anywhere else in that file. Uhm, that's kinda concerning, isn't it... fixed... [...pgpipe...] Dunno. Can we get an opinion on that from one of the Windows guys? Andrew, Magnus? Waiting for the verdict here... If a child terminates without leaving a message, the master will still detect it and just say a worker process died unexpectedly (this part was actually broken, but now it's fixed :-) ) All that may be true, but I still don't see why it's right for this to apply in the cases where the worker thread says die_horribly(), but not in the cases where the worker says exit_horribly(). Hm, I'm not calling the error handler from exit_horribly because it doesn't have the AH. It looks like the code assumes that die_horribly() is called whenever AH is available and if not, exit_horribly() should be called which eventually calls these preregistered exit-hooks via exit_nicely() to clean up the connection. I think we should somehow unify both functions, the code is not very consistent in this respect, it also calls exit_horribly() when it has AH available. See for example pg_backup_tar.c Or is there another distinction between them? The question how to clean it up basically brings us back to the question what to do about global variables in general and for error handlers in particular. Or we change fmtQualifiedId to take an int and then we always pass the archive version instead of the Archive* ? Hmm, I think that might make sense. Done. 
+enum escrow_action { GET, SET };
+static void
+parallel_error_handler_escrow_data(enum escrow_action act, ParallelState *pstate)
+{
+	static ParallelState *s_pstate = NULL;
+
+	if (act == SET)
+		s_pstate = pstate;
+	else
+		*pstate = *s_pstate;
+}

This seems like a mighty complicated way to implement a global variable. Well, we talked about that before, when you complained that you couldn't get rid of the global g_conn because of the exit handler. You're right that in fact it is an indirect global variable here, but it's clearly limited to the use of the error handler and you can be sure that nobody other than this function writes to it or accesses it without calling this function. Sure, but since all the function does is write to it or access it, what good does that do me? It encapsulates the variable so that it can only be used for one specific use case. Attaching a new version. parallel_pg_dump_4.diff.gz
Re: [HACKERS] patch for parallel pg_dump
On Thu, Feb 16, 2012 at 1:29 PM, Robert Haas robertmh...@gmail.com wrote: Can you provide an updated patch? Robert, updated patch is attached. parallel_pg_dump_2.diff.gz
Re: [HACKERS] patch for parallel pg_dump
On Wed, Feb 8, 2012 at 1:21 PM, Robert Haas robertmh...@gmail.com wrote: In my patch I dealt with exactly the same problem for the error handler by creating a separate function that has a static variable (a pointer to the ParallelState). The value is set and retrieved through the same function, so yes, it's kinda global but then again it can only be accessed from this function, which is only called from the error handler. How did you make this thread-safe? The ParallelState has a ParallelSlot for each worker process which contains that worker process's thread id. So a worker process just walks through the slots until it finds its own thread id and then goes from there. In particular, it only gets the file descriptors to and from the parent from this structure, to communicate the error it encountered, but it doesn't get the PGconn. This is because that error handler is only called when a worker calls die_horribly(AH, ...) in which case the connection is already known through AH. Termination via a signal just sets a variable that is checked in the I/O routines and there we also have AH to shut down the connection (actually to call die_horribly()). So we at least need to press on far enough to get to that point. Sure, let me know if I can help you with something.
Re: [HACKERS] patch for parallel pg_dump
On Tue, Feb 7, 2012 at 4:59 PM, Robert Haas robertmh...@gmail.com wrote: It turns out that (as you anticipated) there are some problems with my previous proposal. I assume you're talking to Tom, as my powers of anticipation are actually quite limited... :-) This is not quite enough to get rid of g_conn, but it's close: the major stumbling block at this point is probably exit_nicely(). The gyrations we're going through to make sure that AH->connection gets closed before exiting are fairly annoying; maybe we should invent something in dumputils.c along the line of the backend's on_shmem_exit(). Yeah, this becomes even more important with parallel jobs, where you want all worker processes to die once the parent exits. Otherwise some 10 already started processes would continue to dump your 10 largest tables for the next few hours with the master process long dead... All while you're about to start up the next master process... In my patch I dealt with exactly the same problem for the error handler by creating a separate function that has a static variable (a pointer to the ParallelState). The value is set and retrieved through the same function, so yes, it's kinda global but then again it can only be accessed from this function, which is only called from the error handler. I'm starting to think it might make sense to press on with this refactoring just a bit further and eliminate the distinction between Archive and ArchiveHandle. How about doing more refactoring after applying the patch? You'd then see what is really needed and then we'd also have an actual use case for more than one connection (You might have already guessed that this proposal is heavily influenced by my self-interest of avoiding too much work to make my patch match your refactoring)...
Re: [HACKERS] patch for parallel pg_dump
On Fri, Feb 3, 2012 at 7:52 AM, Robert Haas robertmh...@gmail.com wrote: If you're OK with that much I'll go do it. Sure, go ahead!
Re: [HACKERS] patch for parallel pg_dump
On Wed, Feb 1, 2012 at 12:24 PM, Robert Haas robertmh...@gmail.com wrote: I think we're more-or-less proposing to rename Archive to Connection, aren't we? And then ArchiveHandle can store all the things that aren't related to a specific connection. How about something like that: (Hopefully you'll come up with better names...)

StateHandle
{
    Connection
    Schema
    Archive
    Methods
    union
    {
        DumpOptions
        RestoreOptions
    }
}

Dumping would mean to do: Connection -> Schema -> Archive using DumpOptions through the specified Methods. Restore: Archive -> Schema -> Connection using RestoreOptions through the specified Methods. Dumping from one database and restoring into another one would be two StateHandles with different Connections, Archive == NULL, Schema pointing to the same Schema, Methods most likely also pointing to the same function pointer table and each with different Options in the union of course. Granted, you could say that above I've only grouped the elements of the ArchiveHandle, but I don't really see that breaking it up into several objects makes it any better or easier. There could be some accessor functions to hide the details of the whole object to the different consumers.
Re: [HACKERS] patch for parallel pg_dump
On Tue, Jan 31, 2012 at 9:05 AM, Robert Haas robertmh...@gmail.com wrote: And just for added fun and excitement, they all have inconsistent naming conventions and inadequate documentation. I think if we need more refactoring in order to support multiple database connections, we should go do that refactoring. The current situation is not serving anyone well. I guess I'd find it cleaner to have just one connection per Archive (or ArchiveHandle). If you need two connections, why not just have two Archive objects, as they would have different characteristics anyway, one for dumping data, one to restore.
Re: [HACKERS] patch for parallel pg_dump
On Mon, Jan 30, 2012 at 12:20 PM, Robert Haas robertmh...@gmail.com wrote: But the immediate problem is that pg_dump.c is heavily reliant on global variables, which isn't going to fly if we want this code to use threads on Windows (or anywhere else). It's also bad style. Technically, since most of pg_dump.c dumps the catalog and since this isn't done in parallel but only in the master process, most functions need not be changed for the parallel restore. Only those that are called from the worker threads need to be changed; this has been done in e.g. dumpBlobs(), the change that you quoted upthread. But it seems possible that we might someday want to dump from one database and restore into another database at the same time, so maybe we ought to play it safe and use different variables for those things. Actually I've tried that but in the end concluded that it's best to have at most one database connection in an ArchiveHandle if you don't want to do a lot more refactoring. Besides the normal connection parameters like host, port, ... there's also std_strings, encoding, savedPassword, currUser/currSchema, lo_buf, remoteVersion ... where it wouldn't be obvious which connection they belonged to. Speaking about refactoring, I'm happy to also throw in the idea of making the dump and restore more symmetrical than they are now. I kinda disliked RestoreOptions* being a member of the ArchiveHandle without having something similar for the dump. Ideally I'd say there should be a DumpOptions and a RestoreOptions field (with a struct Connection being part of them containing all the different connection parameters). They could be a union if you wanted to allow only one connection, or not if you want more than one.
Re: [HACKERS] patch for parallel pg_dump
On Fri, Jan 27, 2012 at 4:57 PM, Robert Haas robertmh...@gmail.com wrote: But even if you do know that subclassing is intended, that doesn't prove that the particular Archive object is always going to be an ArchiveHandle under the hood. If it is, why not just pass it as an ArchiveHandle to begin with? I know that you took back some of your comments, but I'm with you here. Archive is allocated as an ArchiveHandle and then cast back to Archive*, so you always know that an Archive is an ArchiveHandle. I'm all for getting rid of Archive and just using ArchiveHandle throughout pg_dump, which would get rid of these useless casts. I admit that I might have made it a bit worse by adding a few more of these casts, but the fundamental issue was already there and there is precedence for casting between them in both directions :-) Joachim
Re: [HACKERS] Sending notifications from the master to the standby
On Tue, Jan 10, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote: So this design is non-optimal both for existing uses and for the proposed new uses, which means nobody will like it. You could ameliorate #1 by adding a GUC that determines whether NOTIFY actually writes WAL, but that's pretty ugly. In any case ISTM that problem #2 means this design is basically broken. I chose to do it this way because it seemed like the most natural way to do it (which of course doesn't mean it's the best) :-). I agree that there should be a way to avoid the replication of the NOTIFYs. Regarding your second point though, remember that on the master we write notifications to the queue in pre-commit. And we also don't interleave notifications of different transactions. So once the commit record makes it to the standby, all the notifications are already there, just as on the master. In a burst of notifications, both solutions should more or less behave the same way but yes, the one involving the WAL file would be slower as it goes to the file system and back. I wonder whether it'd be practical to not involve WAL per se in this at all, but to transmit NOTIFY messages by having walsender processes follow the notify stream (as though they were listeners) and send the notify traffic as a separate message stream interleaved with the WAL traffic. Agreed, having walsender/receiver work as NOTIFY proxies is kinda smart... Joachim
Re: [HACKERS] Sending notifications from the master to the standby
On Tue, Jan 10, 2012 at 11:55 AM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: [ Tom sketches a design ] Seems a bit overcomplicated. I was just thinking of having walreceiver note the WAL endpoint at the instant of receipt of a notify message, and not release the notify message to the slave ring buffer until WAL replay has advanced that far. How about this: We mark a notify message specially if it is the last message sent by a transaction, and also add a flag to commit/abort records indicating whether or not the transaction has sent notifys. Now if such a last message is being put into the regular ring buffer on the standby and the xid is known to have committed or aborted, signal the backends. Also signal from a commit/abort record if the flag is set. If the notify messages make it to the standby first, we just put messages of a not-yet-committed transaction into the queue, just as on the master. Listeners will get signaled when the commit record arrives. If the commit record arrives first, we signal, but the listeners won't find anything (at least not the latest notifications). When the last notify of that transaction finally arrives, the transaction is known to have committed and the listeners will get signaled. What could still happen is that the standby receives notifys, the commit message and more notifys. Listeners would still eventually get all the messages, but potentially not all of them at once. Is this a problem? If so, then we could add a special "stop reading" record into the queue before we write the notifys, that we subsequently change into a "continue reading" record once all notifications are in the queue. Readers would treat a "stop reading" record just like a not-yet-committed transaction and ignore a "continue reading" record. Suggest we add something to initial handshake from standby to say please send me notify traffic, +1 on that.
From what you said I imagined this walsender listener as a regular listener that listens on the union of all sets of channels that anybody is listening on on the standby, with the LISTEN transaction on the standby returning from commit once the listener is known to have been set up on the master. Joachim
Re: [HACKERS] pg_restore --no-post-data and --post-data-only
On Tue, Nov 15, 2011 at 6:14 PM, Andrew Dunstan and...@dunslane.net wrote: Updated version with pg_restore included is attached. The patch applies with some fuzz by now but compiles without errors or warnings. The feature just works; it is not adding a lot of new code, basically it parses the given options and then skips over steps depending on the selected section. I verified the equivalence of -a and -s to the respective sections in the different archive formats and no surprise here either, they were equivalent except for the header (which has a timestamp). If you ask pg_restore to restore a section out of an archive which doesn't have this section, there is no error and the command just succeeds. This is what I expected and I think it's the right thing to do, but maybe others think that there should be a warning. In pg_restore, pre-data cannot be run in parallel, it only runs serially; data and post-data can run in parallel, though. This is also what I had expected, but it might be worth adding a note about this to the documentation. What I didn't like about the implementation was the two set_section() functions; I'd prefer them to move to a file that is shared between pg_dump and pg_restore and become one function...

Minor issues:

{section, required_argument, NULL, 5} in pg_dump.c is not in the alphabetical order of the options.

./pg_restore --section=foobar
pg_restore: unknown section name foobar)

Note the trailing ')', it's coming from a _(...) confusion.

Some of the lines in the patch have trailing spaces, and in the documentation part tabs and spaces are mixed.

int skip used as bool skip in dumpDumpableObject().

Joachim
Re: [HACKERS] synchronized snapshots
Hi Marko, On Wed, Sep 28, 2011 at 2:29 AM, Marko Tiikkaja marko.tiikk...@2ndquadrant.com wrote: In a sequence such as this: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; INSERT INTO foo VALUES (-1); SELECT pg_export_snapshot(); the row added to foo is not visible in the exported snapshot. If that's the desired behaviour, I think it should be mentioned in the documentation. Yes, that's the desired behaviour; the patch adds this paragraph to the documentation already: Also note that even after the synchronization both clients still run their own independent transactions. As a consequence, even though synchronized with respect to reading pre-existing data, both transactions won't be able to see each other's uncommitted data. I'll take a look at the other issues and update the patch either tonight or tomorrow. Thank you, Joachim
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 3:47 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 15.08.2011 04:31, Joachim Wieland wrote: The one thing that it does not implement is leaving the transaction in an aborted state if the BEGIN TRANSACTION command failed for an invalid snapshot identifier. So what if the snapshot is invalid, the SNAPSHOT clause silently ignored? That sounds really bad. No, the command would fail, but since it fails, it doesn't change the transaction state. What was proposed originally was to start a transaction but throw an error that leaves the transaction in the aborted state. But then the command had some effect because it started a transaction block, even though it failed. I can certainly see that this would be useful but I am not sure if it justifies introducing this inconsistency. We would have a BEGIN TRANSACTION command that left the session in a different state depending on why it failed... I don't understand what inconsistency you're talking about. What else can cause BEGIN TRANSACTION to fail? Is there currently any failure mode that doesn't leave the transaction in aborted state? Granted, it might only fail for parse errors so far, but that would include for example sending BEGIN DEFERRABLE to a pre-9.1 server. It wouldn't start a transaction and leave it in an aborted state, but it would just fail. I am wondering if pg_export_snapshot() is still the right name, since the snapshot is no longer exported to the user. It is exported to a file but that's an implementation detail. It's still exporting the snapshot to other sessions, that name still seems appropriate to me. ok. Joachim
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 6:41 AM, Florian Weimer fwei...@bfk.de wrote: * Simon Riggs: I don't see the need to change the BEGIN command, which is SQL Standard. We don't normally do that. Some language bindings treat BEGIN specially, so it might be difficult to use this feature. It's true, the command might require explicit support from language bindings. However I used some perl test scripts, where you can also send a START TRANSACTION command in a $dbh->do(...). The intended use case of this feature is still pg_dump btw... Joachim
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 6:09 PM, Jim Nasby j...@nasby.net wrote: I suspect that all the other cases of BEGIN failing would be syntax errors, so you would immediately know in testing that something was wrong. A missing file is definitely not a syntax error, so we can't really depend on user testing to ensure this is handled correctly. IMO, that makes it critical that that error puts us in an aborted transaction. Why can we not just require the user to verify if his BEGIN query failed or succeeded? Is that really too much to ask for? Also see what Robert wrote about proxies in between that keep track of the transaction state. Consider they see a BEGIN query that fails. How would they know if the session is now in an aborted transaction or not in a transaction at all? Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Mon, Feb 28, 2011 at 6:38 PM, Robert Haas robertmh...@gmail.com wrote: Remember that it's not only about saving shared memory, it's also about making sure that the snapshot reflects a state of the database that has actually existed at some point in the past. Furthermore, we can easily invalidate a snapshot that we have published earlier by deleting its checksum in shared memory as soon as the original transaction commits/aborts. And for these two a checksum seems to be a good fit. Saving memory then comes as a benefit and makes all those happy who don't want to argue about how many slots to reserve in shared memory or don't want to have another GUC for what will probably be a low-usage feature. But you can do all of this with files too, can't you? Just remove or truncate the file when the snapshot is no longer valid. Sure we can, but it looked like the consensus of the first discussion was that the through-the-client approach was more flexible. But then again nobody is actively arguing for that anymore.
Re: [HACKERS] Snapshot synchronization, again...
On Sun, Feb 27, 2011 at 3:04 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Why exactly, Heikki do you think the hash is more troublesome? It just feels wrong to rely on cryptography just to save some shared memory. Remember that it's not only about saving shared memory, it's also about making sure that the snapshot reflects a state of the database that has actually existed at some point in the past. Furthermore, we can easily invalidate a snapshot that we have published earlier by deleting its checksum in shared memory as soon as the original transaction commits/aborts. And for these two a checksum seems to be a good fit. Saving memory then comes as a benefit and makes all those happy who don't want to argue about how many slots to reserve in shared memory or don't want to have another GUC for what will probably be a low-usage feature. I realize that there are conflicting opinions on this, but from user point-of-view the hash is just a variant of the idea of passing the snapshot through shared memory, just implemented in an indirect way. The user will never see the hash, why should he bother? The user point of view is that he receives data and can obtain the same snapshot if he passed that data back. This user experience is no different from any other way of passing the snapshot through the client. And from the previous discussions this seemed to be what most people wanted. And how could we validate/invalidate snapshots without the checksum (assuming the through-the-client approach instead of storing the whole snapshot in shared memory)? Either you accept anything that passes sanity checks, or you store the whole snapshot in shared memory (or a temp file). I'm not sure which is better, but they both seem better than the hash. True, both might work but I don't see a real technical advantage over the checksum approach for any of them, rather the opposite. 
Nobody has come up with a use case for the accept-anything option so far, so I don't see why we should commit ourselves to this feature at this point, given that we have a cheap and easy way of validating/invalidating snapshots. And I might be just paranoid but I also fear that someone could raise security issues for the fact that you would be able to request an arbitrary database state from the past and inspect changes of other people's transactions. We might want to allow that later though, and I realize that we have to allow it for a standby server that would take over a snapshot from the master anyway, but I don't want to add this complexity to this first patch. However, I want to be able to allow it in the future without touching the external API of the feature. And for the tempfile approach, I don't see that the creation and removal of the temp file is any less code complexity than flipping a number in shared memory. Also it seemed that people rather wanted to go with the through-the-client approach because it seems to be more flexible. Maybe you should just look at it as a conservative accept-anything approach that uses a checksum to allow only snapshots that we know have existed and have been published. Does the checksum still look so weird from this perspective? Joachim
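The export/validate/revoke cycle described in this subthread can be modeled outside PostgreSQL. Below is a minimal Python sketch (all names hypothetical; the real patch keeps the md5 digests in shared memory, one slot per backend): the client carries the full snapshot text, the server keeps only its digest, and deleting the digest revokes the snapshot.

```python
import hashlib

# Hypothetical stand-in for the shared-memory checksum slots.
published_checksums = set()

def export_snapshot(snapshot_text):
    """Publish a snapshot: hand the full text to the client, keep only its digest."""
    digest = hashlib.md5(snapshot_text.encode()).hexdigest()
    published_checksums.add(digest)
    return snapshot_text  # the client receives the snapshot itself, never the digest

def import_snapshot(snapshot_text):
    """Accept the snapshot only if its digest is still published, i.e. it is a
    snapshot that has really existed and whose exporter is still around."""
    digest = hashlib.md5(snapshot_text.encode()).hexdigest()
    return digest in published_checksums

def revoke_snapshot(snapshot_text):
    """Called when the exporting transaction commits or aborts."""
    published_checksums.discard(hashlib.md5(snapshot_text.encode()).hexdigest())

snap = "xmin:100 xmax:105 xip:101 xip:103"
export_snapshot(snap)
assert import_snapshot(snap)                  # valid while the exporter is alive
assert not import_snapshot("xmin:1 xmax:2")   # never published, so rejected
revoke_snapshot(snap)
assert not import_snapshot(snap)              # exporter finished, snapshot revoked
```

This is the "conservative accept-anything" view from the mail: arbitrary data can be presented, but only data matching a published checksum is accepted.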
Re: [HACKERS] Snapshot synchronization, again...
On Tue, Feb 22, 2011 at 3:34 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 22.02.2011 16:29, Robert Haas wrote: On Tue, Feb 22, 2011 at 8:58 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: No, the hash is stored in shared memory. The hash of the garbage has to match. Oh. Well that's really silly. At that point you might as well just store the snapshot and an integer identifier in shared memory, right? Yes, that's the point I was trying to make. I believe the idea of a hash was that it takes less memory than storing the whole snapshot (and more importantly, a fixed amount of memory per snapshot). But I'm not convinced either that dealing with a hash is any less troublesome. Both Tom and Robert voted quite explicitly against the store-in-shared-memory idea. Others don't want to allow people to request arbitrary snapshots, and again others wanted to pass the snapshot through the client so that in the future we could also request snapshots from standby servers. The hash idea seemed to at least make nobody unhappy. For easier review, here are a few links to the previous discussion: http://archives.postgresql.org/pgsql-hackers/2010-12/msg00361.php http://archives.postgresql.org/pgsql-hackers/2010-12/msg00383.php http://archives.postgresql.org/pgsql-hackers/2010-12/msg00481.php http://archives.postgresql.org/pgsql-hackers/2010-12/msg02454.php Why exactly, Heikki, do you think the hash is more troublesome? And how could we validate/invalidate snapshots without the checksum (assuming the through-the-client approach instead of storing the whole snapshot in shared memory)? Joachim
Re: [HACKERS] Snapshot synchronization, again...
Hi, On Mon, Feb 21, 2011 at 4:56 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: What's the reason for this restriction? if (databaseId != MyDatabaseId) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg(cannot import snapshot from a different database))); I just couldn't think of a use case for it and so didn't want to introduce a feature that we might then have to support in the future. Why are we using bytea as the output format instead of text? The output is just plain text anyway, as can be seen by setting bytea_output to 'escape'. Perhaps there would be a value to using bytea if we were to pglz_compress the result, but would there be a point in doing so? Normally exported info should be short enough that it would cause more overhead than it would save anyway. It is bytea because it should be thought of as just some data. It should be regarded more as a token than as text and should not be inspected/interpreted at all. If anybody decides to do so, fine, but then they should not rely on the stability of the message format for the future. Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Sat, Feb 19, 2011 at 9:17 PM, Peter Eisentraut pete...@gmx.net wrote: The only consideration against MD5 might be that it would make us look quite lame. We should probably provide builtin SHA1 and SHA2 functions for this and other reasons. In this particular case however the checksum is never exposed to the user but stays within shared memory. So nobody would notice that we are still using lame old md5sum :-) Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: 1. why are you using the expansible char array stuff instead of using the StringInfo facility? 2. is md5 the most appropriate digest for this? If you need a cryptographically secure hash, do we need something stronger? If not, why not just use hash_any? We don't need a cryptographically secure hash. There is no special reason for why it is like it is, I just didn't think of the better alternatives that you are proposing. Should I send an updated patch? Anything else? Thanks for the review, Joachim
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Tue, Feb 8, 2011 at 8:31 PM, Itagaki Takahiro itagaki.takah...@gmail.com wrote: On Tue, Feb 8, 2011 at 13:34, Robert Haas robertmh...@gmail.com wrote: So how close are we to having a committable version of this? Should we push this out to 9.2? I think so. The feature is pretty attractive, but more work is required: * Re-base on synchronized snapshots patch * Consider to use pipe also on Windows. * Research libpq + fork() issue. We have a warning in docs: http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html | On Unix, forking a process with open libpq connections can lead to unpredictable results Just for the record, once the sync snapshot patch is committed, there is no need to do fancy libpq + fork() combinations anyway. Unfortunately, so far no committer has commented on the synchronized snapshot patch at all. I am not fighting for getting parallel pg_dump done in 9.1, as I don't really have a personal use case for the patch. However it would be the irony of the year if we shipped 9.1 with a synchronized snapshot patch but no parallel dump :-) Joachim
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
Hi Jaime, thanks for your review! On Sun, Feb 6, 2011 at 2:12 PM, Jaime Casanova ja...@2ndquadrant.com wrote: code review: something i found, and is a very simple one, is this warning (there's a similar issue in _StartMasterParallel with the buf variable) pg_backup_directory.c: In function ‘_EndMasterParallel’: pg_backup_directory.c:856: warning: ‘status’ may be used uninitialized in this function Cool. My compiler didn't tell me about this. i guess the huge amount of info is showing the patch is just for debugging and will be removed before commit, right? That's right. functional review: it works good most of the time, just a few points: - if i interrupt the process the connections stay, i guess it could catch the signal and finish the connections Hm, well, recovering gracefully from errors could be improved. In your example you would signal the children implicitly because the parent process dies and the pipes to the children would get broken as well. Of course the parent could more actively terminate the children but it might not be the best option to just kill them, as then there will be a lot of unexpected EOF connections in the log. So if an error condition comes up in the parent (as in your example, because you canceled the process), then ideally the parent should signal the children with a non-lethal signal and the children should catch this "please terminate" signal and exit cleanly but as soon as possible. If the error case comes up at the child however, then we'd need to make sure that the user sees the error message from the child. This should work well as-is but currently it could happen that the parent exits before all of the children have exited. I'll investigate this a bit... 
- if i have an exclusive lock on a table and a worker starts dumping it, it fails because it can't take the lock but it just say it was ok and would prefer an error I'm getting a clear pg_dump: [Archivierer] could not lock table public.c: ERROR: could not obtain lock on relation c but I'll look into this as well. Regarding your other post: - there is no docs True... - pg_dump and pg_restore are inconsistent: pg_dump requires the directory to be provided with the -f option: pg_dump -Fd -f dir_dump pg_restore pass the directory as an argument for -Fd: pg_restore -Fd dir_dump Well, this is there with pg_dump and pg_restore currently as well. -F is the switch for the format and it just takes d as the format. The dir_dump is an option without any switch. See the output for the --help switches: Usage: pg_dump [OPTION]... [DBNAME] Usage: pg_restore [OPTION]... [FILE] So in either case you don't need to give a switch for what you have. If you run pg_dump you don't give the switch for the database but you need to give it for the output (-f) and with pg_restore you don't give a switch for the file that you're restoring but you'd need to give -d for restoring to a database. Joachim
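The clean-shutdown sequence sketched in the reply above (parent sends a non-lethal "please terminate" signal, children catch it and exit cleanly, parent waits until all children have exited) can be modeled with a plain fork/signal/pipe sketch. This assumes a Unix platform and is illustrative only, not the pg_dump code:

```python
import os
import signal
import time

# Parent/child pair: the parent asks the child to stop with SIGTERM; the
# child catches it and reports a clean exit through a pipe before exiting,
# instead of dying mid-work and leaving "unexpected EOF" noise behind.
r, w = os.pipe()
pid = os.fork()
if pid == 0:                        # child ("worker")
    os.close(r)
    def on_term(signum, frame):
        os.write(w, b"clean-exit")  # finish up, then leave
        os._exit(0)
    signal.signal(signal.SIGTERM, on_term)
    while True:                     # pretend to be busy dumping
        time.sleep(0.05)
else:                               # parent ("master")
    os.close(w)
    time.sleep(0.2)                 # parent decides to shut down
    os.kill(pid, signal.SIGTERM)    # the non-lethal "please terminate" signal
    msg = os.read(r, 64)
    os.waitpid(pid, 0)              # do not exit before the child has exited
    assert msg == b"clean-exit"
```

The `waitpid` at the end is the point raised in the mail: the parent must not exit before all of the children have.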
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Thu, Feb 3, 2011 at 11:46 PM, Itagaki Takahiro itagaki.takah...@gmail.com wrote: I think we have 2 important technical issues here: * The consistency is not perfect. Each transaction is started with small delays in step 1, but we cannot guarantee no other transaction between them. This is exactly where the patch for synchronized snapshots comes into play. See https://commitfest.postgresql.org/action/patch_view?id=480 * Can we inherit connections to child processes with fork() ? Moreover, we also need to pass running transactions to children. I wonder libpq is designed for such usage. As far as I know you can inherit sockets to a child program, as long as you make sure that after the fork only one of them, parent or child, uses the socket; the other one should close it. But this wouldn't be an issue with the above mentioned patch anyway. It might be better to remove Windows-specific codes from the first try. I doubt Windows message queue is the best API in such console-based application. I hope we could use the same implementation for all platforms for inter-process/thread communication. Windows doesn't support pipes, but offers message queues to exchange messages. Parallel pg_dump only exchanges messages in the form of DUMP 39209 or RESTORE OK 48 23 93, it doesn't exchange any large chunks of binary data, just these small textual messages. The messages also stay within the same process, they are just sent between the different threads. The Windows part worked just fine when I tested it last time. Do you have any other technology in mind that you think is better suited? Joachim
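The control traffic described above is small enough that any byte-stream transport works. A hedged sketch of the master/worker exchange over ordinary Unix pipes, with message strings modeled on the ones quoted in the mail (this is illustrative, not the actual pg_dump implementation, which uses pipes on Unix and message queues between threads on Windows):

```python
import os

# Two one-directional pipes: master -> worker for commands,
# worker -> master for results. The protocol is short text messages
# like "DUMP 39209", never bulk table data.
cmd_r, cmd_w = os.pipe()
res_r, res_w = os.pipe()

pid = os.fork()
if pid == 0:                               # worker
    os.close(cmd_w)
    os.close(res_r)
    cmd = os.read(cmd_r, 64).decode()      # e.g. "DUMP 39209"
    verb, oid = cmd.split()
    # ... the worker would dump the object with this OID here ...
    os.write(res_w, "{} OK {}".format(verb, oid).encode())
    os._exit(0)
else:                                      # master
    os.close(cmd_r)
    os.close(res_w)
    os.write(cmd_w, b"DUMP 39209")         # hand out a work item
    reply = os.read(res_r, 64).decode()    # collect the status report
    os.waitpid(pid, 0)
    assert reply == "DUMP OK 39209"
```

Because only small textual messages cross the channel, the transport choice (pipe vs. message queue) is an implementation detail, which is the point made in the mail.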
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas robertmh...@gmail.com wrote: The parallel pg_dump portion of this patch (i.e. the still-uncommitted part) no longer applies. Please rebase. Here is a rebased version with some minor changes as well. I haven't tested it on Windows now but will do so as soon as the Unix part has been reviewed. Joachim parallel_pg_dump.patch.gz
Re: [HACKERS] Snapshot synchronization, again...
Here is a new version of the patch incorporating most of Noah's suggestions. The patch now also adds documentation. Since I couldn't really find a suitable section to document the two new functions, I added a new one for now. Any better ideas where it should go? On Thu, Jan 20, 2011 at 1:37 AM, Noah Misch n...@leadboat.com wrote: Just to clarify, I was envisioning something like: typedef struct { bool valid; char value[16]; } snapshotChksum; I don't object to the way you've done it, but someone else might not like the extra marshalling between text and binary. Your call. I didn't like it in the beginning but later on I adopted your proposal. I have also changed the locking to be more natural. Even though we don't really need it, I am now grabbing shared ProcArrayLock for any reads of shared memory and exclusive lock for writes. Of course no additional lock is taken if the feature is not used. You're right. Then consider VALUES (pg_import_snapshot('...'), (SELECT count(*) from t)) at READ COMMITTED. It works roughly as I'd guess; the subquery uses the imported snapshot. If I flip the two expressions and do VALUES ((SELECT count(*) from t), pg_import_snapshot('...')), the subquery uses the normal snapshot. That makes sense, but I can't really see a use case for it. A warning, like your code has today, seems appropriate. Yeah, that would do what you wanted to illustrate but it truly cannot be considered the standard use case :-) Is it valid to scribble directly on snapshots like this? I figured that previously executed code still has references pointing to the snapshots and so we cannot just push a new snapshot on top but really need to change the memory where they are pointing to. Okay. Had no special reason to believe otherwise, just didn't know. This is one part where I'd like someone more experienced than me to look into it. Thanks; that was handy. 
One thing I noticed is that the second SELECT * FROM kidseen yields zero rows instead of yielding ERROR: relation kidseen does not exist. I changed things around per the attached test-drop.pl, such that the table is already gone in the ordinary snapshot, but still visible to the imported snapshot. Note how the pg_class row is visible, but an actual attempt to query the table fails. Must some kind of syscache invalidation follow the snapshot alteration to make this behave normally? As discussed with Noah off-list this is just not an MVCC safe operation. You could hit on this in a regular SERIALIZABLE transaction as well: Somebody else can delete a table and you won't be able to access it anymore. This is also the reason why pg_dump in the beginning gets a shared lock on every table that it will dump later on. Thanks for the review Noah, Joachim diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index ed2039c..4169594 100644 *** a/doc/src/sgml/func.sgml --- b/doc/src/sgml/func.sgml *** FOR EACH ROW EXECUTE PROCEDURE suppress_ *** 14761,14764 --- 14761,14860 xref linkend=SQL-CREATETRIGGER. /para /sect1 + + sect1 id=functions-snapshotsync +titleSnapshot Synchronization Functions/title + +indexterm + primarypg_export_snapshot/primary +/indexterm +indexterm + primarypg_import_snapshot/primary +/indexterm + +para + productnamePostgreSQL/ allows different sessions to synchronize their + snapshots. A database snapshot determines which data is visible to + the client that is using this snapshot. Synchronized snapshots are necessary if + two clients need to see the same content in the database. If these two clients + just connected to the database and opened their transactions, then they could + never be sure that there was no data modification right between both + connections. To solve this, productnamePostgreSQL/ offers the two functions + functionpg_export_snapshot/ and functionpg_import_snapshot/. 
+/para +para + The idea is that one client retrieves the information about the snapshot that it + is currently using and then the second client passes this information back to + the server. The database server will then modify the second client's snapshot + to match the snapshot of the first and from then on both can see the identical + data in the database. +/para +para + Note that a snapshot can only be imported as long as the transaction that + exported it originally is held open. Also note that even after the + synchronization both clients still run their own independent transactions. + As a consequence, even though synchronized with respect to reading pre-existing + data, both transactions won't be able to see each other's uncommitted data. +/para +table id=functions-snapshot-synchronization + titleSnapshot Synchronization Functions/title + tgroup cols=3 + thead + rowentryName/entry entryReturn
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: It's part of the overall idea to make sure files are not inadvertently exchanged between different backups and that a file is not truncated. In the future I'd also like to add a checksum to the TOC so that a backup can be checked for integrity. This will cost performance but with the parallel backup it can be distributed to several processors. Ok. I'm going to leave out the filesize. I can see some value in that, and the CRC, but I don't want to add stuff that's not used at this point. Okay. The header is there to identify a file, it contains the header that every other pgdump file contains, including the internal version number and the unique backup id. The tar format doesn't support compression so going from one to the other would only work for an uncompressed archive and special care must be taken to get the order of the tar file right. Hmm, tar format doesn't support compression, but looks like the file format issue has been thought of already: there's still code there to add .gz suffix for compressed files. How about adopting that convention in the directory format too? That would make an uncompressed directory format compatible with the tar format. So what you could do is dump in the tar format, untar and restore in the directory format. I see that this sounds nice but still I am not sure why someone would dump to the tar format in the first place. But you still cannot go back from the directory archive to the tar archive because the standard command line tar will not respect the order of the objects that pg_restore expects in a tar format, right? That seems pretty attractive anyway, because you can then dump to a directory, and manually gzip the data files later. The command line gzip will probably add its own header to the file that pg_restore would need to strip off... 
This is a valid use case for people who are concerned with a fast dump; usually they would dump uncompressed and later compress the archive. However once we have parallel pg_dump, this advantage vanishes. Now that we have an API for compression in compress_io.c, it probably wouldn't be very hard to implement the missing compression support to tar format either. True, but the question of the advantage of the tar format remains :-) A tar archive has the advantage that you can postprocess the dump data with other tools but for this we could also add an option that gives you only the data part of a dump file (and uncompresses it at the same time if compressed). Once we have that however, the question is what anybody would then still want to use the tar format for... I don't know how popular it'll be in practice, but it seems very nice to me if you can do things like parallel pg_dump in directory format first, and then tar it up to a file for archival. Yes, but you cannot pg_restore the archive then if it was created with standard tar, right? Joachim
Re: [HACKERS] pg_dump directory archive format / parallel pg_dump
On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Here are the latest patches all of them also rebased to current HEAD. Will update the commitfest app as well. What's the idea of storing the file sizes in the toc file? It looks like it's not used for anything. It's part of the overall idea to make sure files are not inadvertently exchanged between different backups and that a file is not truncated. In the future I'd also like to add a checksum to the TOC so that a backup can be checked for integrity. This will cost performance but with the parallel backup it can be distributed to several processors. It would be nice to have this format match the tar format. At the moment, there's a couple of cosmetic differences: * TOC file is called TOC, instead of toc.dat * blobs TOC file is called BLOBS.TOC instead of blobs.toc * each blob is stored as blobs/oid.dat, instead of blob_oid.dat That can be done easily... The only significant difference is that in the directory archive format, each data file has a header in the beginning. What are the benefits of the data file header? Would it be better to leave it out, so that the format would be identical to the tar format? You could then just tar up the directory to get a tar archive, or vice versa. The header is there to identify a file, it contains the header that every other pgdump file contains, including the internal version number and the unique backup id. The tar format doesn't support compression so going from one to the other would only work for an uncompressed archive and special care must be taken to get the order of the tar file right. If you want to drop the header altogether, fine with me but if it's just for the tar - directory conversion, then I am failing to see what the use case of that would be. 
A tar archive has the advantage that you can postprocess the dump data with other tools but for this we could also add an option that gives you only the data part of a dump file (and uncompresses it at the same time if compressed). Once we have that however, the question is what anybody would then still want to use the tar format for... Joachim
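The purpose of the per-file header debated above (identifying a file as belonging to a particular backup, so files are not inadvertently exchanged between backups) can be illustrated with a toy reader/writer. The magic string, version byte and 16-byte backup id below are made up for the sketch; they are not the real pg_dump header layout:

```python
import os
import struct
import tempfile

MAGIC = b"PGDMP"   # hypothetical magic tag, not the actual on-disk format
VERSION = 1

def write_member(path, backup_id, payload):
    """Prefix each data file with magic, version and the backup's unique id."""
    with open(path, "wb") as f:
        f.write(MAGIC + struct.pack("!B16s", VERSION, backup_id))
        f.write(payload)

def read_member(path, expected_backup_id):
    """Refuse files with a damaged header or from a different backup."""
    with open(path, "rb") as f:
        header = f.read(len(MAGIC) + 17)
        if not header.startswith(MAGIC):
            raise ValueError("not a dump file")
        version, backup_id = struct.unpack("!B16s", header[len(MAGIC):])
        if backup_id != expected_backup_id:
            raise ValueError("file belongs to a different backup")
        return f.read()

backup_id = os.urandom(16)
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "3309.dat")
    write_member(p, backup_id, b"table data")
    assert read_member(p, backup_id) == b"table data"
    try:
        read_member(p, os.urandom(16))     # member of some other backup
        raise AssertionError("should have been rejected")
    except ValueError:
        pass
```

The trade-off discussed in the thread is visible here: the header makes mixed-up files detectable, but it also makes the data files differ from plain tar members.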
Re: [HACKERS] Snapshot synchronization, again...
Noah, thank you for this excellent review. I have adopted most (almost all) of your suggestions (except for the documentation which is still missing). Please check the updated attached patch for details. On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch n...@leadboat.com wrote: is a valid md5 message digest. To avoid always accepting a snapshot with that digest, we would need to separately store a flag indicating whether a given syncSnapshotChksums member is currently valid. Maybe that hole is acceptable, though. In the end I decided to store md5 checksums as printable characters in shmem. We now need a few more bytes for each checksum but as soon as a byte is NUL we know that it is not a valid checksum. + + if (!XactReadOnly) + elog(WARNING, A snapshot exporting function should be readonly.); There are legitimate use cases for copying the snapshot of a read-write transaction. Consider a task like calculating some summaries for insert into a table. To speed this up, you might notionally partition the source data and have each of several workers calculate each partition. I'd vote for removing this warning entirely. Warning removed, adding this fact to the documentation only is fine with me. InvalidBackendId is -1. Is InvalidOid (0) in fact also invalid as a backend ID? Yes, there is a check in utils/init/postinit.c that makes sure that 0 is never a valid backendId. But anyway, I have now made this more explicit by adding a dedicated parse routine for ints. + while ((xid = parseXactFromText(s, xip:)) != InvalidTransactionId) + { + if (xid == GetTopTransactionIdIfAny()) + continue; + snapshot->xip[i++] = xid; This works on a virgin GetSnapshotData() snapshot, which always has enough space (sized according to max_connections). A snapshot from CopySnapshot(), however, has just enough space for the original usage. I get a SIGSEGV at this line with the attached test script. 
If I change things around so xip starts non-NULL, I get messages like these as the code scribbles past the end of the allocation: This has been fixed. xip/subxip are now allocated separately if necessary. Though unlikely, the other backend may not even exist by now. Indeed, that could have happened and RecentGlobalXmin advanced since we compared the checksum. Not sure what the right locking is here, but something is needed. Good catch. What I have done now is a second check at the end of ImportSnapshot(). At that point we have set up the new snapshot and adjusted MyProc-xmin. If we succeed, then we are fine. If not we throw an error and invalidate the whole transaction. + * Instead we must check to not go forward (we might have opened a cursor + * in this transaction and still have its snapshot registered) + */ I'm thinking this case should yield an ERROR. Otherwise, our net result would be to silently adopt a snapshot that is neither our old self nor the target. Maybe there's a use case for that, but none comes to my mind. This can happen when you do: DECLARE foo CURSOR FOR SELECT * FROM bar; import snapshot... FETCH 1 FROM foo; I guess the use case for pg_import_snapshot from READ COMMITTED would be something like DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;. First off, would that work (stuff use the imported snapshot)? If it does work, we should take the precedent of SET LOCAL and issue no warning. If it doesn't work, then we should document that the snapshot does take effect until the next command and probably keep this warning or something similar. No, this will also give you a new snapshot for every query, no matter if it is executed within or outside of a DO-Block. 
+ Datum + pg_import_snapshot(PG_FUNCTION_ARGS) + { + bytea *snapshotData = PG_GETARG_BYTEA_P(0); + bool ret = true; + + if (ActiveSnapshotSet()) + ret = ImportSnapshot(snapshotData, GetActiveSnapshot()); + + ret = ImportSnapshot(snapshotData, GetTransactionSnapshot()); Is it valid to scribble directly on snapshots like this? I figured that previously executed code still has references pointing to the snapshots and so we cannot just push a new snapshot on top but really need to change the memory where they are pointing to. I am also adding a test script that shows the difference of READ COMMITTED and SERIALIZABLE in an importing transaction in a DO-Block. It's based on the script you sent. So thanks again and I'm looking forward to your next review... :-) Joachim diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 2dbac56..c96942a 100644 *** a/src/backend/storage/ipc/ipci.c --- b/src/backend/storage/ipc/ipci.c *** CreateSharedMemoryAndSemaphores(bool mak *** 124,129 --- 124,130 size = add_size(size, BTreeShmemSize()); size = add_size(size, SyncScanShmemSize());
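The parsing loop Noah flagged above (walk the textual snapshot, skip the importer's own xid, collect the rest into xip) can be sketched with a dynamically sized list, which is also the gist of the fix: size xip/subxip to fit the input instead of reusing a buffer sized for the original snapshot. The field syntax below mirrors the xmin:/xmax:/xip: form visible in the quoted code; the function name is illustrative:

```python
def parse_snapshot(text, own_xid=None):
    """Parse a textual snapshot, skipping the importing backend's own xid.

    Collecting xip entries into a growable list corresponds to allocating
    the xip array for the input, rather than writing past the end of a
    fixed-size buffer (the CopySnapshot() SIGSEGV discussed above).
    """
    snap = {"xmin": None, "xmax": None, "xip": []}
    for field in text.split():
        key, _, value = field.partition(":")
        if key in ("xmin", "xmax"):
            snap[key] = int(value)
        elif key == "xip" and int(value) != own_xid:
            snap["xip"].append(int(value))
    return snap

snap = parse_snapshot("xmin:100 xmax:110 xip:101 xip:104 xip:108", own_xid=104)
assert snap["xmin"] == 100 and snap["xmax"] == 110
assert snap["xip"] == [101, 108]    # own xid 104 is taken out of the array
```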
Re: [HACKERS] Snapshot synchronization, again...
On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland j...@mcknight.de wrote: What I am proposing now is the following: We return snapshot information as a chunk of data to the client. At the same time however, we set a checksum in shared memory to protect against modification of the snapshot. A publishing backend can revoke its snapshot by deleting the checksum and a backend that is asked to install a snapshot can verify that the snapshot is correct and current by calculating the checksum and comparing it with the one in shared memory. So here's the patch implementing this idea. I named the user visible functions pg_export_snapshot() and pg_import_snapshot(). A user starts a transaction and calls pg_export_snapshot() to get a chunk of bytea data. In a different connection he opens a transaction in isolation level serializable and passes that chunk of data into pg_import_snapshot(). Now subsequent queries of the second connection see the snapshot that was current when pg_export_snapshot() has been executed. In case both transactions are in isolation level serializable then both see the same data from this moment on (this is the case of pg_dump). There are most probably a few loose ends and someone who is more familiar to this area than me needs to look into it but at least everybody should be happy now with the overall approach. 
These are the implementation details and restrictions of the patch:

The exporting transaction
- should be read-only (to discourage people from expecting that writes of the exporting transaction can be seen by the importing transaction)
- must not be a subtransaction (we don't add subxips of our own transaction to the snapshot, so importing the snapshot later would result in missing subxips)
- adds its own xid (if any) to the xip array

The importing transaction
- will not import a snapshot of the same backend (even though it would probably work)
- will not import a snapshot of a different database in the cluster
- should be isolation level serializable
- must not be a subtransaction (can we completely roll back on subxact rollback?)
- leaves curcid as is; otherwise effects of previous commands would get lost and magically appear later when curcid increases
- applies xmin, xmax, xip, subxip values of the exported snapshot to GetActiveSnapshot() and GetTransactionSnapshot()
- takes itself out of the xip array
- updates MyProc->xmin, but only moves it backwards, never forwards

The snapshot is invalidated on transaction commit/rollback as well as when doing PREPARE TRANSACTION. Comments welcome.
Joachim

diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 95beba8..c24150f 100644
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
*** CreateSharedMemoryAndSemaphores(bool mak
*** 124,129 ****
--- 124,130 ----
      size = add_size(size, BTreeShmemSize());
      size = add_size(size, SyncScanShmemSize());
      size = add_size(size, AsyncShmemSize());
+     size = add_size(size, SyncSnapshotShmemSize());
  #ifdef EXEC_BACKEND
      size = add_size(size, ShmemBackendArraySize());
  #endif
*** CreateSharedMemoryAndSemaphores(bool mak
*** 228,233 ****
--- 229,235 ----
      BTreeShmemInit();
      SyncScanShmemInit();
      AsyncShmemInit();
+     SyncSnapshotInit();
  #ifdef EXEC_BACKEND

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 980996e..e851bcd 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*** 50,60 ****
--- 50,62 ----
  #include "access/transam.h"
  #include "access/xact.h"
  #include "access/twophase.h"
+ #include "libpq/md5.h"
  #include "miscadmin.h"
  #include "storage/procarray.h"
  #include "storage/spin.h"
  #include "storage/standby.h"
  #include "utils/builtins.h"
+ #include "utils/bytea.h"
  #include "utils/snapmgr.h"
*** static int KnownAssignedXidsGetAndSetXmi
*** 159,164 ****
--- 161,170 ----
  static TransactionId KnownAssignedXidsGetOldestXmin(void);
  static void KnownAssignedXidsDisplay(int trace_level);
+ typedef char snapshotChksum[16];
+ static snapshotChksum *syncSnapshotChksums;
+ static Snapshot exportedSnapshot;
  /*
   * Report shared-memory space needed by CreateSharedProcArray.
   */
*** KnownAssignedXidsDisplay(int trace_level
*** 3065,3067 ****
--- 3071,3335 ----
      pfree(buf.data);
  }
+
+ /*
+  * Report space needed for our shared memory area, which is basically an
+  * md5 checksum per connection.
+  */
+ Size
+ SyncSnapshotShmemSize(void)
+ {
+     return PROCARRAY_MAXPROCS * sizeof(snapshotChksum);
+ }
+
+ void
+ SyncSnapshotInit(void)
+ {
+     Size    size;
+     bool    found;
+     int     i;
+
+     size = SyncSnapshotShmemSize();
+
+     syncSnapshotChksums = (snapshotChksum *) ShmemInitStruct("SyncSnapshotChksums",
+                                                              size, &found);
+     if (!found)
+         for (i = 0; i
[attachment truncated]
Re: [HACKERS] Snapshot synchronization, again...
On Fri, Dec 31, 2010 at 8:28 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: A backend can have any number of snapshots registered, and those don't allow GlobalXmin to advance. Consider an open cursor, for example. Even if the rest of the transaction is read committed, the snapshot registered by the open cursor still holds back GlobalXmin. My (handwavy) idea is that whenever the transaction calls pg_publish_snapshot(), said snapshot is registered, which makes it safe to use even if the transaction continues to operate and obtain newer snapshots. Cool, even better that this is taken care of already :-) Thanks, Joachim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Snapshot synchronization, again...
The snapshot synchronization discussion from the parallel pg_dump patch somehow ended without a clear way to go forward. Let me sum up what has been brought up and propose a short- and long-term solution.

Summary: Passing snapshot sync information can be done either:

a) by returning complete snapshot information from the backend to the client so that the client can pass it along to a different backend, or

b) by returning only a unique identifier to the client and storing the actual snapshot data somewhere on the server side.

Advantage of a: No memory is used in the backend and no memory needs to get cleaned up. It is also theoretically possible that we could forward that data to a hot standby server and do e.g. a dump partially on the master server and partially on the hot standby server, or among several hot standby servers.

Disadvantage of a: The snapshot must be validated to make sure that its information is still current, and it might be difficult to cover all cases of this validation. A client can access not only exactly a published snapshot, but just about any snapshot that fits and passes the validation checks (this is more a disadvantage than an advantage because it allows one to see a database state that never existed in reality).

Advantage of b: No validation is necessary; as soon as the transaction that publishes the snapshot loses that snapshot, it will also revoke the snapshot information (either by removing a temp file or deleting it from shared memory).

Disadvantage of b: It doesn't allow a snapshot to be installed on a different server. It requires an open serializable transaction to hold the snapshot.

What I am proposing now is the following: We return snapshot information as a chunk of data to the client. At the same time however, we set a checksum in shared memory to protect against modification of the snapshot.
A publishing backend can revoke its snapshot by deleting the checksum, and a backend that is asked to install a snapshot can verify that the snapshot is correct and current by calculating the checksum and comparing it with the one in shared memory.

This only costs us a few bytes (one checksum times max_connections) in shared memory, and apart from resetting the checksum it has no cleanup and verification issues. Note that we are also free to change the internal format of the chunk of data we return whenever we like, so we are free to enhance this feature in the future, transparently to the client.

Thoughts?

Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Thu, Dec 30, 2010 at 9:40 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: Disadvantage of b: It doesn't allow a snapshot to be installed on a different server. It requires a serializable open transaction to hold the snapshot. Why does it require a serializable transaction? You could simply register the snapshot in any transaction. (Of course, the net effect would be pretty similar to a serializable transaction).

I am not assuming that the publishing transaction blocks until its snapshot is picked up. A read committed transaction would get a new snapshot for each new query, so the published snapshot is no longer represented by an actual backend until it is picked up by one. Since nobody is holding back xmin/GlobalXmin, eventually vacuum would remove tuples that the published-but-not-yet-picked-up snapshot should still be able to see, no?

Joachim
Re: [HACKERS] Snapshot synchronization, again...
On Thu, Dec 30, 2010 at 9:49 AM, Florian Pflug f...@phlo.org wrote: On Dec30, 2010, at 13:31 , Joachim Wieland wrote: We return snapshot information as a chunk of data to the client. At the same time however, we set a checksum in shared memory to protect against modification of the snapshot. A publishing backend can revoke its snapshot by deleting the checksum and a backend that is asked to install a snapshot can verify that the snapshot is correct and current by calculating the checksum and comparing it with the one in shared memory. We'd still have to stream these checksums to the standbys though, or would they be exempt from the checksum checks?

I am not talking about having synchronized snapshots among standby servers at all. I am only proposing a client API that will work for this future idea as well.

I still wonder whether these checks are worth the complexity. I believe we'd only allow snapshot modifications for read-only queries anyway, so what point is there in preventing clients from setting broken snapshots? What's the use case for it?

Unless somebody comes up with a reasonable use case for it, let's aim for the robust version.

Joachim
Re: [HACKERS] page compression
On Tue, Dec 28, 2010 at 10:10 AM, Andy Colson a...@squeakycode.net wrote: I know it's been discussed before, and one big problem is license and patent problems. Would this project be a problem: http://oldhome.schmorp.de/marc/liblzf.html

It looks like even liblzf is not going to be accepted. I proposed linking pg_dump against liblzf only if it is available, and somehow failed; see: http://archives.postgresql.org/pgsql-hackers/2010-11/msg00824.php

Remember that PostgreSQL has toast tables to compress large values and store them externally, so it still has to be proven that page compression has the same benefit for PostgreSQL as for other databases. Ironically, we also use an LZ compression algorithm for toast compression (defined in pg_lzcompress.c). I am still failing to understand why linking against liblzf would bring us deeper into the compression patents minefield than we already are by hardwiring and shipping this other algorithm in pg_lzcompress.c.

Joachim
Re: [HACKERS] directory archive format for pg_dump
On Thu, Dec 16, 2010 at 12:48 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: As soon as we have parallel pg_dump, the next big thing is going to be parallel dump of the same table using multiple processes. Perhaps we should prepare for that in the directory archive format, by allowing the data of a single table to be split into multiple files. That way parallel pg_dump is simple, you just split the table in chunks of roughly the same size, say 10GB each, and launch a process for each chunk, writing to a separate file.

How exactly would you "just split the table in chunks of roughly the same size"? Which queries should pg_dump send to the backend? If it just sends a bunch of WHERE queries, the server would still scan the same data several times, since each pg_dump client would trigger a seqscan over the full table. Ideally pg_dump should be able to query for all data in only one relation segment, so that each segment is scanned by only one backend process. However, this requires backend support, and we would be sending queries that we'd not want clients other than pg_dump to send... If you were thinking about WHERE queries to get equally sized partitions, how would we deal with unindexed and/or non-numerical data in a large table?

Joachim
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Tue, Dec 7, 2010 at 3:42 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Ok, I've committed this patch now. I can confirm that I could continue replaying the logfiles on the standby host with this patch. Thanks a lot, Joachim
Re: [HACKERS] directory archive format for pg_dump
On Thu, Dec 2, 2010 at 2:52 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Ok, committed, with some small cleanup since the last patch I posted. Could you update the directory-format patch on top of the committed version, please? Thanks for committing the first part. Here is the updated and rebased directory-format patch. Joachim

[attachment: pg_dump-directory-rebased.diff.gz (GNU Zip compressed data)]
Re: [HACKERS] WIP patch for parallel pg_dump
On Sun, Dec 5, 2010 at 9:27 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Dec 5, 2010 at 9:04 PM, Andrew Dunstan and...@dunslane.net wrote: Why not just say give me the snapshot currently held by process ? And please, not temp files if possible. As far as I'm aware, the full snapshot doesn't normally exist in shared memory, hence the need for publication of some sort. We could dedicate a shared memory region for publication but then you have to decide how many slots to allocate, and any number you pick will be too many for some people and not enough for others, not to mention that shared memory is a fairly precious resource.

So here is a patch that I have been playing with in the past; I did it a while back, and thanks go to Koichi Suzuki for his helpful comments. I have not published it earlier because I haven't worked on it recently, and from the discussion that I brought up in March I got the feeling that people are fine with having a first version of parallel dump without synchronized snapshots. I am not really sure that what the patch does is sufficient, nor that it does it in the right way, but I hope that it can serve as a basis to collect ideas (and doubts).

My idea is pretty similar to Robert's about publishing snapshots and subscribing to them; the patch even uses these words. Basically the idea is that a transaction in isolation level serializable can publish a snapshot, and as long as this transaction is alive, its snapshot can be adopted by other transactions. Requiring the publishing transaction to be serializable guarantees that the copy of the snapshot in shared memory is always current. When the transaction ends, the copy of the snapshot is also invalidated and cannot be adopted anymore. So instead of doing explicit checks, the patch aims at always having a reference transaction around that guarantees the validity of the snapshot information in shared memory.
The patch currently creates a new area in shared memory to store snapshot information, but we can certainly discuss this... I had a GUC in mind that can control the number of available slots, similar to max_prepared_transactions. Snapshot information can become quite large, especially with a high number of max_connections.

Known limitations: the patch is completely lacking awareness of prepared transactions and doesn't check if both backends belong to the same user.

Joachim

diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 95beba8..c24150f 100644
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
*** CreateSharedMemoryAndSemaphores(bool mak
*** 124,129 ****
--- 124,130 ----
      size = add_size(size, BTreeShmemSize());
      size = add_size(size, SyncScanShmemSize());
      size = add_size(size, AsyncShmemSize());
+     size = add_size(size, SyncSnapshotShmemSize());
  #ifdef EXEC_BACKEND
      size = add_size(size, ShmemBackendArraySize());
  #endif
*** CreateSharedMemoryAndSemaphores(bool mak
*** 228,233 ****
--- 229,235 ----
      BTreeShmemInit();
      SyncScanShmemInit();
      AsyncShmemInit();
+     SyncSnapshotInit();
  #ifdef EXEC_BACKEND

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 6e7a6db..00522fb 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*** typedef struct ProcArrayStruct
*** 91,96 ****
--- 91,111 ----
  static ProcArrayStruct *procArray;

+ /* this should be a GUC later... */
+ #define MAX_SYNC_SNAPSHOT_SETS 4
+ typedef struct
+ {
+     SnapshotData ssd;
+     char         name[NAMEDATALEN];
+     BackendId    backendId;
+     Oid          databaseId;
+ } NamedSnapshotData;
+
+ typedef NamedSnapshotData *NamedSnapshot;
+
+ static NamedSnapshot syncSnapshots;

  /*
   * Bookkeeping for tracking emulated transactions in recovery
   */
*** static int KnownAssignedXidsGetAndSetXmi
*** 159,164 ****
--- 174,182 ----
  static TransactionId KnownAssignedXidsGetOldestXmin(void);
  static void KnownAssignedXidsDisplay(int trace_level);
+ static bool DeleteSyncSnapshot(const char *name);
+ static bool snapshotPublished = false; /* true if we have published at least one snapshot */
*** ProcArrayRemove(PGPROC *proc, Transactio
*** 350,355 ****
--- 368,379 ----
  void
  ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
  {
+     if (snapshotPublished)
+     {
+         DeleteSyncSnapshot(NULL);
+         snapshotPublished = false;
+     }
+
      if (TransactionIdIsValid(latestXid))
      {
          /*
*** KnownAssignedXidsDisplay(int trace_level
*** 3104,3106 ****
--- 3132,3374 ----
      pfree(buf.data);
  }
+
+ /*
+  * Report space needed for our shared memory area.
+  *
+  * Memory is structured as follows:
+  *
+  * NamedSnapshotData[0]
+  * NamedSnapshotData[1]
+  * NamedSnapshotData[2]
+  * Xids for NamedSnapshotData[0]
+  *
[attachment truncated]
Re: [HACKERS] WIP patch for parallel pg_dump
On Thu, Dec 2, 2010 at 6:19 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I don't see the point of the sort-by-relpages code. The order the objects are dumped should be irrelevant, as long as you obey the restrictions dictated by dependencies. Or is it only needed for the multiple-target-dirs feature? Frankly I don't see the point of that, so it would be good to cull it out at least in this first stage.

A guy called Dimitri Fontaine actually proposed the several-directories feature here, and other people liked the idea. http://archives.postgresql.org/pgsql-hackers/2008-02/msg01061.php :-) The code doesn't change much with or without it, and if people are no longer in favour of it, I have no problem with taking it out. As Dimitri has already pointed out, the relpages sorting thing is there to start with the largest table(s) first.

Joachim
Re: [HACKERS] WIP patch for parallel pg_dump
On Thu, Dec 2, 2010 at 12:56 PM, Josh Berkus j...@agliodbs.com wrote: Now, if only I could think of some way to write a parallel dump to a set of pipes, I'd be in heaven. What exactly are you trying to accomplish with the pipes? Joachim
Re: [HACKERS] WIP patch for parallel pg_dump
On Thu, Dec 2, 2010 at 9:33 PM, Tom Lane t...@sss.pgh.pa.us wrote: In particular, this issue *has* been discussed before, and there was a consensus that preserving dump consistency was a requirement. I don't think that Joachim gets to bypass that decision just by submitting a patch that ignores it. I am not trying to bypass anything here :) Regarding the locking issue I probably haven't done sufficient research, at least I managed to miss the emails that mentioned it. Anyway, that seems to be solved now fortunately, I'm going to implement your idea over the weekend. Regarding snapshot cloning and dump consistency, I brought this up already several months ago and asked if the feature is considered useful even without snapshot cloning. And actually it was you who motivated me to work on it even without having snapshot consistency... http://archives.postgresql.org/pgsql-hackers/2010-03/msg01181.php In my patch pg_dump emits a warning when called with -j, if you feel better with an extra option --i-know-that-i-have-no-synchronized-snapshots, fine with me :-) In the end we provide a tool with limitations, it might not serve all use cases but there are use cases that would benefit a lot. I personally think this is better than to provide no tool at all... Joachim
Re: [HACKERS] directory archive format for pg_dump
On Wed, Dec 1, 2010 at 9:05 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Forgot attachment. This is also available in the above git repo. I have quickly checked your modifications, on the one hand I like the reduction of functions, I would have said that we have AH around all the time and so we could just allocate once and stuff it all into ctx-cs and reuse the buffers for every object, but re-allocating them for every (dumpable) object should be fine as well. Regarding the function pointers that you removed, you are now putting back in what Dimitri wanted me to take out, namely switch/case instructions for the algorithms and then #ifdefs for every algorithm. It's not too many now since we have taken out LZF. Well, I can live with both ways. There is one thing however that I am not in favor of, which is the removal of the sizeHint parameter for the read functions. The reason for this parameter is not very clear now without LZF but I have tried to put in a few comments to explain the situation (which you have taken out as well :-) ). The point is that zlib is a stream based compression algorithm, you just stuff data in and from time to time you get data out and in the end you explicitly flush the compressor. The read function can just return as many bytes as it wants and we can just hand it all over to zlib. Other compression algorithms however are block based and first write a block header that contains the information on the next data block, including uncompressed and compressed sizes. Now with the sizeHint parameter I used, the compressor could tell the read function that it just wants to read the fixed size header (6 bytes IIRC). In the header it would look up the compressed size for the next block and would then ask the read function to get exactly this amount of data, decompress it and go on with the next block, and so forth... 
Of course you can possibly do that memory management inside the compressor with an extra buffer holding what you got in excess but it's a pain. If you removed that part on purpose on the grounds that there is no block based compression algorithm in core and probably never will be, then that's okay :-) Joachim
Re: [HACKERS] Suggested easy TODO: pg_dump --from-list
On Wed, Nov 24, 2010 at 1:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: Nope ... those strings are just helpful comments, they aren't really guaranteed to be unique identifiers. In any case, it seems unlikely that a user could expect to get the more complicated cases exactly right other than by consulting pg_dump | pg_restore -l output. Which makes the use-case kind of dubious to me. In which case would the catalogId, i.e. (tableoid, oid) not be unique? Or do you rather mean that it does not necessarily refer to the same object if that object got somehow recreated or that it could be different on different installations of the same database? Thanks, Joachim
Re: [HACKERS] Suggested easy TODO: pg_dump --from-list
On Wed, Nov 24, 2010 at 9:38 AM, Andrew Dunstan and...@dunslane.net wrote: It would be unique, but a pain in the neck for users to get. Robert's idea will have more traction with users.

Whatever approach we use, we need to think about the use case where 1% of the objects should be dumped, but we should also make sure that you can more or less easily dump 99% of the objects. Robert's use case is the 1% use case. Especially for the 99% case however, pg_dump needs to create a full list of all available objects, in whatever format, as a proposal, so that you could just edit this list and delete what you don't want instead of writing such a list from scratch.

Joachim
Re: [HACKERS] Suggested easy TODO: pg_dump --from-list
On Tue, Nov 23, 2010 at 10:24 PM, Andrew Dunstan and...@dunslane.net wrote: Well, very little about pg_dump is very [E], IMNSHO. The question in my mind here is what format the list file will take. For example, how would we specify a function? Would we need to specify all the argument types (or at least the IN arguments)? It's not as easy as a list with pg_restore, which is just a list of TOC ids, and all the rest is just a comment in the list file. I certainly don't think we should put this on the list without at least having the idea fleshed out some more. I think the list should be generated by pg_dump itself in a first run, by building a complete TOC and then dumping a pg_restore -l like list format (without dumpIds) where the user just deletes the objects that he doesn't want to get dumped. The list wouldn't contain dumpIds, but catalogIds and those should be sufficiently unique and easy to parse and compare. Joachim
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 19.11.2010 23:46, Joachim Wieland wrote: FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, pArray->maxKnownAssignedXids: 6890 Hmm, that's a lot of entries in KnownAssignedXids. Can you recompile with WAL_DEBUG, and run the recovery again with wal_debug=on? That will print all the replayed WAL records, which is a lot of data, but it might give a hint what's going on.

Sure, but this gives me only one more line:

[...]
LOG: redo starts at 1F8/FC00E978
LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid 385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid 3829898/23
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 4587) exited with exit code 1
LOG: terminating any other active server processes

Joachim
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Sun, Nov 21, 2010 at 11:48 PM, Fujii Masao masao.fu...@gmail.com wrote: If you suspect a bug in Hot Standby, please set trace_recovery_messages = DEBUG2 in postgresql.conf and repeat the action. Always useful to know:
* max_connections
* current number of sessions
* whether we have two phase commits happening

The trace_recovery_messages parameter does not give more output. max_connections is set to 100. There have been no sessions on the standby itself, but a few on the primary database; I don't know how many, but probably not more than 10. The sessions there were doing quite some load however, among them Slony synchronization, since the hot standby master database was actually a Slony replica. max_prepared_transactions has not been changed from its default value of 0.

Joachim
[HACKERS] Hot Standby: too many KnownAssignedXids
Hi, I am seeing the following here on 9.0.1 on Linux x86-64:

LOG: redo starts at 1F8/FC00E978
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23

and this is the complete history: postgres was running as HS in foreground, Ctrl-C'ed it for a restart.

LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating walreceiver process due to administrator command
FATAL: terminating connection due to administrator command
LOG: shutting down
LOG: database system is shut down

Started it up again:

$ postgres -D /db/
LOG: database system was shut down in recovery at 2010-11-19 14:36:30 EST
LOG: entering standby mode
cp: cannot stat `/archive/000101F90001': No such file or directory
cp: cannot stat `/archive/000101F800FC': No such file or directory
LOG: redo starts at 1F8/FC00E978
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 30052) exited with exit code 1
LOG: terminating any other active server processes

(copied the log files over...)

./postgres -D /db/
LOG: database system was interrupted while in recovery at log time 2010-11-19 14:36:12 EST
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: restored log file 000101F90001 from archive
LOG: restored log file 000101F800FC from archive
LOG: redo starts at 1F8/FC00E978
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 31581) exited with exit code 1
LOG: terminating any other active server processes

Changing the line in the source code to give some more output gives me:

FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, pArray->maxKnownAssignedXids: 6890

I still have the server, if you want me to debug anything or send a patch against 9.0.1 that gives more output, just let me know.
Joachim
Re: [HACKERS] directory archive format for pg_dump
Hi Dimitri, thanks for reviewing my patch!

On Fri, Nov 19, 2010 at 2:44 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: I think I'd like to see a separate patch for the new compression support. Sorry about that, I realize that's extra work…

I guess it wouldn't be a very big deal, but I also doubt that it makes the review that much easier. Basically the compression refactoring patch would just touch pg_backup_custom.c (because that is where the libz compression is currently buried) and the two new compress_io.(c|h) files. Everything else is pretty much the directory stuff and sits on top of these changes.

And it could be about personal preferences, but the way you added the liblzf support strikes me as odd, with all those #ifdefs everywhere. Is it possible to have a specific file for each supported compression format, then some routing code in src/bin/pg_dump/compress_io.c?

Sure we could. But I wanted to wait with any fancy function pointer stuff until we have decided if we want to include the liblzf support at all. The #ifdefs might be a bit ugly, but in case we do not include liblzf support, it's the easiest way to take it out again. As written in my introduction, this patch is not really about liblzf; liblzf is just a proof of concept for factoring out the compression part, and I have included it so that people can use it and see how much speed improvement they get.

The routing code already exists but then the file is full of #ifdef sections to define the right supporting function when I think having compress_io_zlib and compress_io_lzf files would be better. Sure! I completely agree...

Then there's the bulk of the new dump format feature in the other part of the patch, namely src/bin/pg_dump/pg_backup_directory.c.
You have to update the copyright in the file header there, at least :) Well, not sure if we can just change the copyright notice, because in the end the structure was copied from one of the other files which all have the copyright notice in them, so my work is based on those other files... I'm hesitant as far as marking the patch Waiting on author to get it split. Joachim, what do you think? I will see if I can split it. Joachim
Re: [HACKERS] directory archive format for pg_dump
On Fri, Nov 19, 2010 at 11:53 PM, Tom Lane t...@sss.pgh.pa.us wrote: Dimitri Fontaine dimi...@2ndquadrant.fr writes: I think I'd like to see a separate patch for the new compression support. Sorry about that, I realize that's extra work… That part of the patch is likely to get rejected outright anyway, so I *strongly* recommend splitting it out. We have generally resisted adding random compression algorithms to pg_dump because of license and patent considerations, and I see no reason to suppose this one is going to pass muster. I was already anticipating that possibility and my initial patch description is along these lines. However, liblzf is BSD licensed so on the license side we should be fine. Regarding patents, your last comment was that you'd like to see if it's really worth it and so I have included support for lzf for anybody to go ahead and find that out. Will send an updated split-up patch this weekend (which would actually be four patches already...). Joachim
Re: [HACKERS] directory archive format for pg_dump
Hi Jose, 2010/11/19 José Arthur Benetasso Villanova jose.art...@gmail.com: The dir format generated in my database 60 files, with different sizes, and it looks very confusing. Is it possible to use the same trick as pigz and pbzip2, creating a concatenated file of streams? What pigz is parallelizing is the actual computation of the compressed data. The directory archive format however is a preparation for a parallel pg_dump, dumping several tables (especially large tables of course) in parallel via multiple database connections and multiple pg_dump frontends. The idea of multiplexing their output into one file has been rejected on the grounds that it would probably slow down the whole process. Nevertheless pigz could be implemented as an alternative compression algorithm and that way the custom and the directory archive format could use it, but here as well, license and patent questions might be in the way, even though it is based on libz. The md5.c and kwlookup.c reuse using a link doesn't look nice either. This way you need to compile twice, among other things, but I think that it's temporary, right? No, it isn't. md5.c is used in the same way by e.g. libpq and there are other examples for links in core, check out src/bin/psql for example. Joachim
Re: [HACKERS] host name support in pg_hba.conf
On Tue, Oct 5, 2010 at 3:41 PM, Peter Eisentraut pete...@gmx.net wrote: The reason why I think this is semi-important and not just cosmetic is that (for some reason that is not entirely clear to me) clients connecting to localhost end up appearing to the server as ::1 on a lot of machines I use which are not otherwise keen on IPv6, and it is a common mistake to forget to keep the pg_hba.conf entries for 127.0.0.1 and ::1 in sync. This is exactly what I am seeing here. However, contrary to your case the patch makes it even worse on my side. With the patch compiled in and a pg_hba.conf entry of "localhost", I cannot connect anymore with -h localhost; I get "no pg_hba.conf entry for host ::1". This is a mostly standard Ubuntu setup. Joachim
Re: [HACKERS] Patch for PKST timezone
On Wed, May 12, 2010 at 12:39 AM, Tom Lane t...@sss.pgh.pa.us wrote: Joachim Wieland j...@mcknight.de writes: Good we have found that inconsistency now, so let's add PKST. OK, done. BTW, I notice that PKT was labeled (not in zic), which is not the case, per this discussion. I seem to recall having noticed some others that seemed to be mislabeled the same way. What process did you use to compare this list to the zic files, and do we need to revisit it? I have used a modified version of zic.c that outputs the data while generating the binary timezone files. Generating the timezone offset files from that then included some scripts and some manual work. It seems that we should have an automated process for that, at least for checking against our current set. I'll see if I can come up with that. PKST for example was valid only for a single year in the past but in the newer timezone data it is now valid forever. Ideally we can run a script that tells us about such changes whenever we bring in new timezone data. Joachim
Re: [HACKERS] Patch for PKST timezone
On Tue, May 11, 2010 at 10:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I don't think we want to include all timezone names in the default config, timezone abbreviations aren't always unique for example. But we should include PKST because we already include PKT; it would be nasty for an application to work during winter, and stop working when the daylight saving time begins. I agree with everything Heikki says. As the original author of the timezone files I guess that I am to blame for not having included PKST in the first place. However I have the excuse that it was like that already back when the timezone information was still hardcoded, we had pkt but not pkst: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/datetime.c?rev=1.168;content-type=text%2Fx-cvsweb-markup Good we have found that inconsistency now, so let's add PKST. I also agree with not adding more than necessary to the default config set. Given that so few people complain about it we seem to have a good working default set already. Joachim
Re: [HACKERS] a faster compression algorithm for pg_dump
On Fri, Apr 9, 2010 at 5:51 AM, Greg Stark gsst...@mit.edu wrote: Linking against as an option isn't nearly as bad since the user compiling it can choose whether to include the restricted feature or not. That's what we do with readline. However it's not nearly as attractive when it restricts what file formats Postgres supports -- it means someone might generate backup dump files that they later discover they don't have a legal right to read and restore :( If we only linked against it, we'd leave it up to the user to weigh the risk as long as we are not aware of any such violation. Our top priority is to make sure that the project would not be harmed if one day such a patent showed up. If I understood you correctly, this is not an issue even if we included lzf, and even less so if we only link against it. The rest is about user education, and since we'd use lzf only in pg_dump and not for toasting, we could show a message in pg_dump if lzf is chosen to make the user aware of the possible issues. If we still cannot do this, then what I am asking is: what does the project need to be able to at least link against such a compression algorithm? Is it a list of 10, 20, 50 or more other projects using it, or is it a lawyer saying "there is no patent"? But then, how can we be sure that the lawyer is right? Or couldn't we include it even if we had both, because again, we couldn't be sure... ? Joachim
[HACKERS] a faster compression algorithm for pg_dump
I'd like to revive the discussion about offering another compression algorithm than zlib for at least pg_dump. There has been a previous discussion here: http://archives.postgresql.org/pgsql-performance/2009-08/msg00053.php and it ended without any real result. The results so far were:

- There exist BSD-licensed compression algorithms
- Nobody knows a patent that is in our way
- Nobody can confirm that no patent is in our way

I do see a very real demand for replacing zlib, which compresses quite well but is slow as hell. For pg_dump what people want is cheap compression; they usually prefer an algorithm that compresses less optimally but is really fast. One question that I do not yet see answered is: do we risk violating a patent even if we just link against a compression library, for example liblzf, without shipping the actual code? I have checked what other projects do, especially about liblzf which would be my favorite choice (BSD license, available for quite some time...) and there are other projects that actually ship the lzf code (I haven't found a project that just links to it). The most prominent projects are:

- KOffice (implements a derived version in koffice-2.1.2/libs/store/KoXmlReader.cpp)
- Virtual Box (ships it in vbox-ose-1.3.8/src/libs/liblzf-1.51)
- TuxOnIce (formerly known as suspend2 - linux kernel patch, ships it in the patch)

We have pg_lzcompress.c which implements the compression routines for the tuple toaster. Are we sure that we don't violate any patents with this algorithm? Joachim
Re: [HACKERS] message clarifications
On Sat, Apr 3, 2010 at 9:02 PM, Robert Haas robertmh...@gmail.com wrote: On Apr 3, 2010, at 11:13 AM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Eisentraut pete...@gmx.net writes: The following messages from the postgres catalog either appear to be internal errors that should be marked differently, or they are in my estimation unintelligible to users and should be rephrased. #: commands/async.c:1424 msgid "pg_notify queue is %.0f%% full" This one is probably my responsibility (the others all look like Simon's code). What do you not like about it, exactly? Perhaps it should be "NOTIFY queue is x% full"? I think maybe the question is why the user should care, or what they are expected to do about it? The user/administrator should make sure that all backends work through the list of pending notifications. He does this by making sure that there are no long-running or idle-in-transaction backends. Actually there is more information given via errdetail and errhint:

    ereport(WARNING,
            (errmsg("pg_notify queue is %.0f%% full", fillDegree * 100),
             (minPid != InvalidPid ?
              errdetail("PID %d is among the slowest backends.", minPid) : 0),
             (minPid != InvalidPid ?
              errhint("Cleanup can only proceed if this backend ends its current transaction.") : 0)));

Peter, if you consider the additional information given here, do you still see an issue? Joachim
Re: [HACKERS] five-key syscaches
On Mon, Mar 29, 2010 at 12:32 AM, Robert Haas robertmh...@gmail.com wrote: Per previous discussion, PFA a patch to change the maximum number of keys for a syscache from 4 to 5. http://archives.postgresql.org/pgsql-hackers/2010-02/msg01105.php This is intended for application to 9.1, and is supporting infrastructure for knngist. It looks like there should be a 5 rather than a 4 for nkeys of SearchSysCacheList():

+#define SearchSysCacheList5(cacheId, key1, key2, key3, key4, key5) \
+	SearchSysCacheList(cacheId, 4, key1, key2, key3, key4, key5)

Joachim
[HACKERS] Parallel pg_dump for 9.1
People have been talking about a parallel version of pg_dump a few times already. I have been working on some proof-of-concept code for this feature every now and then and I am planning to contribute this for 9.1. There are two main issues with a parallel version of pg_dump: the first one is that it requires a consistent snapshot among multiple pg_dump clients, and the second is that currently the output goes to a single file and it is unclear what to do about multiple processes writing into a single file.

- There are ideas on how to solve the issue with the consistent snapshot but in the end you can always solve it by stopping your application(s). I actually assume that whenever people are interested in a very fast dump, it is because they are doing some maintenance task (like migrating to a different server) that involves pg_dump. In these cases, they would stop their system anyway. Even if we had consistent snapshots in a future version, would we forbid people to run parallel dumps against old server versions? What I suggest is to just display a big warning if run against a server without consistent snapshot support (which currently is every version).

- Regarding the output of pg_dump I am proposing two solutions. The first one is to introduce a new archive type "directory" where each table and each blob is a file in a directory, similar to the experimental "files" archive type. Also the idea has come up that you should be able to specify multiple directories in order to make use of several physical disk drives. Thinking this further, in order to manage all the mess that you can create with this, every file of the same backup needs to have a unique identifier and pg_restore should have a check parameter that tells you if your backup directory is in a sane and complete state (think about moving a file from one backup directory to another one or trying to restore from two directories which are from different backup sets...).
The second solution to the single-file problem is to generate no output at all, i.e. whatever you export from your source database you import directly into your target database, which in the end turns out to be a parallel form of pg_dump | psql. In fact, technically this is rather a parallel pg_restore than a pg_dump as you need to respect the dependencies between objects. The good news is that with the parallel pg_restore of the custom archive format we have everything in place already for this dependency checking. The addition is a new archive type that dumps (just-in-time) whatever the dependency algorithm decides to restore next. This is probably the fastest way that we can copy or upgrade a database when pg_migrator cannot be used (for example when you migrate to a different hardware architecture). As said, I have some working code for the features described (Unix only); if anybody would like to give it a try already now, just let me know, I'd be happy to get some early test reports and you could check for the speedup to expect. But before I continue, I'd like to have a discussion about what people actually want and what is the best way to go forward here. I am currently not planning to make parallel dumps work with the custom format even though this would be possible if we changed the format to a certain degree. Comments? Joachim
Re: [HACKERS] 9.0 release notes done
On Sat, Mar 20, 2010 at 5:02 AM, Bruce Momjian br...@momjian.us wrote: Interestingly the 9.0 release notes contain 201 items, while the 8.4 release notes contained 314 items. Is the following pg_dump change covered by the release notes? I couldn't find it. It was the last committed patch from the 2010-01 commitfest... http://archives.postgresql.org/pgsql-committers/2010-02/msg00233.php https://commitfest.postgresql.org/action/patch_view?id=247 Joachim
Re: [HACKERS] Re: Hot Standby query cancellation and Streaming Replication integration
On Sun, Feb 28, 2010 at 2:54 PM, Greg Stark gsst...@mit.edu wrote: Really? I think we get lots of surprised wows from the field from the idea that a long-running read-only query can cause your database to bloat. I think the only reason that's obvious to us is that we've been grappling with that problem for so long. It seems to me that the scenario that you are looking at is one where people run different queries with and without HS, i.e. that they will run longer read-only queries than now once they have HS. I don't think that is the case. If it isn't, you cannot really speak of master bloat. Instead, I assume that most people who will grab 9.0 and use HS+SR already have a database with a certain query profile. Now with HS+SR they will try to put the most costly and longest read-only queries on the standby but in the end will run the same number of queries with the same overall complexity. Now let's take a look at both scenarios from the administrators' point of view:

1) With the current implementation they will see better performance on the master and more aggressive vacuum (!), since they have fewer long-running queries now on the master and autovacuum can kick in and clean up with less delay than before. On the other hand their queries on the standby might fail and they will start thinking that this HS+SR feature is not as convincing as they thought it was... The next step for them is to take the documentation and study it for a few days to learn all about vacuum, the different delays, transaction ids and age parameters, and experiment for a few weeks until no more queries fail - for a while... But they can never be sure... In the end they might also modify the parameters in the wrong direction or overshoot because of lack of time to experiment and lose another important property without noticing (like being as close as possible to the master).
2) On the other hand if we could ship 9.0 with the xmin-propagation feature, people would still see a better performance and have a hot standby system but this time without query cancellations. Again: the read-only queries that will be processed by the HS in the future are being processed by the master today anyway, so why should it get worse? The first impression will be that it just works nicely out of the box, is easy to set up and has no negative effect (query cancellation) that has not already shown up before (vacuum lag). I guess that most people will just run fine with this setup and never get to know about the internals. Of course we should still offer an expert mode where you can turn all kinds of knobs and where you can avoid the vacuum dependency but it would be nice if this could be the expert mode only. Tuning this is highly installation specific and you need to have a deep understanding of how PostgreSQL and HS work internally and what you actually want to achieve... Agreed. Though I think it'll be bad in that case even if we have a plan B. It'll mean no file-based log shipping replicas and no guarantee that what you run on the standby can't affect the master -- which is a pretty nice guarantee. It'll also mean it'll be much more fragile against network interruptions. Regarding the network interruptions... in reality if you have network interruptions of several minutes between your primary and your standby, you have worse problems anyway... If the standby does not renew its xmin for n seconds, log a message and just go on... Joachim
Re: [HACKERS] Re: Hot Standby query cancellation and Streaming Replication integration
On Sun, Feb 28, 2010 at 8:47 PM, Josh Berkus j...@agliodbs.com wrote: 1) Automated retry of cancelled queries on the slave. I have no idea how hard this would be to implement, but it makes the difference between writing lots of exception-handling code for slave connections (unacceptable) to just slow response times on the slave (acceptable). We're not only canceling queries, we are effectively canceling transactions. It seems quite impossible to repeat all queries from a transaction that has started in the past. One query might build on the result of a previous query, and since the data has changed in the meantime, the client might now want to execute a different query when it gets a different result from that previous query... And even if it was possible, how often would you retry? You still have no guarantee that your query succeeds the second time. I'd claim that if a query failed once, chances are even higher that it fails again than that it succeeds the second time. Moreover, if you continue to repeat the query and queries come in at a certain rate, you need to process more and more queries on the slave, which will not really help other queries to finish in time, nor will it be beneficial for the throughput of the system as a whole... I fully agree with what you say about user expectations: we need to assume that many programs are not prepared for failures of simple read-only queries because in the past they have always worked... Another thing to keep in mind in these discussions is the inexpensiveness of servers today. This means that, if slaves have poor performance, that's OK; one can always spin up more slaves. But if each slave imposes a large burden on the master, then that limits your scalability. The burden of the xmin-publication feature is not the number of slaves, it's just the longest-running queries on whatever slave they are. So your argument applies to both cases...
To minimize the burden on the master, get additional slaves so that you can run your most expensive queries in a shorter time :-) Joachim
[HACKERS] Listen/Notify payload and interfaces
This one is for the maintainers of the various PostgreSQL interfaces: with the new listen/notify implementation we can send a payload along with the notification. This has been in the protocol for years already and there is no change needed for libpq. However, we need to adapt the various interfaces to allow a payload to be sent and received. A notification is sent via SQL: http://developer.postgresql.org/pgdocs/postgres/sql-notify.html and received via the libpq call PQnotifies: http://developer.postgresql.org/pgdocs/postgres/libpq-notify.html Joachim
Re: [HACKERS] Listen / Notify - what to do when the queue is full
On Tue, Feb 16, 2010 at 11:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Joachim Wieland j...@mcknight.de writes: [ listen/notify patch ] I found a number of implementation problems having to do with wraparound behavior and error recovery. I think they're all fixed, but any remaining bugs are probably my fault not yours. First, thanks for the rework you have done and thanks for applying this. While I can see a lot of improvements over my version, I think the logic in asyncQueueProcessPageEntries() needs to be reordered:

+ static bool
+ asyncQueueProcessPageEntries(QueuePosition *current,
+                              QueuePosition stop,
+                              char *page_buffer)
[...]
+     do
+     {
[...]
+         /*
+          * Advance *current over this message, possibly to the next page.
+          * As noted in the comments for asyncQueueReadAllNotifications, we
+          * must do this before possibly failing while processing the message.
+          */
+         reachedEndOfPage = asyncQueueAdvance(current, qe->length);
[...]
+         if (TransactionIdDidCommit(qe->xid))
[...]
+         else if (TransactionIdDidAbort(qe->xid))
[...]
+         else
+         {
+             /*
+              * The transaction has neither committed nor aborted so far,
+              * so we can't process its message yet.  Break out of the loop.
+              */
+             reachedStop = true;
+             break;

In the beginning you are advancing *current but later on you could find out that the transaction is still running. As the position in the queue has already advanced, you would miss one notification here because you'd restart directly behind this notification in the queue... Joachim
Re: [HACKERS] LISTEN/NOTIFY and notification timing guarantees
On Tue, Feb 16, 2010 at 6:20 AM, Tom Lane t...@sss.pgh.pa.us wrote: Another possibility is to force a ProcessIncomingNotifies scan to occur before we reach ReadyForQuery if we sent any notifies in the just-finished transaction --- but that won't help if there are uncommitted messages in front of ours. What about dealing with self-notifies in memory? i.e. copy them into a subcontext of TopMemoryContext in precommit and commit as usual. Then as a first step in ProcessIncomingNotifies() deliver whatever is in memory and then delete the context. While reading the queue, ignore all self-notifies there. If we abort for some reason, delete the context in AtAbort_Notify(). Would that work? Joachim
Re: [HACKERS] LISTEN/NOTIFY and notification timing guarantees
On Tue, Feb 16, 2010 at 1:31 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Tom Lane wrote: We could adopt the historical policy of sending self-notifies pre-commit, but that doesn't seem tremendously appetizing from the standpoint of transactional integrity. But one traditional aspect of transactional integrity is that a transaction always sees *its own* uncommitted work. True but notifications aren't sent until the transaction commits anyway. At the time when an application receives its self-notifies, it has already committed the transaction so there is no uncommitted work anymore. Wouldn't the historical policy of PostgreSQL self-notifies be consistent with that? No. The policy is also to not see the committed work if for some reason the transaction had to roll back during commit. In this case we'd also expect getting no notification from this transaction at all and this is what is violated here. Joachim
Re: [HACKERS] LISTEN/NOTIFY and notification timing guarantees
On Mon, Feb 15, 2010 at 3:31 AM, Tom Lane t...@sss.pgh.pa.us wrote: I'm not sure how probable it is that applications might be coded in a way that relies on the properties lost according to point #2 or #3. Your observations are all correct as far as I can tell. One question regarding #2: Is a client application able to tell whether or not it has received all notifications from one batch? i.e. does PQnotifies() return NULL only when the backend has sent over the complete batch of notifications or could it also return NULL while a batch is still being transmitted but the client-side buffer just happens to be empty? We could fix #2 by not releasing AsyncQueueLock between pages when queuing messages. This has no obvious downsides as far as I can see; if anything it ought to save some cycles and contention. Currently transactions with a small number of notifications can deliver their notifications and then proceed with their commit while transactions with many notifications need to stay there longer, so the current behavior is fair in this respect. Changing the locking strategy makes the small volume transactions wait for the bigger ones. Also currently readers can already start reading while writers are still writing (until they hit the first uncommitted transaction of their database). I think preserving the property that self-notifies are delivered immediately upon commit might be more important than that. Fine with me, sounds reasonable :-) Joachim
Re: [HACKERS] Listen / Notify - what to do when the queue is full
On Sun, Feb 14, 2010 at 11:44 PM, Simon Riggs si...@2ndquadrant.com wrote: Next set of questions * Will this work during Hot Standby now? The barrier was that it wrote to a table and so we could not allow that. ISTM this new version can and should work with Hot Standby. Can you test that and if so, remove the explicit barrier code and change tests and docs to enable it? I have tested it already. The point where it currently fails is the following line:

    qe->xid = GetCurrentTransactionId();

We record the TransactionId (of the notifying transaction) in the notification in order to later check if this transaction has committed successfully or not. If you tell me how we can find this out in HS, we might be done... The reason why we are doing all this is because we fear that we cannot write the notifications to disk once we have committed to clog... So we write them to disk before committing to clog and therefore need to record the TransactionId. * We also discussed the idea of having a NOTIFY command that would work from Primary to Standby. All this would need is some code to WAL log the NOTIFY if not in Hot Standby and for some recovery code to send the NOTIFY to any listeners on the standby. I would suggest that would be an option on NOTIFY to WAL log the notification: e.g. NOTIFY me 'with_payload' FOR STANDBY ALSO; What should happen if you wanted to replay a NOTIFY WAL record on the standby but cannot write to the pg_notify/ directory? * Don't really like pg_listening() as a name. Perhaps pg_listening_to() or pg_listening_on() or pg_listening_for() or pg_listening_channels() or pg_listen_channels() pg_listen_channels() sounds best to me but I leave this decision to a native speaker. * I think it's confusing that pg_notify is both a data structure and a function. Suggest changing one of those to avoid issues in understanding. Use pg_notify might be confused by a DBA. You are talking about the libpq data structure PGnotify I suppose...
I don't see it overly confusing but I wouldn't object changing it. There was a previous discussion about the name, see the last paragraph of http://archives.postgresql.org/message-id/dc7b844e1002021510i4aaa879fy8bbdd003729d2...@mail.gmail.com Joachim
Re: [HACKERS] Listen / Notify - what to do when the queue is full
On Mon, Feb 15, 2010 at 1:48 PM, Simon Riggs si...@2ndquadrant.com wrote: On Mon, 2010-02-15 at 12:59 +0100, Joachim Wieland wrote: I have tested it already. The point where it currently fails is the following line: qe->xid = GetCurrentTransactionId(); That's a shame. So it will never work in Hot Standby mode unless you can think of a different way. We could probably fake this on the Hot Standby in the following way: we introduce a commit record for every notifying transaction and write it into the queue itself. So right before writing anything else, we write an entry which informs readers that the following records are not yet committed. Then we write the actual notifications and commit. In post-commit we go back to the commit record and flip its status. Reading backends would stop at the commit record and we'd signal them so that they can continue. This actually plays nicely with Tom's intent to not have interleaved notifications in the queue (it makes things a bit easier but would probably work either way)... However, we'd need to make sure that we clean up that commit record even if something weird happens (similar to TransactionIdDidAbort() returning true) in order to allow the readers to proceed. Comments? Joachim
Re: [HACKERS] Parameter name standby_mode
On Fri, Feb 12, 2010 at 8:59 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Agreed. I've changed it now so that if primary_conninfo is not set, it doesn't try to establish a streaming connection. If you want to get the connection information from environment variables, you can use primary_conninfo=''. Why not just remove the default: if no primary_conninfo variable is set explicitly in the configuration file, check the environment variables; if the environment variables are not set either, don't try to establish a connection? Joachim