Re: [HACKERS] bytea vs. pg_dump
On Monday 03 August 2009 22:11:08 Tom Lane wrote: I'm starting to look at this patch. I observe that it's setting the default output format to HEX. If changing the default behavior was agreed to, or even discussed, I do not remember where. Shouldn't the default stay the same? I did pose that question in my patch submission email. Unless there is overwhelming support in favor of changing, we probably shouldn't change it, at least not yet. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Peter Eisentraut pete...@gmx.net writes: On Monday 03 August 2009 22:11:08 Tom Lane wrote: I'm starting to look at this patch. I observe that it's setting the default output format to HEX. If changing the default behavior was agreed to, or even discussed, I do not remember where. Shouldn't the default stay the same? I did pose that question in my patch submission email. Unless there is overwhelming support in favor of changing, we probably shouldn't change it, at least not yet. While I've been poking at the pg_dump issues, it's occurred to me that changing the default would be a great forcing function for finding out any lurking problems. What I'm inclined to do now is to commit it *with* the change of default, and let it be that way at least for a few alpha-test releases. We can vote on whether to switch the default back before 8.5 final. If this seems reasonable, I can make a note of the point in the commit message, so that we won't forget when the time comes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Tue, Aug 4, 2009 at 10:28 AM, Tom Lanet...@sss.pgh.pa.us wrote: Peter Eisentraut pete...@gmx.net writes: On Monday 03 August 2009 22:11:08 Tom Lane wrote: I'm starting to look at this patch. I observe that it's setting the default output format to HEX. If changing the default behavior was agreed to, or even discussed, I do not remember where. Shouldn't the default stay the same? I did pose that question in my patch submission email. Unless there is overwhelming support in favor of changing, we probably shouldn't change it, at least not yet. While I've been poking at the pg_dump issues, it's occurred to me that changing the default would be a great forcing function for finding out any lurking problems. What I'm inclined to do now is to commit it *with* the change of default, and let it be that way at least for a few alpha-test releases. We can vote on whether to switch the default back before 8.5 final. If this seems reasonable, I can make a note of the point in the commit message, so that we won't forget when the time comes. Or, what we could do is start an open items for 8.5 list similar to the one we made for 8.4. That worked pretty well, I think. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Tue, Aug 4, 2009 at 16:31, Robert Haasrobertmh...@gmail.com wrote: On Tue, Aug 4, 2009 at 10:28 AM, Tom Lanet...@sss.pgh.pa.us wrote: Peter Eisentraut pete...@gmx.net writes: On Monday 03 August 2009 22:11:08 Tom Lane wrote: I'm starting to look at this patch. I observe that it's setting the default output format to HEX. If changing the default behavior was agreed to, or even discussed, I do not remember where. Shouldn't the default stay the same? I did pose that question in my patch submission email. Unless there is overwhelming support in favor of changing, we probably shouldn't change it, at least not yet. While I've been poking at the pg_dump issues, it's occurred to me that changing the default would be a great forcing function for finding out any lurking problems. What I'm inclined to do now is to commit it *with* the change of default, and let it be that way at least for a few alpha-test releases. We can vote on whether to switch the default back before 8.5 final. If this seems reasonable, I can make a note of the point in the commit message, so that we won't forget when the time comes. Or, what we could do is start an open items for 8.5 list similar to the one we made for 8.4. That worked pretty well, I think. +1 for that solution, it seems much better than having to go back through commit messages. We might as well start it early! -- Magnus Hagander Self: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Robert Haas robertmh...@gmail.com writes: On Tue, Aug 4, 2009 at 10:28 AM, Tom Lanet...@sss.pgh.pa.us wrote: If this seems reasonable, I can make a note of the point in the commit message, so that we won't forget when the time comes. Or, what we could do is start an open items for 8.5 list similar to the one we made for 8.4. That worked pretty well, I think. OK. Historically we haven't made such a list until beta starts, but there's no reason we couldn't start it early. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Dienstag, August 04, 2009 10:28:48 -0400 Tom Lane t...@sss.pgh.pa.us wrote: While I've been poking at the pg_dump issues, it's occurred to me that changing the default would be a great forcing function for finding out any lurking problems. +1 -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle maili...@oopsware.de writes: --On Samstag, Juli 11, 2009 13:40:44 +0300 Peter Eisentraut pete...@gmx.net wrote: OK, here is an updated patch. It has the setting as enum, completed documentation, and libpq support. I'll add it to the commit fest in the hope that someone else can look it over in detail. I've attached a slightly edited patch which fixes a compiler warning in encode.c, too. Committed with assorted corrections. I have not done anything about the issues mentioned in http://archives.postgresql.org/message-id/21837.1248215...@sss.pgh.pa.us mainly that pg_dump's treatment of large-object contents is not safe against changes of standard_conforming_strings. I think that ought to get dealt with before moving on. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
I wrote: While I've been poking at the pg_dump issues, it's occurred to me that changing the default would be a great forcing function for finding out any lurking problems. What I'm inclined to do now is to commit it *with* the change of default, and let it be that way at least for a few alpha-test releases. We can vote on whether to switch the default back before 8.5 final. For the archives, attached is a patch to switch the default and then make pg_dump force use of hex mode. This is just so we won't forget what needs changing if we decide to switch the default back ... regards, tom lane *** doc/src/sgml/config.sgml.orig Tue Aug 4 12:08:35 2009 --- doc/src/sgml/config.sgmlTue Aug 4 12:40:34 2009 *** *** 4068,4078 listitem para Sets the output format for values of type typebytea/type. ! Valid values are literalhex/literal (the default) and literalescape/literal (the traditional PostgreSQL format). See xref linkend=datatype-binary for more information. The typebytea/type type always accepts both formats on input, regardless of this setting. /para /listitem /varlistentry --- 4068,4079 listitem para Sets the output format for values of type typebytea/type. ! Valid values are literalhex/literal and literalescape/literal (the traditional PostgreSQL format). See xref linkend=datatype-binary for more information. The typebytea/type type always accepts both formats on input, regardless of this setting. + The default is literalescape/literal. /para /listitem /varlistentry *** doc/src/sgml/datatype.sgml.orig Tue Aug 4 12:08:35 2009 --- doc/src/sgml/datatype.sgml Tue Aug 4 12:41:03 2009 *** *** 1196,1202 quoteescape/quote format, and quotehex/quote format. Both of these are always accepted on input. The output format depends on the configuration parameter xref linkend=guc-bytea-output; ! the default is hex. (Note that the hex format was introduced in productnamePostgreSQL/productname 8.5; earlier versions and some tools don't understand it.) /para --- 1196,1202 quoteescape/quote format, and quotehex/quote format. Both of these are always accepted on input. The output format depends on the configuration parameter xref linkend=guc-bytea-output; ! the default is escape. (Note that the hex format was introduced in productnamePostgreSQL/productname 8.5; earlier versions and some tools don't understand it.) /para *** src/backend/utils/adt/varlena.c.origTue Aug 4 12:08:36 2009 --- src/backend/utils/adt/varlena.c Tue Aug 4 12:42:36 2009 *** *** 30,36 /* GUC variable */ ! int bytea_output = BYTEA_OUTPUT_HEX; typedef struct varlena unknown; --- 30,36 /* GUC variable */ ! int bytea_output = BYTEA_OUTPUT_ESCAPE; typedef struct varlena unknown; *** src/backend/utils/misc/guc.c.orig Tue Aug 4 12:08:36 2009 --- src/backend/utils/misc/guc.cTue Aug 4 12:42:04 2009 *** *** 2553,2559 NULL }, bytea_output, ! BYTEA_OUTPUT_HEX, bytea_output_options, NULL, NULL }, { --- 2553,2559 NULL }, bytea_output, ! BYTEA_OUTPUT_ESCAPE, bytea_output_options, NULL, NULL }, { *** src/backend/utils/misc/postgresql.conf.sample.orig Mon Aug 3 15:59:39 2009 --- src/backend/utils/misc/postgresql.conf.sample Tue Aug 4 12:42:15 2009 *** *** 424,430 #statement_timeout = 0# in milliseconds, 0 is disabled #vacuum_freeze_min_age = 5000 #vacuum_freeze_table_age = 15000 ! #bytea_output = 'hex' # hex, escape #xmlbinary = 'base64' #xmloption = 'content' --- 424,430 #statement_timeout = 0# in milliseconds, 0 is disabled #vacuum_freeze_min_age = 5000 #vacuum_freeze_table_age = 15000 ! #bytea_output = 'escape' # hex, escape #xmlbinary = 'base64' #xmloption = 'content' *** src/bin/pg_dump/pg_dump.c.orig Tue Aug 4 12:08:36 2009 --- src/bin/pg_dump/pg_dump.c Tue Aug 4 12:39:39 2009 *** *** 599,604 --- 599,610 do_sql_command(g_conn, SET extra_float_digits TO 2); /* +* If supported, select hex format for bytea, for speed reasons. +*/ + if (g_fout-remoteVersion = 80500) + do_sql_command(g_conn, SET bytea_output TO hex); + + /* * If synchronized scanning is supported, disable it, to prevent * unpredictable changes in row ordering
Re: [HACKERS] bytea vs. pg_dump
One other stylistic gripe: I don't much like inserting a GUC variable definition into builtins.h --- that file has traditionally only contained function extern declarations. The best alternative I can think of is to move the bytea-related stuff into a new include file include/utils/bytea.h. Has anyone got an objection or a better idea? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Tue, Aug 4, 2009 at 12:18 AM, Tom Lanet...@sss.pgh.pa.us wrote: One other stylistic gripe: I don't much like inserting a GUC variable definition into builtins.h --- that file has traditionally only contained function extern declarations. The best alternative I can think of is to move the bytea-related stuff into a new include file include/utils/bytea.h. Has anyone got an objection or a better idea? The other guc that controls default i/o formats for a data type is DateStyle. I can't say I expected to find that in miscadmin.h though. Perhaps move both of them into a utils/adt.h or something like that? -- greg http://mit.edu/~gsstark/resume.pdf -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Greg Stark gsst...@mit.edu writes: On Tue, Aug 4, 2009 at 12:18 AM, Tom Lanet...@sss.pgh.pa.us wrote: One other stylistic gripe: I don't much like inserting a GUC variable definition into builtins.h --- that file has traditionally only contained function extern declarations. The best alternative I can think of is to move the bytea-related stuff into a new include file include/utils/bytea.h. Has anyone got an objection or a better idea? The other guc that controls default i/o formats for a data type is DateStyle. I can't say I expected to find that in miscadmin.h though. Perhaps move both of them into a utils/adt.h or something like that? Hmm, actually now that you mention it there's a bunch of GUC variables in miscadmin.h. Surprise factor aside, I'm inclined to just shove bytea_output in there along with DateStyle/IntervalStyle/etc. I did try the new-include-file approach, and unsurprisingly found three or four files that had to be modified to include it, because they'd been expecting to find byteain and byteaout declared in builtins.h. I still think that way is a bit cleaner, but I'm not sure it's enough cleaner to risk breaking third-party code for. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Tom Lane wrote: Greg Stark gsst...@mit.edu writes: On Tue, Aug 4, 2009 at 12:18 AM, Tom Lanet...@sss.pgh.pa.us wrote: One other stylistic gripe: I don't much like inserting a GUC variable definition into builtins.h --- that file has traditionally only contained function extern declarations. �The best alternative I can think of is to move the bytea-related stuff into a new include file include/utils/bytea.h. �Has anyone got an objection or a better idea? The other guc that controls default i/o formats for a data type is DateStyle. I can't say I expected to find that in miscadmin.h though. Perhaps move both of them into a utils/adt.h or something like that? Hmm, actually now that you mention it there's a bunch of GUC variables in miscadmin.h. Surprise factor aside, I'm inclined to just shove bytea_output in there along with DateStyle/IntervalStyle/etc. I vote for a new bytea.h file that does not slurp in byteain/byteaout, to avoid breaking 3rd party code. miscadmin.h seems the worst solution, since it's already included in 210 other files. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support Index: src/pl/plpgsql/src/pl_exec.c === RCS file: /home/alvherre/Code/cvs/pgsql/src/pl/plpgsql/src/pl_exec.c,v retrieving revision 1.246 diff -c -p -r1.246 pl_exec.c *** src/pl/plpgsql/src/pl_exec.c 22 Jul 2009 02:31:38 - 1.246 --- src/pl/plpgsql/src/pl_exec.c 4 Aug 2009 01:00:06 - *** *** 23,28 --- 23,29 #include executor/spi_priv.h #include funcapi.h #include lib/stringinfo.h + #include miscadmin.h #include nodes/nodeFuncs.h #include parser/scansup.h #include storage/proc.h Index: src/pl/plpgsql/src/pl_handler.c === RCS file: /home/alvherre/Code/cvs/pgsql/src/pl/plpgsql/src/pl_handler.c,v retrieving revision 1.44 diff -c -p -r1.44 pl_handler.c *** src/pl/plpgsql/src/pl_handler.c 18 Feb 2009 11:33:04 - 1.44 --- src/pl/plpgsql/src/pl_handler.c 4 Aug 2009 00:59:54 - *** *** 18,23 --- 18,24 #include catalog/pg_proc.h #include catalog/pg_type.h #include funcapi.h + #include miscadmin.h #include utils/builtins.h #include utils/guc.h #include utils/lsyscache.h Index: src/pl/plpgsql/src/plpgsql.h === RCS file: /home/alvherre/Code/cvs/pgsql/src/pl/plpgsql/src/plpgsql.h,v retrieving revision 1.114 diff -c -p -r1.114 plpgsql.h *** src/pl/plpgsql/src/plpgsql.h 22 Jul 2009 02:31:38 - 1.114 --- src/pl/plpgsql/src/plpgsql.h 4 Aug 2009 00:59:06 - *** *** 20,26 #include access/xact.h #include fmgr.h - #include miscadmin.h #include commands/trigger.h #include executor/spi.h #include utils/tuplestore.h --- 20,25 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Alvaro Herrera alvhe...@commandprompt.com writes: I vote for a new bytea.h file that does not slurp in byteain/byteaout, to avoid breaking 3rd party code. miscadmin.h seems the worst solution, since it's already included in 210 other files. Well, unless you want to leave *all* the bytea functions in builtins.h there will still be some risk there. I'd actually sooner break calls of byteaout than other things, because in reality every caller of byteaout is going to need to be inspected to see if it's expecting the old-style output format. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Tom Lane wrote: Alvaro Herrera alvhe...@commandprompt.com writes: I vote for a new bytea.h file that does not slurp in byteain/byteaout, to avoid breaking 3rd party code. miscadmin.h seems the worst solution, since it's already included in 210 other files. Well, unless you want to leave *all* the bytea functions in builtins.h there will still be some risk there. I'd actually sooner break calls of byteaout than other things, because in reality every caller of byteaout is going to need to be inspected to see if it's expecting the old-style output format. Hmm, good point ... why avoid the breakage then? -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Alvaro Herrera alvhe...@commandprompt.com writes: Tom Lane wrote: Well, unless you want to leave *all* the bytea functions in builtins.h there will still be some risk there. I'd actually sooner break calls of byteaout than other things, because in reality every caller of byteaout is going to need to be inspected to see if it's expecting the old-style output format. Hmm, good point ... why avoid the breakage then? Maybe we shouldn't. Okay, back to plan A (separate bytea.h file). (BTW, so far as I can tell there isn't anything in the backend that will be broken in that way. pg_dump, however, is a different story... it knows way too much about pg_trigger.tgargs.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle maili...@oopsware.de writes: --On Samstag, Juli 11, 2009 13:40:44 +0300 Peter Eisentraut pete...@gmx.net wrote: OK, here is an updated patch. It has the setting as enum, completed documentation, and libpq support. I'll add it to the commit fest in the hope that someone else can look it over in detail. I've attached a slightly edited patch which fixes a compiler warning in encode.c, too. I'm starting to look at this patch. I observe that it's setting the default output format to HEX. If changing the default behavior was agreed to, or even discussed, I do not remember where. Shouldn't the default stay the same? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Montag, August 03, 2009 15:11:08 -0400 Tom Lane t...@sss.pgh.pa.us wrote: I'm starting to look at this patch. I observe that it's setting the default output format to HEX. If changing the default behavior was agreed to, or even discussed, I do not remember where. Shouldn't the default stay the same? I would prefer it to be the default at least for pg_dump, if we can get some significant performance improvement for both, dump and restore from it. However, here are some current performance numbers (taken from today, since yesterday i had some trouble to get on the machine): I did some restore testing based on the following flow: BEGIN; TRUNCATE ... ; COPY testtable FROM ... ; ROLLBACK; with bytea_output = 'escape' i get Time: 1478801,770 ms where bytea_output = 'hex' gives: Time: 1448871,566 ms So 'hex' is slightly faster on this machine, but not in the numbers i would have expected. The hex-based restore gives the following profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds secondscalls s/call s/call name 37.81157.22 157.2297847 0.00 0.00 pglz_compress 20.25241.4384.21 141398 0.00 0.00 CopyReadLine 14.44301.4860.05 3605691992 0.00 0.00 get_hex 8.29335.9634.48 141397 0.00 0.00 hex_decode 7.99369.2033.24133.24 398.14 DoCopy 3.95385.6316.43 esc_enc_len 0.71388.58 2.95 137268286 0.00 0.00 _bt_compare 0.54390.81 2.23 7209863 0.00 0.00 XLogInsert 0.48392.81 2.00 49329221 0.00 0.00 hash_search_with_hash_value 0.43394.59 1.78 91132579 0.00 0.00 LWLockAcquire 0.42396.34 1.75 92250421 0.00 0.00 LWLockRelease 0.42398.08 1.75 30477526 0.00 0.00 ReadBuffer_common 0.20398.93 0.85 28686690 0.00 0.00 PinBuffer 0.18399.67 0.74 21541372 0.00 0.00 _bt_binsrch 0.16400.34 0.67 39278753 0.00 0.00 AllocSetAlloc -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle maili...@oopsware.de writes: --On Montag, August 03, 2009 15:11:08 -0400 Tom Lane t...@sss.pgh.pa.us wrote: I'm starting to look at this patch. I observe that it's setting the default output format to HEX. If changing the default behavior was agreed to, or even discussed, I do not remember where. Shouldn't the default stay the same? I would prefer it to be the default at least for pg_dump, Well, we could have pg_dump force the output format to hex regardless of what the default is. A disadvantage of doing that is there wouldn't be any convenient way to get pg_dump to *not* set the output format (unless we add a switch, which seems way overkill). Which would mean there would be no good way to get pg_dump to produce backwards-compatible output. But considering how many other backwards-incompatible changes we have put into pg_dump without blinking, I'm not sure this argument outweighs the probability of breaking a lot of applications. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Freitag, Juli 24, 2009 20:50:16 +0200 Bernd Helmle maili...@oopsware.de wrote: I don't believe i can do very much this weekend... I have to delay that until sunday, but will get my hands on some performance and function tests again, since i have access on the customer machine then. -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle maili...@oopsware.de writes: I've attached a slightly edited patch which fixes a compiler warning in encode.c, too. Bernd, are you done reviewing this or did you intend to do more? It's still marked as needs review on the commitfest page. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Freitag, Juli 24, 2009 11:38:06 -0400 Tom Lane t...@sss.pgh.pa.us wrote: Bernd, are you done reviewing this or did you intend to do more? It's still marked as needs review on the commitfest page. I hoped to get more profiling data like Andrew suggested, but haven't enough time to do it :( The customer machine i can test on is not available all the time, too. I haven't looked very detailed into the source, if you plan to start a review of your own, feel free. I don't believe i can do very much this weekend... -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Samstag, Juli 11, 2009 13:40:44 +0300 Peter Eisentraut pete...@gmx.net wrote: OK, here is an updated patch. It has the setting as enum, completed documentation, and libpq support. I'll add it to the commit fest in the hope that someone else can look it over in detail. I've started looking at this and did some profiling with large bytea data again. For those interested, here are the numbers: Dumping with bytea_output=hex (COPY to file): real20m38.699s user0m11.265s sys 1m0.560s Dumping with bytea_output=escape (COPY to file): real39m52.399s user0m22.085s sys 1m50.131s So the time needed dropped about 50%. The dump file dropped from around 48 GB to 28 GB with the new format. I have some profiler data for this, but the restore seems much more interesting: the time to restore for both formats is quite the same: Restore bytea_output=hex real32m11.028s user0m0.000s sys 0m0.008s Restore bytea_output=escape real31m35.378s user0m0.000s sys 0m0.000s The profile for restoring the hex format looks like this: % cumulative self self total time seconds secondscalls s/call s/call name 34.54156.79 156.7997836 0.00 0.00 pglz_compress 18.64241.3884.59 141374 0.00 0.00 CopyReadLine 12.83299.6258.24 3604740854 0.00 0.00 get_hex 8.44337.9538.33 14257432 0.00 0.00 XLogInsert 7.39371.4833.53 141373 0.00 0.00 hex_decode 7.23404.3132.83132.83 436.67 DoCopy 3.48420.1215.81 esc_enc_len 0.61422.89 2.77 134943749 0.00 0.00 _bt_compare 0.54425.36 2.47 33682172 0.00 0.00 ReadBuffer_common 0.54427.83 2.47 52166324 0.00 0.00 hash_search_with_hash_value 0.45429.89 2.06 104798203 0.00 0.00 LWLockAcquire 0.36431.53 1.64 105234314 0.00 0.00 LWLockRelease I've attached a slightly edited patch which fixes a compiler warning in encode.c, too. -- Thanks Bernddiff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 99d25d7..9a543bc 100644 *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** COPY postgres_log FROM '/full/path/to/lo *** 3735,3740 --- 3735,3757 titleStatement Behavior/title variablelist + varlistentry id=guc-bytea-output xreflabel=bytea_output + termvarnamebytea_output/varname (typeenum/type)/term + indexterm +primaryvarnamebytea_output/ configuration parameter/primary + /indexterm + listitem +para + Sets the output format for values of type typebytea/type. + Valid values are literalhex/literal (the default) + and literalescape/literal (the traditional PostgreSQL + format). The xref linkend=datatype-binary for more + information. Note that the typebytea/type type always + accepts both formats on input. +/para + /listitem + /varlistentry + varlistentry id=guc-search-path xreflabel=search_path termvarnamesearch_path/varname (typestring/type)/term indexterm diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index f7ee8e9..8576419 100644 *** a/doc/src/sgml/datatype.sgml --- b/doc/src/sgml/datatype.sgml *** *** 1,4 ! !-- $PostgreSQL$ -- chapter id=datatype title id=datatype-titleData Types/title --- 1,4 ! !-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.237 2009/04/27 16:27:35 momjian Exp $ -- chapter id=datatype title id=datatype-titleData Types/title *** SELECT b, char_length(b) FROM test2; *** 1191,1196 --- 1191,1256 /para para + The typebytea/type type supports two external formats for + input and output: the quoteescape/quote format that is + particular to PostgreSQL, and the quotehex/quote format. Both + of these are always accepted on input. The output format depends + on the configuration parameter xref linkend=guc-bytea-output; + the default is hex. (Note that the hex format was introduced in + PostgreSQL 8.5; so earlier version and some tools don't understand + it.) +/para + +para + The acronymSQL/acronym standard defines a different binary + string type, called typeBLOB/type or typeBINARY LARGE + OBJECT/type. The input format is different from + typebytea/type, but the provided functions and operators are + mostly the same. +/para + + sect2 +titleThe Hex Format/title + +para + The hex format encodes the binary data as 2 hexadecimal digits per + byte, highest significant nibble first. The entire string ist + preceded by the sequence literal\x/literal (to distinguish it + from the bytea format). In SQL literals, the backslash may need + to be escaped, but it is one logical
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle wrote: --On Samstag, Juli 11, 2009 13:40:44 +0300 Peter Eisentraut pete...@gmx.net wrote: OK, here is an updated patch. It has the setting as enum, completed documentation, and libpq support. I'll add it to the commit fest in the hope that someone else can look it over in detail. I've started looking at this and did some profiling with large bytea data again. For those interested, here are the numbers: Dumping with bytea_output=hex (COPY to file): real20m38.699s user0m11.265s sys 1m0.560s Dumping with bytea_output=escape (COPY to file): real39m52.399s user0m22.085s sys 1m50.131s So the time needed dropped about 50%. The dump file dropped from around 48 GB to 28 GB with the new format. You just tested COPY, not pg_dump, right? Some pg_dump numbers would be interesting, both for text and custom formats. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Dienstag, Juli 21, 2009 16:49:45 -0400 Andrew Dunstan and...@dunslane.net wrote: You just tested COPY, not pg_dump, right? Some pg_dump numbers would be interesting, both for text and custom formats. Plain COPY, yes. I planned testing pg_dump for this round of my review but ran out of time unfortunately. The restore might be limited by xlog (didn't realize that the profile shows XLogInsert in the top four). I'll try to get some additional numbers soon, but this won't happen before thursday. -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle wrote: --On Dienstag, Juli 21, 2009 16:49:45 -0400 Andrew Dunstan and...@dunslane.net wrote: You just tested COPY, not pg_dump, right? Some pg_dump numbers would be interesting, both for text and custom formats. Plain COPY, yes. I planned testing pg_dump for this round of my review but ran out of time unfortunately. The restore might be limited by xlog (didn't realize that the profile shows XLogInsert in the top four). I'll try to get some additional numbers soon, but this won't happen before thursday. If the table is created by the restore job, either use parallel pg_restore (-j nn) or use the --single-transaction flag - both will ensure that the WAL log is avoided. For plain COPY, get the same effect using: begin; truncat foo; copy foo ... ; commit; All this assumes that archive_mode is off. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Peter Eisentraut pete...@gmx.net writes: OK, here is an updated patch. It has the setting as enum, completed documentation, and libpq support. I'll add it to the commit fest in the hope that someone else can look it over in detail. I found that there is another issue that should be addressed, maybe not by this patch but by a follow-on. While looking at Itagaki-san's patch for making pg_dump --clean drop large objects, I noticed that pg_dump is still relying on the deprecated function PQescapeBytea to dump the contents of large objects when it is creating text output. This manages not to fail for common cases, but there is at least one case we overlooked: if you pg_dump with standard_conforming_strings turned on into a custom (or probably tar) archive, and then use pg_restore to generate a SQL script from that, the strings will be improperly escaped. It strikes me that the best solution for this is to emit hex-coded bytea instead of escaped bytea. While we could just hardcode that into pg_dump, it would probably be better if libpq provided a function along the lines of PQescapeByteaHex. In some far future, maybe PQescapeBytea could be rescued from the depths of deprecation by having it emit hex-coded output; but of course that would fail against pre-8.5 servers, so it's a long way off. In the nearer future, it would be possible and perhaps wise for PQescapeByteaConn to adopt hex coding when it sees the connection is to a server = 8.5. It didn't look to me like the patch addressed this either. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Wednesday 08 July 2009 01:07:08 Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: Here is a first cut at a new hex bytea input and output format. Example: ... SET bytea_output_hex = true; Should the configuration parameter be a boolean or an enum, opening possibilities for other formats? Enum. If we do this then it seems entirely fair that someone might want other settings someday. Also, it seems silly to pick a format partly on the grounds that it's expansible, and then not make the control GUC expansible. Perhaps SET bytea_output = [ hex | traditional ] OK, here is an updated patch. It has the setting as enum, completed documentation, and libpq support. I'll add it to the commit fest in the hope that someone else can look it over in detail. I'm attaching two versions of the patch. One it made with the -w option, which leads to less differences. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 2572d78..fece041 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3703,6 +3703,23 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; titleStatement Behavior/title variablelist + varlistentry id=guc-bytea-output xreflabel=bytea_output + termvarnamebytea_output/varname (typeenum/type)/term + indexterm + primaryvarnamebytea_output/ configuration parameter/primary + /indexterm + listitem + para + Sets the output format for values of type typebytea/type. + Valid values are literalhex/literal (the default) + and literalescape/literal (the traditional PostgreSQL + format). The xref linkend=datatype-binary for more + information. Note that the typebytea/type type always + accepts both formats on input. + /para + /listitem + /varlistentry + varlistentry id=guc-search-path xreflabel=search_path termvarnamesearch_path/varname (typestring/type)/term indexterm diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index c944d8f..bdead3e 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -1,4 +1,4 @@ -!-- $PostgreSQL$ -- +!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.237 2009/04/27 16:27:35 momjian Exp $ -- chapter id=datatype title id=datatype-titleData Types/title @@ -1189,6 +1189,66 @@ SELECT b, char_length(b) FROM test2; /para para +The typebytea/type type supports two external formats for +input and output: the quoteescape/quote format that is +particular to PostgreSQL, and the quotehex/quote format. Both +of these are always accepted on input. The output format depends +on the configuration parameter xref linkend=guc-bytea-output; +the default is hex. (Note that the hex format was introduced in +PostgreSQL 8.5; so earlier version and some tools don't understand +it.) + /para + + para +The acronymSQL/acronym standard defines a different binary +string type, called typeBLOB/type or typeBINARY LARGE +OBJECT/type. The input format is different from +typebytea/type, but the provided functions and operators are +mostly the same. + /para + + sect2 + titleThe Hex Format/title + + para +The hex format encodes the binary data as 2 hexadecimal digits per +byte, highest significant nibble first. The entire string ist +preceded by the sequence literal\x/literal (to distinguish it +from the bytea format). In SQL literals, the backslash may need +to be escaped, but it is one logical backslash as far as the +typebytea/type type is concerned. The hex format is compatible with a wide +range of external applications and protocols, and it tends to be +faster than the traditional bytea format, so its use is +somewhat preferrable. + /para + + para +Example: +programlisting +SELECT E'\\xDEADBEEF'; +/programlisting + /para + /sect2 + + sect2 + titleThe Escape Format/title + + para +The quoteescape/quote format is the traditional +PostgreSQL-specific format for the typebytea/type type. It +takes the approach of representing a binary string as a sequence +of ASCII characters and escaping those bytes that cannot be +represented as an ASCII character by a special escape sequence. +If, from the point of view of the application, representing bytes +as characters makes sense, then this representation can be +convenient, but in practice it is usually confusing becauses it +fuzzes up the distinction between binary strings and characters +strings, and the particular escape mechanism that was chosen is +also somewhat unwieldy. So this format should probably not be +used for most new applications. + /para + + para When entering typebytea/type values, octets of certain values emphasismust/emphasis be escaped (but all octet values emphasiscan/emphasis be escaped) when used as part @@ -1341,14 +1401,7
Re: [HACKERS] bytea vs. pg_dump
Hello, Bernd. You wrote: BH --On Dienstag, Juli 07, 2009 18:07:08 -0400 Tom Lane t...@sss.pgh.pa.us BH wrote: Enum. If we do this then it seems entirely fair that someone might want other settings someday. Also, it seems silly to pick a format partly on the grounds that it's expansible, and then not make the control GUC expansible. Perhaps SET bytea_output = [ hex | traditional ] BH I like the enum much better, too, but BH SET bytea_output = [ hex | escape ] BH looks better to me (encode/decode are using something like this already). BH -- BH Thanks BH Bernd Yeah, this looks nice for me too -- With best wishes, Pavel mailto:pa...@gf.microolap.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Dienstag, Juli 07, 2009 18:07:08 -0400 Tom Lane t...@sss.pgh.pa.us wrote: Enum. If we do this then it seems entirely fair that someone might want other settings someday. Also, it seems silly to pick a format partly on the grounds that it's expansible, and then not make the control GUC expansible. Perhaps SET bytea_output = [ hex | traditional ] I like the enum much better, too, but SET bytea_output = [ hex | escape ] looks better to me (encode/decode are using something like this already). -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
2009/7/8 Bernd Helmle maili...@oopsware.de: --On Dienstag, Juli 07, 2009 18:07:08 -0400 Tom Lane t...@sss.pgh.pa.us wrote: Enum. If we do this then it seems entirely fair that someone might want other settings someday. Also, it seems silly to pick a format partly on the grounds that it's expansible, and then not make the control GUC expansible. Perhaps SET bytea_output = [ hex | traditional ] I like the enum much better, too, but SET bytea_output = [ hex | escape ] + 1 Pavel looks better to me (encode/decode are using something like this already). -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Wednesday 06 May 2009 18:47:57 Tom Lane wrote: So the ambiguous-input problem is solved if we define the new format(s) to be started by backslash and something that the old code would reject. I'd keep it short, like \x, but there's still room for multiple formats if anyone really wants to go to the trouble. Here is a first cut at a new hex bytea input and output format. Example: SET bytea_output_hex = true; SELECT E'\\xDeAdBeEf'::bytea; bytea \xdeadbeef (1 row) Bernd did some performance testing for me, and it looked pretty good. Questions: Should this be the default format? Should the configuration parameter be a boolean or an enum, opening possibilities for other formats? diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index c944d8f..a6ac9c8 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -1,4 +1,4 @@ -!-- $PostgreSQL$ -- +!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.237 2009/04/27 16:27:35 momjian Exp $ -- chapter id=datatype title id=datatype-titleData Types/title @@ -1189,6 +1189,63 @@ SELECT b, char_length(b) FROM test2; /para para +The typebytea/type type supports two external formats for +input and output: the traditional bytea format that is particular +to PostgreSQL, and the hex format. Both of these are always +accepted on input. The output format depends on the configuration +parameter bytea_output_format; the default is hex. (Note that the +hex format was introduced in PostgreSQL 8.5; so earlier version +and some tools don't understand it.) + /para + + para +The acronymSQL/acronym standard defines a different binary +string type, called typeBLOB/type or typeBINARY LARGE +OBJECT/type. The input format is different from +typebytea/type, but the provided functions and operators are +mostly the same. + /para + + sect2 + titleHex Format/title + + para +The hex format encodes the binary data as 2 hexadecimal digits per +byte, highest significant nibble first. The entire string ist +preceded by the sequence literal\x/literal (to distinguish it +from the bytea format). In SQL literals, the backslash may need +to be escaped, but it is one logical backslash as far as the +typebytea/type type is concerned. The hex format is compatible with a wide +range of external applications and protocols, and it tends to be +faster than the traditional bytea format, so its use is +somewhat preferrable. + /para + + para +Example: +programlisting +SELECT E'\\xDEADBEEF'; +/programlisting + /para + /sect2 + + sect2 + titleTraditional Bytea Format/title + + para +The traditional bytea format takes the approach of representing a +binary string as a sequence of ASCII characters and escaping those +bytes that cannot be represented as an ASCII character by a +special escape sequence. If, from the point of view of the +application, representing bytes as characters makes sense, then +this representation can be convenient, but in practice it is +usually confusing becauses it fuzzes up the distinction between +binary strings and characters strings, and the particular escape +mechanism that was chosen is also somewhat unwieldy. So this +format should probably not be used for most new applications. + /para + + para When entering typebytea/type values, octets of certain values emphasismust/emphasis be escaped (but all octet values emphasiscan/emphasis be escaped) when used as part @@ -1341,14 +1398,7 @@ SELECT b, char_length(b) FROM test2; have to escape line feeds and carriage returns if your interface automatically translates these. /para - - para -The acronymSQL/acronym standard defines a different binary -string type, called typeBLOB/type or typeBINARY LARGE -OBJECT/type. The input format is different from -typebytea/type, but the provided functions and operators are -mostly the same. - /para + /sect2 /sect1 diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c index eed799a..b8a3cef 100644 --- a/src/backend/utils/adt/encode.c +++ b/src/backend/utils/adt/encode.c @@ -122,8 +122,8 @@ static const int8 hexlookup[128] = { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, }; -static unsigned -hex_encode(const char *src, unsigned len, char *dst) +size_t +hex_encode(const char *src, size_t len, char *dst) { const char *end = src + len; @@ -152,8 +152,8 @@ get_hex(char c) return (char) res; } -static unsigned -hex_decode(const char *src, unsigned len, char *dst) +size_t +hex_decode(const char *src, size_t len, char *dst) { const char *s, *srcend; diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 4cf3966..3c24686 100644 --- a/src/backend/utils/adt/varlena.c +++
Re: [HACKERS] bytea vs. pg_dump
Peter Eisentraut pete...@gmx.net writes: Here is a first cut at a new hex bytea input and output format. Example: ... SET bytea_output_hex = true; Should the configuration parameter be a boolean or an enum, opening possibilities for other formats? Enum. If we do this then it seems entirely fair that someone might want other settings someday. Also, it seems silly to pick a format partly on the grounds that it's expansible, and then not make the control GUC expansible. Perhaps SET bytea_output = [ hex | traditional ] regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Samstag, Mai 30, 2009 00:47:16 +0300 Hannu Krosing ha...@2ndquadrant.com wrote: And we can also escape the need to uncompress TOAST'ed fields - just markup the compression as another \c at the beginning of data. Hmm i thought about that, but that seems only to make sense if there is an easy way to bypass compressing the data on restore. Also, it seems to me that compression/decompression isn't a real bottleneck, but that needs to be confirmed. -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Freitag, Mai 29, 2009 11:06:28 +0300 Peter Eisentraut pete...@gmx.net wrote: Btw., I have started to write some code for that. Cool. Let me know if i can help out somewhere. -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Friday 29 May 2009 04:26:35 Bruce Momjian wrote: Added to TODO: |Improve bytea COPY format * http://archives.postgresql.org/pgsql-hackers/2009-05/msg00192.php Btw., I have started to write some code for that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Wed, 2009-05-06 at 18:33 +0300, Peter Eisentraut wrote: On Tuesday 05 May 2009 17:38:33 Tom Lane wrote: Kevin Grittner kevin.gritt...@wicourts.gov writes: Bernd Helmle maili...@oopsware.de wrote: Another approach would be to just dump bytea columns in binary format only (not sure how doable that is, though). If that's not doable, perhaps a base64 option for bytea COPY? I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. The main problem in any case would be to decide how to control the format option. The output format can be controlled by a GUC parameter. And while we are at it, we can also make bytea understand the new output format on input, so we can offer an end-to-end alternative to the amazingly confusing current bytea format and also make byteain() equally faster at the same time. For distinguishing various input formats, we could use the backslash to escape the format specification without breaking backward compatibilty, e.g., '\hexd41d8cd98f00b204e9800998ecf8427e' With a bit of extra work we can wrap this up to be a more or less SQL- conforming blob type, which would also make a lot of people very happy. And we can also escape the need to uncompress TOAST'ed fields - just markup the compression as another \c at the beginning of data. -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Fri, 2009-05-29 at 11:06 +0300, Peter Eisentraut wrote: On Friday 29 May 2009 04:26:35 Bruce Momjian wrote: Added to TODO: |Improve bytea COPY format * http://archives.postgresql.org/pgsql-hackers/2009-05/msg00192.php Btw., I have started to write some code for that. why not copy bytea always in base64 encoded or similar format - this will both save at least 2x the space on average random bytea data _and_ is probably faster, as it can be more easily done by table lookups in bigger chunks an alternative is to just escape minimal amount of characters, probably just \0 , \n and \\ -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Added to TODO: |Improve bytea COPY format * http://archives.postgresql.org/pgsql-hackers/2009-05/msg00192.php --- Merlin Moncure wrote: On Sat, May 16, 2009 at 11:23 AM, Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote: Bernd Helmle wrote: --On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane t...@sss.pgh.pa.us wrote: So I'm now persuaded that a better textual representation for bytea should indeed make things noticeably better here. ?It would be useful though to cross-check this thought by profiling a case that dumps a comparable volume of text data that contains no backslashes... This is a profiling result of the same data converted into a printable text format without any backslashes. The data amount is quite the same and as you already guessed, calls to appendBinaryStringInfo() and friends gives the expected numbers: time ? seconds ? seconds ? ?calls ? s/call ? s/call ?name 35.13 ? ? 24.67 ? ?24.67 ? 134488 ? ? 0.00 ? ? 0.00 ?byteaout 32.61 ? ? 47.57 ? ?22.90 ? 134488 ? ? 0.00 ? ? 0.00 ?CopyOneRowTo 28.92 ? ? 67.88 ? ?20.31 ? ?85967 ? ? 0.00 ? ? 0.00 ?pglz_decompress ?0.67 ? ? 68.35 ? ? 0.47 ?4955300 ? ? 0.00 ? ? 0.00 hash_search_with_hash_value ?0.28 ? ? 68.55 ? ? 0.20 11643046 ? ? 0.00 ? ? 0.00 ?LWLockRelease ?0.28 ? ? 68.75 ? ? 0.20 ?4828896 ? ? 0.00 ? ? 0.00 ?index_getnext ?0.24 ? ? 68.92 ? ? 0.17 ?1208577 ? ? 0.00 ? ? 0.00 ?StrategyGetBuffer ?0.23 ? ? 69.08 ? ? 0.16 11643046 ? ? 0.00 ? ? 0.00 ?LWLockAcquire ... ?0.00 ? ? 70.23 ? ? 0.00 ? 134498 ? ? 0.00 ? ? 0.00 ?enlargeStringInfo ?0.00 ? ? 70.23 ? ? 0.00 ? 134497 ? ? 0.00 ? ? 0.00 ?appendBinaryStringInfo ?0.00 ? ? 70.23 ? ? 0.00 ? 134490 ? ? 0.00 ? ? 0.00 ?AllocSetReset ?0.00 ? ? 70.23 ? ? 0.00 ? 134490 ? ? 0.00 ? ? 0.00 ?resetStringInfo ?0.00 ? ? 70.23 ? ? 0.00 ? 134488 ? ? 0.00 ? ? 0.00 ?CopySendChar ?0.00 ? ? 70.23 ? ? 0.00 ? 134488 ? ? 0.00 ? ? 0.00 ?CopySendEndOfRow while doing some pg_migrator testing I noticed that dumping a database seems to be much slower than IO-system is capable off. ie i get 100% CPU usage with no IO-wait at all with between 15-30MB/s read rate if i say do a pg_dumpall /dev/null. Part of the problem is the decompression. Can't do much about that except to not compress your data. I don't have any hard statistics on hand at the moment, but a while back we compared 'COPY' vs a hand written SPI routine that got the tuple data in binary and streamed it out field by field raw to a file. The speed difference was enormous..I don't recall the exact difference but copy was at least 2x slower. This seems to suggest there are many potential improvements to copy (my test was mainly bytea as well). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle wrote: --On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane t...@sss.pgh.pa.us wrote: So I'm now persuaded that a better textual representation for bytea should indeed make things noticeably better here. It would be useful though to cross-check this thought by profiling a case that dumps a comparable volume of text data that contains no backslashes... This is a profiling result of the same data converted into a printable text format without any backslashes. The data amount is quite the same and as you already guessed, calls to appendBinaryStringInfo() and friends gives the expected numbers: time seconds secondscalls s/call s/call name 35.13 24.6724.67 134488 0.00 0.00 byteaout 32.61 47.5722.90 134488 0.00 0.00 CopyOneRowTo 28.92 67.8820.3185967 0.00 0.00 pglz_decompress 0.67 68.35 0.47 4955300 0.00 0.00 hash_search_with_hash_value 0.28 68.55 0.20 11643046 0.00 0.00 LWLockRelease 0.28 68.75 0.20 4828896 0.00 0.00 index_getnext 0.24 68.92 0.17 1208577 0.00 0.00 StrategyGetBuffer 0.23 69.08 0.16 11643046 0.00 0.00 LWLockAcquire ... 0.00 70.23 0.00 134498 0.00 0.00 enlargeStringInfo 0.00 70.23 0.00 134497 0.00 0.00 appendBinaryStringInfo 0.00 70.23 0.00 134490 0.00 0.00 AllocSetReset 0.00 70.23 0.00 134490 0.00 0.00 resetStringInfo 0.00 70.23 0.00 134488 0.00 0.00 CopySendChar 0.00 70.23 0.00 134488 0.00 0.00 CopySendEndOfRow while doing some pg_migrator testing I noticed that dumping a database seems to be much slower than IO-system is capable off. ie i get 100% CPU usage with no IO-wait at all with between 15-30MB/s read rate if i say do a pg_dumpall /dev/null. The profile for that looks like: samples %image name symbol name 1333764 29.3986 postgres CopyOneRowTo 463205 10.2099 postgres enlargeStringInfo 2371175.2265 postgres AllocSetAlloc 2310175.0920 postgres appendBinaryStringInfo 2247924.9548 postgres heap_deform_tuple 1721543.7946 postgres AllocSetReset 1624343.5803 postgres DoCopyTo 1499483.3051 postgres internal_putbytes 1375483.0318 postgres OutputFunctionCall 1294802.8540 postgres heapgettup_pagemode 1010172.2266 postgres FunctionCall1 93584 2.0628 postgres pq_putmessage 86553 1.9078 postgres timesub 81400 1.7942 postgres CopySendChar 81230 1.7905 postgres int4out 78374 1.7275 postgres localsub 52003 1.1462 postgres MemoryContextAlloc 51265 1.1300 postgres CopySendEndOfRow 49849 1.0988 postgres SPI_push_conditional 48157 1.0615 postgres pg_server_to_client 47670 1.0507 postgres timestamptz_out 42762 0.9426 postgres timestamp2tm Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Sat, May 16, 2009 at 11:23 AM, Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote: Bernd Helmle wrote: --On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane t...@sss.pgh.pa.us wrote: So I'm now persuaded that a better textual representation for bytea should indeed make things noticeably better here. It would be useful though to cross-check this thought by profiling a case that dumps a comparable volume of text data that contains no backslashes... This is a profiling result of the same data converted into a printable text format without any backslashes. The data amount is quite the same and as you already guessed, calls to appendBinaryStringInfo() and friends gives the expected numbers: time seconds seconds calls s/call s/call name 35.13 24.67 24.67 134488 0.00 0.00 byteaout 32.61 47.57 22.90 134488 0.00 0.00 CopyOneRowTo 28.92 67.88 20.31 85967 0.00 0.00 pglz_decompress 0.67 68.35 0.47 4955300 0.00 0.00 hash_search_with_hash_value 0.28 68.55 0.20 11643046 0.00 0.00 LWLockRelease 0.28 68.75 0.20 4828896 0.00 0.00 index_getnext 0.24 68.92 0.17 1208577 0.00 0.00 StrategyGetBuffer 0.23 69.08 0.16 11643046 0.00 0.00 LWLockAcquire ... 0.00 70.23 0.00 134498 0.00 0.00 enlargeStringInfo 0.00 70.23 0.00 134497 0.00 0.00 appendBinaryStringInfo 0.00 70.23 0.00 134490 0.00 0.00 AllocSetReset 0.00 70.23 0.00 134490 0.00 0.00 resetStringInfo 0.00 70.23 0.00 134488 0.00 0.00 CopySendChar 0.00 70.23 0.00 134488 0.00 0.00 CopySendEndOfRow while doing some pg_migrator testing I noticed that dumping a database seems to be much slower than IO-system is capable off. ie i get 100% CPU usage with no IO-wait at all with between 15-30MB/s read rate if i say do a pg_dumpall /dev/null. Part of the problem is the decompression. Can't do much about that except to not compress your data. I don't have any hard statistics on hand at the moment, but a while back we compared 'COPY' vs a hand written SPI routine that got the tuple data in binary and streamed it out field by field raw to a file. The speed difference was enormous..I don't recall the exact difference but copy was at least 2x slower. This seems to suggest there are many potential improvements to copy (my test was mainly bytea as well). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane t...@sss.pgh.pa.us wrote: So I'm now persuaded that a better textual representation for bytea should indeed make things noticeably better here. It would be useful though to cross-check this thought by profiling a case that dumps a comparable volume of text data that contains no backslashes... This is a profiling result of the same data converted into a printable text format without any backslashes. The data amount is quite the same and as you already guessed, calls to appendBinaryStringInfo() and friends gives the expected numbers: time seconds secondscalls s/call s/call name 35.13 24.6724.67 134488 0.00 0.00 byteaout 32.61 47.5722.90 134488 0.00 0.00 CopyOneRowTo 28.92 67.8820.3185967 0.00 0.00 pglz_decompress 0.67 68.35 0.47 4955300 0.00 0.00 hash_search_with_hash_value 0.28 68.55 0.20 11643046 0.00 0.00 LWLockRelease 0.28 68.75 0.20 4828896 0.00 0.00 index_getnext 0.24 68.92 0.17 1208577 0.00 0.00 StrategyGetBuffer 0.23 69.08 0.16 11643046 0.00 0.00 LWLockAcquire ... 0.00 70.23 0.00 134498 0.00 0.00 enlargeStringInfo 0.00 70.23 0.00 134497 0.00 0.00 appendBinaryStringInfo 0.00 70.23 0.00 134490 0.00 0.00 AllocSetReset 0.00 70.23 0.00 134490 0.00 0.00 resetStringInfo 0.00 70.23 0.00 134488 0.00 0.00 CopySendChar 0.00 70.23 0.00 134488 0.00 0.00 CopySendEndOfRow -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane t...@sss.pgh.pa.us wrote: So I'm now persuaded that a better textual representation for bytea should indeed make things noticeably better here. It would be useful though to cross-check this thought by profiling a case that dumps a comparable volume of text data that contains no backslashes... I'm going to try to create a profile with a converted text representation of the data. -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Dienstag, Mai 05, 2009 16:57:50 -0400 Andrew Dunstan and...@dunslane.net wrote: Hex will already provide some space savings over our current encoding method for most byteas anyway. It's not like we'd be making things less efficient space-wise. And in compressed archives the space difference is likely to dissolve to not very much, I suspect. I'm dumb: I don't understand why a hex conversion would be significantly faster than what we have now? -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle wrote: --On Dienstag, Mai 05, 2009 16:57:50 -0400 Andrew Dunstan and...@dunslane.net wrote: Hex will already provide some space savings over our current encoding method for most byteas anyway. It's not like we'd be making things less efficient space-wise. And in compressed archives the space difference is likely to dissolve to not very much, I suspect. I'm dumb: I don't understand why a hex conversion would be significantly faster than what we have now? Quite apart from anything else you would not need the current loop over the bytea input to calculate the result length - in hex it would just be the input length * 2. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Tue, May 5, 2009 at 4:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Tom Lane wrote: I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. That's a lot less space-efficient than base64, though. Well, base64 could give a 33% savings, but it's significantly harder to encode/decode. Also, since it has a much larger set of valid data characters, it would be *much* more likely to allow old-style formatting to be mistaken for new-style. Unless we can think of a more bulletproof format selection mechanism, that could be an overriding consideration. another nit with base64 is that properly encoded data requires newlines according to the standard. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Merlin Moncure wrote: On Tue, May 5, 2009 at 4:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Tom Lane wrote: I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. That's a lot less space-efficient than base64, though. Well, base64 could give a 33% savings, but it's significantly harder to encode/decode. Also, since it has a much larger set of valid data characters, it would be *much* more likely to allow old-style formatting to be mistaken for new-style. Unless we can think of a more bulletproof format selection mechanism, that could be an overriding consideration. another nit with base64 is that properly encoded data requires newlines according to the standard. er, no, not as I read rfc 3548 s 2.1. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Andrew Dunstan and...@dunslane.net writes: Bernd Helmle wrote: I'm dumb: I don't understand why a hex conversion would be significantly faster than what we have now? Quite apart from anything else you would not need the current loop over the bytea input to calculate the result length - in hex it would just be the input length * 2. Another point is that the current format results in a very large number of backslashes in the output data, which translates to extra time and space at the level of the COPY protocol itself (since that has to double all those backslashes). Of course, base64 would also have these two advantages. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Andrew Dunstan wrote: another nit with base64 is that properly encoded data requires newlines according to the standard. er, no, not as I read rfc 3548 s 2.1. cheers andrew Why does encode('my text', 'base64') include newlines in its output? I think MIME requires text to be broken into 76 char lines but why does encode do this? -- Andrew Chernow eSilo, LLC every bit counts http://www.esilo.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Wed, May 6, 2009 at 8:02 AM, Andrew Dunstan and...@dunslane.net wrote: Merlin Moncure wrote: On Tue, May 5, 2009 at 4:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Tom Lane wrote: I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. That's a lot less space-efficient than base64, though. Well, base64 could give a 33% savings, but it's significantly harder to encode/decode. Also, since it has a much larger set of valid data characters, it would be *much* more likely to allow old-style formatting to be mistaken for new-style. Unless we can think of a more bulletproof format selection mechanism, that could be an overriding consideration. another nit with base64 is that properly encoded data requires newlines according to the standard. er, no, not as I read rfc 3548 s 2.1. PostgreSQL (sort of) follows RFC 2045, not RFC 3548. I don't think it would be a good idea to introduce a second method of encoding base64. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
On Tuesday 05 May 2009 17:38:33 Tom Lane wrote: Kevin Grittner kevin.gritt...@wicourts.gov writes: Bernd Helmle maili...@oopsware.de wrote: Another approach would be to just dump bytea columns in binary format only (not sure how doable that is, though). If that's not doable, perhaps a base64 option for bytea COPY? I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. The main problem in any case would be to decide how to control the format option. The output format can be controlled by a GUC parameter. And while we are at it, we can also make bytea understand the new output format on input, so we can offer an end-to-end alternative to the amazingly confusing current bytea format and also make byteain() equally faster at the same time. For distinguishing various input formats, we could use the backslash to escape the format specification without breaking backward compatibilty, e.g., '\hexd41d8cd98f00b204e9800998ecf8427e' With a bit of extra work we can wrap this up to be a more or less SQL- conforming blob type, which would also make a lot of people very happy. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Peter Eisentraut pete...@gmx.net writes: For distinguishing various input formats, we could use the backslash to escape the format specification without breaking backward compatibilty, e.g., Oh, you're right! I had been thinking that byteain treats \x as just meaning x if x isn't an octal digit, but actually it throws an error for anything except octal digits and backslashes: regression=# select E'\\x'::bytea; ERROR: invalid input syntax for type bytea LINE 1: select E'\\x'::bytea; ^ and a quick check verifies it has always done that. So the ambiguous-input problem is solved if we define the new format(s) to be started by backslash and something that the old code would reject. I'd keep it short, like \x, but there's still room for multiple formats if anyone really wants to go to the trouble. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle maili...@oopsware.de writes: --On Dienstag, Mai 05, 2009 10:00:37 -0400 Tom Lane t...@sss.pgh.pa.us wrote: Seems like the right response might be some micro-optimization effort on byteaout. Hmm looking into profiler statistics seems to second your suspicion: Normal COPY shows: % cumulative self self total time seconds secondscalls s/call s/call name 31.29 81.3881.38 134487 0.00 0.00 CopyOneRowTo 22.88140.8959.51 134487 0.00 0.00 byteaout 13.44175.8434.95 3052797224 0.00 0.00 appendBinaryStringInfo 12.10207.3231.48 3052990837 0.00 0.00 CopySendChar 8.45229.3121.99 3052797226 0.00 0.00 enlargeStringInfo 3.90239.4510.1455500 0.00 0.00 pglz_decompress I hadn't looked closely at these numbers before, but now that I do, what I think they are telling us is that the high proportion of backslashes in standard bytea output is a real killer for COPY performance. With no backslashes, CopySendChar wouldn't be in the picture at all here, and appendBinaryStringInfo/enlargeStringInfo would be called many fewer times (roughly 134487 not 3052797224) with proportionately more characters processed per call. The inner loop of CopyOneRowTo (I assume CopyAttributeOutText has been inlined into that function) is relatively cheap for ordinary characters and much less so for backslashes, so I bet that number would go down too. And as already noted, byteaout itself works pretty hard to produce the current representation. So I'm now persuaded that a better textual representation for bytea should indeed make things noticeably better here. It would be useful though to cross-check this thought by profiling a case that dumps a comparable volume of text data that contains no backslashes... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle maili...@oopsware.de writes: From time to time we had complains about slow dump of large tables with bytea columns, people often complaining about a) size and b) duration of the dump. That latter occurred recently to me, a customer would like to dump large tables (approx. 12G in size) with pg_dump, but he was annoyed about the performance. Using COPY BINARY reduced the time (unsurprisingly) to a fraction (from 12 minutes to 3 minutes). Seems like the right response might be some micro-optimization effort on byteaout. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Bernd Helmle maili...@oopsware.de wrote: Another approach would be to just dump bytea columns in binary format only (not sure how doable that is, though). If that's not doable, perhaps a base64 option for bytea COPY? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Tom Lane t...@sss.pgh.pa.us writes: Bernd Helmle maili...@oopsware.de writes: That latter occurred recently to me, a customer would like to dump large tables (approx. 12G in size) with pg_dump, but he was annoyed about the performance. Using COPY BINARY reduced the time (unsurprisingly) to a fraction (from 12 minutes to 3 minutes). Seems like the right response might be some micro-optimization effort on byteaout. Still, apart from lack of interest from developpers and/or resources, is there some reason we don't have a pg_dump --binary option? DBA would have to make sure his exports are usable, but when the routine pg_dump backup is mainly there to be able to restore on the same machine in case of unwanted event (DELETE bug, malicious TRUNCATE, you name it), having a faster dump/restore even if local only would be of interest. Regards, -- dim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Dimitri Fontaine dfonta...@hi-media.com writes: Still, apart from lack of interest from developpers and/or resources, is there some reason we don't have a pg_dump --binary option? It seems rather antithetical to one of the main goals of pg_dump, which is to provide a dump that can reliably be loaded onto other machines or newer versions of Postgres. I don't think that we should provide such a foot-gun in hopes of getting relatively minor performance improvements; especially when we have not exhausted the alternatives. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Tom Lane t...@sss.pgh.pa.us writes: It seems rather antithetical to one of the main goals of pg_dump, which is to provide a dump that can reliably be loaded onto other machines or newer versions of Postgres. You're calling for a pg_export/pg_import tool suite, or I have to learn to read again :) I don't think that we should provide such a foot-gun in hopes of getting relatively minor performance improvements; especially when we have not exhausted the alternatives. If you think improvements will be minor while alternatives are promising, of course, I'm gonna take your word for it. Regards, -- dim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Kevin Grittner kevin.gritt...@wicourts.gov writes: Bernd Helmle maili...@oopsware.de wrote: Another approach would be to just dump bytea columns in binary format only (not sure how doable that is, though). If that's not doable, perhaps a base64 option for bytea COPY? I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. The main problem in any case would be to decide how to control the format option. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Tom Lane wrote: Kevin Grittner kevin.gritt...@wicourts.gov writes: Bernd Helmle maili...@oopsware.de wrote: Another approach would be to just dump bytea columns in binary format only (not sure how doable that is, though). If that's not doable, perhaps a base64 option for bytea COPY? I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. The main problem in any case would be to decide how to control the format option. It would be great if COPY FROM could read some fields as binary while the rest is text. That would allow us to do something like --bytea-column-format=binary --bytea-column-format=hexpair --bytea-column-format=text -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
--On Dienstag, Mai 05, 2009 10:00:37 -0400 Tom Lane t...@sss.pgh.pa.us wrote: Seems like the right response might be some micro-optimization effort on byteaout. Hmm looking into profiler statistics seems to second your suspicion: Normal COPY shows: % cumulative self self total time seconds secondscalls s/call s/call name 31.29 81.3881.38 134487 0.00 0.00 CopyOneRowTo 22.88140.8959.51 134487 0.00 0.00 byteaout 13.44175.8434.95 3052797224 0.00 0.00 appendBinaryStringInfo 12.10207.3231.48 3052990837 0.00 0.00 CopySendChar 8.45229.3121.99 3052797226 0.00 0.00 enlargeStringInfo 3.90239.4510.1455500 0.00 0.00 pglz_decompress 3.28247.97 8.523 2.84 2.84 appendStringInfoChar 1.82252.71 4.74 134489 0.00 0.00 resetStringInfo 1.72257.18 4.47 copy_dest_destroy 0.27257.89 0.71 5544679 0.00 0.00 hash_search_with_hash_value 0.09258.13 0.24 13205044 0.00 0.00 LWLockAcquire 0.08258.35 0.22 13205044 0.00 0.00 LWLockRelease COPY BINARY generates: time seconds secondscalls s/call s/call name 73.70 9.05 9.0555500 0.00 0.00 pglz_decompress 6.03 9.79 0.74 5544679 0.00 0.00 hash_search_with_hash_value 2.93 10.15 0.36 13205362 0.00 0.00 LWLockAcquire 1.87 10.38 0.23 13205362 0.00 0.00 LWLockRelease This is PostgreSQL 8.3.7 btw. -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Tom Lane wrote: Kevin Grittner kevin.gritt...@wicourts.gov writes: Bernd Helmle maili...@oopsware.de wrote: Another approach would be to just dump bytea columns in binary format only (not sure how doable that is, though). If that's not doable, perhaps a base64 option for bytea COPY? I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. The main problem in any case would be to decide how to control the format option. Yeah. Any ideas on how to do that? I can't think of anything very clean offhand. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Andrew Dunstan and...@dunslane.net writes: Tom Lane wrote: I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. The main problem in any case would be to decide how to control the format option. Yeah. Any ideas on how to do that? I can't think of anything very clean offhand. Well, there's nothing much wrong with a GUC setting to control output --- we have lots of precedent, such as DateStyle. The problem is with figuring out what ambiguous input is meant to be. There seems to be an uncomfortably high risk of misinterpreting the input. For sake of argument, suppose we define the hex format as 0x followed by pairs of hex digits. We could then modify byteaout so that if it were told to print in old-style a value that happened to start with 0x, it could output 0\x instead, which means the same but would be unambiguous. This would fix the problem going forward, but old-style dumps and un-updated clients would still be at risk. The risk might not be too high though, since the odds of successfully parsing old-style data as hex would be relatively low, particularly if we were draconian about case (ie the x MUST be lower case and the hex digits MUST be upper). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Sorry got top-posting -- stupid iphone mail client. We could eliminate the problem with old dumps by doing something like \x to indicate a new-style hex dump. That doesn't make us 100% safe against arbitrary user input but should be pretty low risk. -- Greg On 5 May 2009, at 18:51, Tom Lane t...@sss.pgh.pa.us wrote: Andrew Dunstan and...@dunslane.net writes: Tom Lane wrote: I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. The main problem in any case would be to decide how to control the format option. Yeah. Any ideas on how to do that? I can't think of anything very clean offhand. Well, there's nothing much wrong with a GUC setting to control output --- we have lots of precedent, such as DateStyle. The problem is with figuring out what ambiguous input is meant to be. There seems to be an uncomfortably high risk of misinterpreting the input. For sake of argument, suppose we define the hex format as 0x followed by pairs of hex digits. We could then modify byteaout so that if it were told to print in old-style a value that happened to start with 0x, it could output 0\x instead, which means the same but would be unambiguous. This would fix the problem going forward, but old-style dumps and un-updated clients would still be at risk. The risk might not be too high though, since the odds of successfully parsing old- style data as hex would be relatively low, particularly if we were draconian about case (ie the x MUST be lower case and the hex digits MUST be upper). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Tom Lane t...@sss.pgh.pa.us wrote: Unless we can think of a more bulletproof format selection mechanism Would it make any sense to have an option on the COPY command to tell it to use base64 for bytea columns? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Tom Lane wrote: I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. That's a lot less space-efficient than base64, though. Well, base64 could give a 33% savings, but it's significantly harder to encode/decode. Also, since it has a much larger set of valid data characters, it would be *much* more likely to allow old-style formatting to be mistaken for new-style. Unless we can think of a more bulletproof format selection mechanism, that could be an overriding consideration. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Tom Lane wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Tom Lane wrote: I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. That's a lot less space-efficient than base64, though. Well, base64 could give a 33% savings, but it's significantly harder to encode/decode. Also, since it has a much larger set of valid data characters, it would be *much* more likely to allow old-style formatting to be mistaken for new-style. Unless we can think of a more bulletproof format selection mechanism, that could be an overriding consideration. Hex will already provide some space savings over our current encoding method for most byteas anyway. It's not like we'd be making things less efficient space-wise. And in compressed archives the space difference is likely to dissolve to not very much, I suspect. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea vs. pg_dump
Tom Lane wrote: Kevin Grittner kevin.gritt...@wicourts.gov writes: Bernd Helmle maili...@oopsware.de wrote: Another approach would be to just dump bytea columns in binary format only (not sure how doable that is, though). If that's not doable, perhaps a base64 option for bytea COPY? I'm thinking plain old pairs-of-hex-digits might be the best tradeoff if conversion speed is the criterion. That's a lot less space-efficient than base64, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers