Re: [HACKERS] copy with compression progress n
On Wed, 2006-05-31 at 17:31, Andreas Pflug wrote:
> Tom Lane wrote:
> > Andreas Pflug <[EMAIL PROTECTED]> writes:
> >
> > > The attached patch implements COPY ... WITH [BINARY] COMPRESSION
> > > (compression implies BINARY). The copy data uses bit 17 of the flag
> > > field to identify compressed data.
> >
> > I think this is a pretty horrid idea, because it changes pg_lzcompress
> > from an unimportant implementation detail into a backup file format
> > that we have to support till the end of time. What happens if, say,
> > we need to abandon pg_lzcompress because we find out it has patent
> > problems?
> >
> > It *might* be tolerable if we used gzip instead,
>
> I used pg_lzcompress because it's present in the backend. I'm fine with
> every other good compression algorithm.
>
> > but I really don't see the argument for doing this inside the server
> > at all: piping to gzip seems like a perfectly acceptable solution,
>
> As I said, this hits only if it is possible to pipe the result into gzip
> in a performant way. The issue already arises if psql or any other COPY
> client (slony, pg_dump) is not on the same machine: network bandwidth
> will limit throughput.

Maybe we could make up a way to pipe the COPY result through some
external process (like gzip) on the server side without having shell
access there. To make it secure, the external process should probably be
run from a hardwired directory via chroot.

-- 
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faq
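A minimal sketch of the server-side compression pipe Hannu describes,
using a FIFO as a stand-in for the real plumbing: a helper process gzips
whatever gets written into the pipe, so the producer (in the real
scenario, something like COPY mytable TO '/tmp/copyfifo') never needs a
shell. The file names are arbitrary, and printf stands in for the COPY
producer so the sketch runs without a live server:

```shell
# Reader side: a gzip helper drains the FIFO into a compressed file.
rm -f /tmp/copyfifo /tmp/mytable.gz
mkfifo /tmp/copyfifo
gzip -1 < /tmp/copyfifo > /tmp/mytable.gz &

# Writer side: stands in for COPY writing rows into the FIFO.
printf 'row1\tfoo\nrow2\tbar\n' > /tmp/copyfifo

# Wait for gzip to finish, then verify the round trip.
wait
gunzip -c /tmp/mytable.gz
```

The open-for-write on the FIFO blocks until the reader has the other end
open, which is why the gzip helper is started first.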
Re: [HACKERS] copy with compression progress n
Andreas Pflug <[EMAIL PROTECTED]> writes:
> Do you have a comment about the progress notification and its impact on
> copy to stdout?

I didn't bother to comment on it because I think it's useless, as well
as broken for the stdout case. Anyone who actually sees a use for it
will have to comment on why they want it.

			regards, tom lane
Re: [HACKERS] copy with compression progress n
Tom Lane wrote:
> Andreas Pflug <[EMAIL PROTECTED]> writes:
>
> > The attached patch implements COPY ... WITH [BINARY] COMPRESSION
> > (compression implies BINARY). The copy data uses bit 17 of the flag
> > field to identify compressed data.
>
> I think this is a pretty horrid idea, because it changes pg_lzcompress
> from an unimportant implementation detail into a backup file format
> that we have to support till the end of time. What happens if, say, we
> need to abandon pg_lzcompress because we find out it has patent
> problems?
>
> It *might* be tolerable if we used gzip instead,

I used pg_lzcompress because it's present in the backend. I'm fine with
every other good compression algorithm.

> but I really don't see the argument for doing this inside the server at
> all: piping to gzip seems like a perfectly acceptable solution,

As I said, this hits only if it is possible to pipe the result into gzip
in a performant way. The issue already arises if psql or any other COPY
client (slony, pg_dump) is not on the same machine: network bandwidth
will limit throughput.

> quite possibly with higher performance than doing it all in a single
> process (which isn't going to be able to use more than one CPU).

Which is pretty normal for pgsql.

> I don't see the argument for restricting it to binary only, either.

That's not a restriction, but a result: compressed data is binary.
Marking it as binary will make it work with older frontends as well, as
long as they don't try to interpret the data. Actually, all 8.x psql
versions should work (with COPY STDxx, not \copy).

Do you have a comment about the progress notification and its impact on
copy to stdout?

Regards,
Andreas
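For reference, the COPY BINARY stream begins with an 11-byte signature,
a 32-bit flags field in network byte order (bit 16, counting from the
LSB, is the existing "OIDs included" flag), and a 32-bit header
extension length. A sketch of the 19-byte header the patch would emit,
with only the proposed compression bit (bit 17) set; the output file
name is arbitrary:

```shell
# Build the COPY BINARY header by hand:
# 11-byte signature + 4-byte big-endian flags + 4-byte extension length.
{
  printf 'PGCOPY\n\377\r\n\0'   # signature
  printf '\0\002\0\0'           # flags 0x00020000: proposed bit 17 set
  printf '\0\0\0\0'             # header extension length: 0
} > /tmp/copyhdr.bin
wc -c < /tmp/copyhdr.bin        # 19 bytes
```

An older frontend that merely passes the data through never looks past
this header, which is why marking compressed data as binary keeps it
working.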
Re: [HACKERS] copy with compression progress n
Andreas Pflug <[EMAIL PROTECTED]> writes:
> The attached patch implements COPY ... WITH [BINARY] COMPRESSION
> (compression implies BINARY). The copy data uses bit 17 of the flag
> field to identify compressed data.

I think this is a pretty horrid idea, because it changes pg_lzcompress
from an unimportant implementation detail into a backup file format that
we have to support till the end of time. What happens if, say, we need
to abandon pg_lzcompress because we find out it has patent problems?

It *might* be tolerable if we used gzip instead, but I really don't see
the argument for doing this inside the server at all: piping to gzip
seems like a perfectly acceptable solution, quite possibly with higher
performance than doing it all in a single process (which isn't going to
be able to use more than one CPU).

I don't see the argument for restricting it to binary only, either.

			regards, tom lane
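The client-side alternative Tom refers to is just a shell pipeline. A
sketch, where copy_stream fakes the COPY text output so it runs without
a server (in real use the producer would be something like
`psql -c "COPY mytable TO STDOUT" mydb`):

```shell
# Compress COPY output on the client with a second process, so the
# compression runs on its own CPU rather than inside the backend.
copy_stream() { printf '1\tfoo\n2\tbar\n'; }   # stand-in for psql/COPY
copy_stream | gzip -1 > /tmp/mytable.copy.gz

# Decompressing recovers the original tab-separated rows.
gunzip -c /tmp/mytable.copy.gz
```

As the thread notes, this only helps when the uncompressed stream does
not have to cross a slow network link first.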
[HACKERS] copy with compression progress n
I've been playing around with COPYing large binary data, and implemented
a COMPRESSION transfer format. The server-side compression saves
significant bandwidth, which may be the major limiting factor when large
amounts of data are involved (i.e. in many cases where COPY TO/FROM
STDIN/STDOUT is used). In addition, a progress notification can be
enabled using a PROGRESS option.

I tested this with a table containing 2000 rows with a highly
compressible bytea column (size 1.4GB, on-disk 138MB). Numbers are as
follows (8.2 HEAD psql):

pg_dump -a -F c -t                652s, 146MB
\copy TO /dev/null                322s
\copy TO /dev/null binary          24s
\copy TO /dev/null compression    108s
\copy TO /tmp/file binary          55s, 1.4GB
\copy TO /tmp/file compression    108s, 133MB
\copy TO STDOUT binary|gzip -1     69s, 117MB

So the plain text copy has a large overhead for text data over the
binary formats. OTOH, copying normal rows WITH BINARY may bloat the
result too. A typical test table gave these numbers:

COPY:         6014 bytes
BINARY:      15071 bytes
COMPRESSION:  2334 bytes

The compression (pg_lzcompress) is less efficient than a binary copy
piped to gzip, as long as the data transfer of 1.4GB from server to
client isn't limited by network bandwidth. Apparently, pg_lzcompress
uses 53s to compress to 133MB, while gzip only needs 14s for 117MB. It
might be worth having a look at optimizing that, since it's used in
tuptoaster. Still, when network traffic is involved, it may be better to
spend some time on the server to reduce the data (e.g. for Slony, which
uses COPY to start a replication, and is likely to be operated over
lines <1GBit/s).

The attached patch implements COPY ... WITH [BINARY] COMPRESSION
(compression implies BINARY). The copy data uses bit 17 of the flag
field to identify compressed data.

The PROGRESS option to throw notices each n lines has a caveat: when
copying TO STDOUT, data transfer will cease after the first notice was
sent.
This may either mean "don't ereport(NOTICE) when COPYing data to the
client" or a bug somewhere.

Regards,
Andreas

Index: src/backend/commands/copy.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.266
diff -c -r1.266 copy.c
*** src/backend/commands/copy.c	26 May 2006 22:50:02 -0000	1.266
--- src/backend/commands/copy.c	31 May 2006 08:52:42 -0000
***************
*** 47,53 ****
  #include "utils/memutils.h"
  #include "utils/relcache.h"
  #include "utils/syscache.h"
! 
  #define ISOCTAL(c) (((c) >= '0') && ((c) <= '7'))
  #define OCTVALUE(c) ((c) - '0')
--- 47,53 ----
  #include "utils/memutils.h"
  #include "utils/relcache.h"
  #include "utils/syscache.h"
! #include "utils/pg_lzcompress.h"
  
  #define ISOCTAL(c) (((c) >= '0') && ((c) <= '7'))
  #define OCTVALUE(c) ((c) - '0')
***************
*** 103,114 ****
--- 103,121 ----
  	int			client_encoding;	/* remote side's character encoding */
  	bool		need_transcoding;	/* client encoding diff from server? */
  	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte? */
+ 	bool		do_compress;	/* compress data before writing to output */
+ 	bool		do_flush;		/* flush fe_msgbuf to copy target file/pipe */
+ 	bool		use_raw_buf;	/* use raw buffered data for CopyGetData */
  	uint64		processed;		/* # of tuples processed */
+ 	uint64		progress;		/* progress notice each # tuples processed */
+ 
+ 	MemoryContext oldcontext;
  
  	/* parameters from the COPY command */
  	Relation	rel;			/* relation to copy to or from */
  	List	   *attnumlist;		/* integer list of attnums to copy */
  	bool		binary;			/* binary format? */
+ 	bool		compression;	/* binary compressed format? */
  	bool		oids;			/* include OIDs? */
  	bool		csv_mode;		/* Comma Separated Value format? */
  	bool		header_line;	/* CSV header line? */
***************
*** 153,162 ****
  	 * converts it.  Note: we guarantee that there is a \0 at
  	 * raw_buf[raw_buf_len].
  	 */
! #define RAW_BUF_SIZE 65536	/* we palloc RAW_BUF_SIZE+1 bytes */
  	char	   *raw_buf;
  	int			raw_buf_index;	/* next byte to process */
  	int			raw_buf_len;	/* total # of bytes stored */
  } CopyStateData;
  
  typedef CopyStateData *CopyState;
--- 160,170 ----
  	 * converts it.  Note: we guarantee that there is a \0 at
  	 * raw_buf[raw_buf_len].
  	 */
! #define RAW_BUF_SIZE 65536	/* initially, we palloc RAW_BUF_SIZE+1 bytes */
  	char	   *raw_buf;
  	int			raw_bu