Re: [HACKERS] copy with compression progress n

2006-06-01 Thread Hannu Krosing
On Wed, 2006-05-31 at 17:31, Andreas Pflug wrote:
> Tom Lane wrote:
> > Andreas Pflug <[EMAIL PROTECTED]> writes:
> > 
> >>The attached patch implements COPY ... WITH [BINARY] COMPRESSION 
> >>(compression implies BINARY). The copy data uses bit 17 of the flag 
> >>field to identify compressed data.
> > 
> > 
> > I think this is a pretty horrid idea, because it changes pg_lzcompress
> > from an unimportant implementation detail into a backup file format
> > that we have to support till the end of time.  What happens if, say,
> > we need to abandon pg_lzcompress because we find out it has patent
> > problems?
> > 
> > It *might* be tolerable if we used gzip instead,
> 
> I used pg_lzcompress because it's present in the backend. I'm fine with 
> every other good compression algorithm.
> 
> >  but I really don't see
> > the argument for doing this inside the server at all: piping to gzip
> > seems like a perfectly acceptable solution,
> 
> As I said, that works only if it is possible to pipe the result into
> gzip efficiently. The issue already arises when psql or any other COPY
> client (Slony, pg_dump) is not on the same machine: network bandwidth
> will limit throughput.

Maybe come up with a way to pipe the COPY result through some external
process (like gzip) on the server side, without having shell access there.

To make it secure, the external process should probably be run from a
hardwired directory via chroot.
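
A minimal sketch of that kind of server-side piping, using popen() to
stream rows through an external gzip (the command string, output path,
and row loop are made up for illustration; a real implementation would
exec from the hardwired directory mentioned above rather than invoke a
shell):

#include <stdio.h>

int main(void)
{
    /* Hypothetical: stream COPY-style rows through an external
     * compressor. "gzip -1" and the output path are placeholders. */
    FILE *pipe = popen("gzip -1 > /tmp/copy_out.gz", "w");

    if (pipe == NULL)
    {
        perror("popen");
        return 1;
    }

    /* Pretend these are the COPY data lines being produced. */
    for (int i = 0; i < 1000; i++)
        fprintf(pipe, "%d\tsome row data\n", i);

    /* pclose() waits for gzip to exit and returns its status. */
    if (pclose(pipe) != 0)
    {
        fprintf(stderr, "compressor failed\n");
        return 1;
    }
    return 0;
}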

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] copy with compression progress n

2006-05-31 Thread Tom Lane
Andreas Pflug <[EMAIL PROTECTED]> writes:
> Do you have a comment about the progress notification and its impact on 
> copy to stdout?

I didn't bother to comment on it because I think it's useless, as well
as broken for the stdout case.  Anyone who actually sees a use for it
will have to comment on why they want it.

regards, tom lane



Re: [HACKERS] copy with compression progress n

2006-05-31 Thread Andreas Pflug

Tom Lane wrote:

> Andreas Pflug <[EMAIL PROTECTED]> writes:
>> The attached patch implements COPY ... WITH [BINARY] COMPRESSION
>> (compression implies BINARY). The copy data uses bit 17 of the flag
>> field to identify compressed data.
>
> I think this is a pretty horrid idea, because it changes pg_lzcompress
> from an unimportant implementation detail into a backup file format
> that we have to support till the end of time.  What happens if, say,
> we need to abandon pg_lzcompress because we find out it has patent
> problems?
>
> It *might* be tolerable if we used gzip instead,

I used pg_lzcompress because it's present in the backend. I'm fine with
every other good compression algorithm.

> but I really don't see
> the argument for doing this inside the server at all: piping to gzip
> seems like a perfectly acceptable solution,

As I said, that works only if it is possible to pipe the result into
gzip efficiently. The issue already arises when psql or any other COPY
client (Slony, pg_dump) is not on the same machine: network bandwidth
will limit throughput.

> quite possibly with higher
> performance than doing it all in a single process (which isn't going
> to be able to use more than one CPU).

Which is pretty normal for pgsql.

> I don't see the argument for restricting it to binary only, either.

That's not a restriction, but a consequence: compressed data is binary.
Marking it as binary makes it work with older frontends as well, as
long as they don't try to interpret the data. Actually, all 8.x psql
versions should work (with COPY STDxx, not \copy).

Do you have a comment about the progress notification and its impact on
copy to stdout?


Regards,
Andreas



Re: [HACKERS] copy with compression progress n

2006-05-31 Thread Tom Lane
Andreas Pflug <[EMAIL PROTECTED]> writes:
> The attached patch implements COPY ... WITH [BINARY] COMPRESSION 
> (compression implies BINARY). The copy data uses bit 17 of the flag 
> field to identify compressed data.

I think this is a pretty horrid idea, because it changes pg_lzcompress
from an unimportant implementation detail into a backup file format
that we have to support till the end of time.  What happens if, say,
we need to abandon pg_lzcompress because we find out it has patent
problems?

It *might* be tolerable if we used gzip instead, but I really don't see
the argument for doing this inside the server at all: piping to gzip
seems like a perfectly acceptable solution, quite possibly with higher
performance than doing it all in a single process (which isn't going
to be able to use more than one CPU).

I don't see the argument for restricting it to binary only, either.

regards, tom lane



[HACKERS] copy with compression progress n

2006-05-31 Thread Andreas Pflug
I've been playing around with COPYing large binary data, and implemented
a COMPRESSION transfer format. The server-side compression saves
significant bandwidth, which may be the major limiting factor when large
amounts of data are involved (i.e. in many cases where COPY TO/FROM
STDIN/STDOUT is used).
In addition, a progress notification can be enabled using a PROGRESS n
option.


I tested this with a table containing 2000 rows with a highly
compressible bytea column (size 1.4GB, on-disk 138MB). Numbers are as
follows (8.2 HEAD psql):

pg_dump -a -F c -t  652s, 146MB
\copy TO /dev/null  322s
\copy TO /dev/null binary   24s
\copy TO /dev/null compression  108s
\copy TO /tmp/file binary   55s, 1.4GB
\copy TO /tmp/file compression  108s, 133MB
\copy TO STDOUT binary|gzip -1  69s, 117MB

So plain text COPY has a large overhead over the binary formats for this
kind of data. OTOH, copying normal rows WITH BINARY may bloat the result
too. A typical test table gave these numbers:

COPY:   6014 Bytes
BINARY: 15071 Bytes
COMPRESSION:2334 Bytes

The compression (pg_lzcompress) is less efficient than a binary copy
piped to gzip, as long as the 1.4GB data transfer from server to client
isn't limited by network bandwidth. Apparently, pg_lzcompress takes 53s
to compress to 133MB, while gzip needs only 14s for 117MB. It might be
worth having a look at optimizing that, since it's also used in the
tuptoaster. Still, when network traffic is involved, it may be better to
spend some time on the server reducing the data volume (e.g. for Slony,
which uses COPY to start a replication, and is likely to be operated
over links <1GBit/s).
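
For comparison, a minimal standalone sketch of the "level 1" path
measured above, using zlib's compress2() (the buffer size and sample
input are made up; strictly speaking compress2() emits zlib-wrapped
deflate rather than the gzip file format, but the compression work is
the same):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* Made-up, highly compressible input, similar in spirit to the
     * bytea test data above. */
    uLong   src_len = 1024 * 1024;
    Bytef  *src = malloc(src_len);
    uLongf  dst_len;
    Bytef  *dst;

    memset(src, 'x', src_len);

    /* compressBound() gives the worst-case compressed size. */
    dst_len = compressBound(src_len);
    dst = malloc(dst_len);

    /* Level 1 = fastest, matching "gzip -1" in the numbers above. */
    if (compress2(dst, &dst_len, src, src_len, 1) != Z_OK)
    {
        fprintf(stderr, "compress2 failed\n");
        return 1;
    }
    printf("%lu -> %lu bytes\n", src_len, (unsigned long) dst_len);

    free(src);
    free(dst);
    return 0;
}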


The attached patch implements COPY ... WITH [BINARY] COMPRESSION 
(compression implies BINARY). The copy data uses bit 17 of the flag 
field to identify compressed data.
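
As an illustration of the flag-field handling, here is a minimal
standalone sketch of reading such a bit from the binary COPY header
(the 11-byte signature followed by a 32-bit flags field is the existing
binary COPY layout; the COMPRESSED macro name and the exact bit
position are assumptions based on the description above, with the
documented OID bit shown for reference):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag, per the description above: bit 17 of the binary
 * COPY flags field marks compressed data.  Bit 16 is the OID bit that
 * the documented binary format already uses. */
#define COPY_FLAG_OIDS       (1u << 16)
#define COPY_FLAG_COMPRESSED (1u << 17)

/* Read the 32-bit flags field that follows the 11-byte signature of a
 * binary COPY stream (sent in network byte order). */
static uint32_t copy_header_flags(const unsigned char *buf)
{
    return ((uint32_t) buf[11] << 24) | ((uint32_t) buf[12] << 16) |
           ((uint32_t) buf[13] << 8) | (uint32_t) buf[14];
}

int main(void)
{
    /* Made-up header: "PGCOPY\n\377\r\n\0" signature, then flags with
     * bit 17 set (0x00020000); unset bytes default to zero. */
    unsigned char hdr[15] = "PGCOPY\n\377\r\n";

    hdr[12] = 0x02;     /* flags occupy bytes 11..14, big-endian */

    printf("compressed: %s\n",
           (copy_header_flags(hdr) & COPY_FLAG_COMPRESSED) ? "yes" : "no");
    return 0;
}
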
The PROGRESS n option, which throws a notice every n lines, has a
caveat: when copying TO STDOUT, data transfer will cease after the first
notice is sent. This may either mean "don't ereport(NOTICE) when COPYing
data to the client" or a bug somewhere.
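
The progress logic itself is simple; a minimal standalone sketch (the
field names mirror the CopyStateData additions in the patch below, with
printf standing in for the backend's ereport(NOTICE), which is exactly
the call that trips up the TO STDOUT case):

#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the CopyStateData fields added by the patch. */
struct copy_state
{
    uint64_t    processed;  /* # of tuples processed */
    uint64_t    progress;   /* notice every # tuples, 0 = disabled */
};

/* Called once per copied row.  In the backend this would be an
 * ereport(NOTICE), which is what stops the TO STDOUT transfer. */
static void copy_row_done(struct copy_state *cstate)
{
    cstate->processed++;
    if (cstate->progress > 0 &&
        cstate->processed % cstate->progress == 0)
        printf("NOTICE: %llu rows processed\n",
               (unsigned long long) cstate->processed);
}

int main(void)
{
    struct copy_state cstate = {0, 1000};
    int     i;

    for (i = 0; i < 5000; i++)
        copy_row_done(&cstate);
    return 0;
}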


Regards,
Andreas
Index: src/backend/commands/copy.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.266
diff -c -r1.266 copy.c
*** src/backend/commands/copy.c	26 May 2006 22:50:02 -0000	1.266
--- src/backend/commands/copy.c	31 May 2006 08:52:42 -0000
***************
*** 47,53 ****
  #include "utils/memutils.h"
  #include "utils/relcache.h"
  #include "utils/syscache.h"
! 
  
  #define ISOCTAL(c) (((c) >= '0') && ((c) <= '7'))
  #define OCTVALUE(c) ((c) - '0')
--- 47,53 ----
  #include "utils/memutils.h"
  #include "utils/relcache.h"
  #include "utils/syscache.h"
! #include "utils/pg_lzcompress.h"
  
  #define ISOCTAL(c) (((c) >= '0') && ((c) <= '7'))
  #define OCTVALUE(c) ((c) - '0')
***************
*** 103,114 ****
--- 103,121 ----
  	int			client_encoding;	/* remote side's character encoding */
  	bool		need_transcoding;	/* client encoding diff from server? */
  	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte? */
+ 	bool		do_compress;	/* compress data before writing to output */
+ 	bool		do_flush;		/* flush fe_msgbuf to copy target file/pipe */
+ 	bool		use_raw_buf;	/* use raw buffered data for CopyGetData */
  	uint64		processed;		/* # of tuples processed */
+ 	uint64		progress;		/* progress notice each # tuples processed */
+ 
+ 	MemoryContext oldcontext;
  
  	/* parameters from the COPY command */
  	Relation	rel;			/* relation to copy to or from */
  	List	   *attnumlist;		/* integer list of attnums to copy */
  	bool		binary;			/* binary format? */
+ 	bool		compression;	/* binary compressed format? */
  	bool		oids;			/* include OIDs? */
  	bool		csv_mode;		/* Comma Separated Value format? */
  	bool		header_line;	/* CSV header line? */
***************
*** 153,162 ****
  	 * converts it.  Note: we guarantee that there is a \0 at
  	 * raw_buf[raw_buf_len].
  	 */
! #define RAW_BUF_SIZE 65536		/* we palloc RAW_BUF_SIZE+1 bytes */
  	char	   *raw_buf;
  	int			raw_buf_index;	/* next byte to process */
  	int			raw_buf_len;	/* total # of bytes stored */
  } CopyStateData;
  
  typedef CopyStateData *CopyState;
--- 160,170 ----
  	 * converts it.  Note: we guarantee that there is a \0 at
  	 * raw_buf[raw_buf_len].
  	 */
! #define RAW_BUF_SIZE 65536		/* initially, we palloc RAW_BUF_SIZE+1 bytes */
  	char	   *raw_buf;
  	int			raw_bu