Re: [HACKERS] CopyReadLineText optimization revisited

2011-02-22 Thread Robert Haas
On Fri, Feb 18, 2011 at 9:35 PM, Bruce Momjian br...@momjian.us wrote:
 Was this implemented?  Is it a TODO?

It's not entirely clear to me that there's a meaningful win here.
Speeding up COPY is already on the TODO list, with this link:

http://archives.postgresql.org/pgsql-hackers/2008-02/msg00954.php

...and it may be that some of the ideas proposed there are more
worthwhile than this one.

But maybe that same TODO item could also get a link to the start of this thread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] CopyReadLineText optimization revisited

2011-02-22 Thread Bruce Momjian
Robert Haas wrote:
 On Fri, Feb 18, 2011 at 9:35 PM, Bruce Momjian br...@momjian.us wrote:
  Was this implemented?  Is it a TODO?
 
 It's not entirely clear to me that there's a meaningful win here.
 Speeding up COPY is already on the TODO list, with this link:
 
 http://archives.postgresql.org/pgsql-hackers/2008-02/msg00954.php
 
 ...and it may be that some of the ideas proposed there are more
 worthwhile than this one.
 
 But maybe that same TODO item could also get a link to the start of
 this thread.

OK, I added a link to this thread to the existing COPY TODO performance
item.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] CopyReadLineText optimization revisited

2011-02-18 Thread Bruce Momjian

Was this implemented?  Is it a TODO?

---

Heikki Linnakangas wrote:
 I'm reviving the effort I started a while back to make COPY faster:
 
 http://archives.postgresql.org/pgsql-patches/2008-02/msg00100.php
 http://archives.postgresql.org/pgsql-patches/2008-03/msg00015.php
 
 The patch I now have is based on using memchr() to search for line endings. 
 In a nutshell:
 
 * we perform possible encoding conversion early, one input block at a 
 time, rather than after splitting the input into lines. This allows us 
 to assume in the later stages that the data is in server encoding, 
 allowing us to search for the '\n' byte without worrying about 
 multi-byte characters.
 
 * instead of the byte-at-a-time loop in CopyReadLineText(), use memchr() 
 to find the next NL/CR character. This is where the speedup comes from. 
 Unfortunately we can't do that in the CSV codepath, because newlines can 
 be embedded in quoted fields, so that's unchanged.
 
 These changes seem to give an overall speedup of 0-10%, 
 depending on the shape of the table. I tested various tables from the 
 TPC-H schema, and a narrow table consisting of just one short text column.
 
 I can't think of a case where these changes would be a net loss in 
 performance, and it didn't perform worse on any of the cases I tested 
 either.
 
 There's a small fly in the ointment: the patch won't recognize backslash 
 followed by a linefeed as an escaped linefeed. I think we should simply 
 drop support for that. The docs already say:
 
  It is strongly recommended that applications generating COPY data convert 
  data newlines and carriage returns to the \n and \r sequences respectively. 
  At present it is possible to represent a data carriage return by a 
  backslash and carriage return, and to represent a data newline by a 
  backslash and newline. However, these representations might not be accepted 
  in future releases. They are also highly vulnerable to corruption if the 
  COPY file is transferred across different machines (for example, from Unix 
  to Windows or vice versa).
 
 I vaguely recall that we discussed this some time ago already and agreed 
 that we can drop it if it makes life easier.
 
 This patch is in pretty good shape; however, it needs to be tested with 
 different exotic input formats. Also, the loop in CopyReadLineText could 
 probably be cleaned up a bit: some of the uglifications that were done 
 for performance reasons in the old code are no longer necessary, as 
 memchr() is doing the heavy lifting and the loop only iterates 1-2 times 
 per line in typical cases.
 
 
 It's not strictly necessary, but how about dropping support for the old 
 COPY protocol, and the EOF marker \. while we're at it? It would allow 
 us to drop some code, making the remaining code simpler, and reduce the 
 testing effort. Thoughts on that?
 
 -- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

[ Attachment, skipping... ]

 

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] CopyReadLineText optimization revisited

2010-08-27 Thread Dimitri Fontaine
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Ok. If we have to, we can keep that; it just requires more
 programming. After searching for a \n, we can peek at the previous byte to
 check if it's a backslash (and if it is, the one before that to see if it's
 a backslash too, and so forth until we find a non-backslash).

That's what pgloader does to allow for non-quoted fields containing
escaped separators in some contrived input formats (UNLOAD from Informix,
I'm looking at you).

I guess the same kind of trick could be applied to CSV too, but it'd
be necessary to search back to the previous \n and count the QUOTE chars
you find, which does not sound like a huge win, even if you remember the
state at the last quoted \n.
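
A minimal sketch of that parity check (illustrative C, not pgloader's
actual code; a real CSV parser would also have to account for a
distinct ESCAPE character):

#include <stdbool.h>

/*
 * Decide whether a '\n' found by memchr() is a real end of line or is
 * embedded in a quoted field: count the QUOTE characters between the
 * last known line start and the newline.  An odd count means the
 * newline sits inside an open quoted field; doubled quotes ("") keep
 * the parity consistent, since they add two to the count.
 */
static bool
newline_inside_quotes(const char *line_start, const char *nl, char quote)
{
    int         nquotes = 0;
    const char *p;

    for (p = line_start; p < nl; p++)
    {
        if (*p == quote)
            nquotes++;
    }
    return (nquotes % 2) == 1;
}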

Fancy format parsing ain't fun.

Regards,
-- 
dim



[HACKERS] CopyReadLineText optimization revisited

2010-08-26 Thread Heikki Linnakangas

I'm reviving the effort I started a while back to make COPY faster:

http://archives.postgresql.org/pgsql-patches/2008-02/msg00100.php
http://archives.postgresql.org/pgsql-patches/2008-03/msg00015.php

The patch I now have is based on using memchr() to search for line endings. 
In a nutshell:


* we perform possible encoding conversion early, one input block at a 
time, rather than after splitting the input into lines. This allows us 
to assume in the later stages that the data is in server encoding, 
allowing us to search for the '\n' byte without worrying about 
multi-byte characters.


* instead of the byte-at-a-time loop in CopyReadLineText(), use memchr() 
to find the next NL/CR character. This is where the speedup comes from. 
Unfortunately we can't do that in the CSV codepath, because newlines can 
be embedded in quoted fields, so that's unchanged.
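
In rough terms the scan becomes something like the sketch below
(illustrative only: it handles just '\n' for brevity, the real loop
also has to consider '\r', and next_line_end() is an invented helper,
not code from the patch):

#include <string.h>

/*
 * Return the offset just past the next '\n' in buf[start..len), or -1
 * if no newline is present and more data must be read.  Because the
 * buffer is already in server encoding, a '\n' byte can only be a
 * line terminator, never a trailing byte of a multi-byte character.
 */
static int
next_line_end(const char *buf, int start, int len)
{
    const char *nl = memchr(buf + start, '\n', len - start);

    if (nl == NULL)
        return -1;
    return (int) (nl - buf) + 1;
}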


These changes seem to give an overall speedup of 0-10%, 
depending on the shape of the table. I tested various tables from the 
TPC-H schema, and a narrow table consisting of just one short text column.


I can't think of a case where these changes would be a net loss in 
performance, and it didn't perform worse on any of the cases I tested 
either.


There's a small fly in the ointment: the patch won't recognize backslash 
followed by a linefeed as an escaped linefeed. I think we should simply 
drop support for that. The docs already say:



It is strongly recommended that applications generating COPY data convert data 
newlines and carriage returns to the \n and \r sequences respectively. At 
present it is possible to represent a data carriage return by a backslash and 
carriage return, and to represent a data newline by a backslash and newline. 
However, these representations might not be accepted in future releases. They 
are also highly vulnerable to corruption if the COPY file is transferred across 
different machines (for example, from Unix to Windows or vice versa).


I vaguely recall that we discussed this some time ago already and agreed 
that we can drop it if it makes life easier.


This patch is in pretty good shape; however, it needs to be tested with 
different exotic input formats. Also, the loop in CopyReadLineText could 
probably be cleaned up a bit: some of the uglifications that were done 
for performance reasons in the old code are no longer necessary, as 
memchr() is doing the heavy lifting and the loop only iterates 1-2 times 
per line in typical cases.



It's not strictly necessary, but how about dropping support for the old 
COPY protocol, and the EOF marker \. while we're at it? It would allow 
us to drop some code, making the remaining code simpler, and reduce the 
testing effort. Thoughts on that?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***
*** 143,167  typedef struct CopyStateData
  
  	/*
  	 * Similarly, line_buf holds the whole input line being processed. The
! 	 * input cycle is first to read the whole line into line_buf, convert it
! 	 * to server encoding there, and then extract the individual attribute
! 	 * fields into attribute_buf.  line_buf is preserved unmodified so that we
! 	 * can display it in error messages if appropriate.
  	 */
  	StringInfoData line_buf;
- 	bool		line_buf_converted;		/* converted to server encoding? */
  
  	/*
! 	 * Finally, raw_buf holds raw data read from the data source (file or
! 	 * client connection).	CopyReadLine parses this data sufficiently to
! 	 * locate line boundaries, then transfers the data to line_buf and
! 	 * converts it.  Note: we guarantee that there is a \0 at
! 	 * raw_buf[raw_buf_len].
  	 */
  #define RAW_BUF_SIZE 65536		/* we palloc RAW_BUF_SIZE+1 bytes */
  	char	   *raw_buf;
  	int			raw_buf_index;	/* next byte to process */
  	int			raw_buf_len;	/* total # of bytes stored */
  } CopyStateData;
  
  typedef CopyStateData *CopyState;
--- 143,175 
  
  	/*
  	 * Similarly, line_buf holds the whole input line being processed. The
! 	 * input cycle is first to convert the input to server encoding in
! 	 * raw_buf, then read the whole line into line_buf, and then extract the
! 	 * individual attribute fields into attribute_buf.  line_buf is preserved
! 	 * unmodified so that we can display it in error messages if appropriate.
  	 */
  	StringInfoData line_buf;
  
  	/*
! 	 * raw_buf holds raw data read from the data source (file or client
! 	 * connection), converted to server encoding if necessary. CopyReadLine
! 	 * parses this data sufficiently to locate line boundaries, then
! 	 * transfers the data to line_buf.  Note: we guarantee that there is
! 	 * a \0 at raw_buf[raw_buf_len].
  	 */
  #define RAW_BUF_SIZE 65536		/* we palloc RAW_BUF_SIZE+1 bytes */
  	char	   *raw_buf;
  	int			raw_buf_index;	/* next byte to process */
  	int			raw_buf_len;	/* total # of bytes stored */
+ 
+ 	/*
+ 	 * Finally, unconverted_buf holds residual raw data read from 

Re: [HACKERS] CopyReadLineText optimization revisited

2010-08-26 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 * we perform possible encoding conversion early, one input block at a 
 time, rather than after splitting the input into lines. This allows us 
 to assume in the later stages that the data is in server encoding, 
 allowing us to search for the '\n' byte without worrying about 
 multi-byte characters.

Seems reasonable, although the need to deal with multibyte characters
crossing a block boundary injects some ugliness that wasn't there before.
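
Concretely, the carry-over that per-block conversion requires looks
roughly like this (an illustrative sketch, not the patch's code;
char_len() is hardwired for UTF-8 here, standing in for a real
per-encoding length function such as pg_encoding_mblen()):

/* bytes in the character starting at *p, assuming valid UTF-8 */
static int
char_len(const char *p)
{
    unsigned char c = (unsigned char) *p;

    if (c < 0x80)
        return 1;
    if (c < 0xE0)
        return 2;
    if (c < 0xF0)
        return 3;
    return 4;
}

/*
 * Return how many bytes at the start of buf form complete characters.
 * The caller converts buf[0..result) and carries the remaining bytes,
 * a character cut off by the block boundary, over to the next block.
 */
static int
complete_prefix_len(const char *buf, int len)
{
    int         i = 0;

    while (i < len)
    {
        int         w = char_len(buf + i);

        if (i + w > len)
            break;              /* character continues in the next block */
        i += w;
    }
    return i;
}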

 * instead of the byte-at-a-time loop in CopyReadLineText(), use memchr() 
 to find the next NL/CR character. This is where the speedup comes from. 

It seems like any speedup from that would be extremely
platform-dependent.  What have you tested on?

 There's a small fly in the ointment: the patch won't recognize backslash 
 followed by a linefeed as an escaped linefeed. I think we should simply 
 drop support for that.

I think this is likely to break apps that have worked for years.  I
can't get excited about doing that in return for a 0-10% speedup
that might only materialize on some platforms.  If the numbers were
better, it'd be worth paying that price, but ...

 It's not strictly necessary, but how about dropping support for the old 
 COPY protocol, and the EOF marker \. while we're at it? It would allow 
 us to drop some code, making the remaining code simpler, and reduce the 
 testing effort. Thoughts on that?

Again, I think the threshold requirement for breaking compatibility
needs to be a lot higher than this.

regards, tom lane



Re: [HACKERS] CopyReadLineText optimization revisited

2010-08-26 Thread Heikki Linnakangas

On 26/08/10 22:16, Tom Lane wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  * instead of the byte-at-a-time loop in CopyReadLineText(), use memchr()
  to find the next NL/CR character. This is where the speedup comes from.

 It seems like any speedup from that would be extremely
 platform-dependent.  What have you tested on?

Only on my 32-bit Intel Linux laptop.

If anyone out there has more interesting platforms to test on, that
would be appreciated.

  There's a small fly in the ointment: the patch won't recognize backslash
  followed by a linefeed as an escaped linefeed. I think we should simply
  drop support for that.

 I think this is likely to break apps that have worked for years.  I
 can't get excited about doing that in return for a 0-10% speedup
 that might only materialize on some platforms.  If the numbers were
 better, it'd be worth paying that price, but ...

Ok. If we have to, we can keep that; it just requires more programming.
After searching for a \n, we can peek at the previous byte to check if
it's a backslash (and if it is, the one before that to see if it's a
backslash too, and so forth until we find a non-backslash).
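
As a sketch (illustrative, not code from the patch), the peek amounts
to counting the backslashes immediately before the newline; an odd
count means the newline is escaped:

#include <stdbool.h>

/*
 * start points at the beginning of the buffered line data, nl at the
 * '\n' that memchr() found.
 */
static bool
newline_is_escaped(const char *start, const char *nl)
{
    int         n = 0;

    while (nl - n > start && nl[-1 - n] == '\\')
        n++;
    return (n % 2) == 1;
}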


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CopyReadLineText optimization revisited

2010-08-26 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 26/08/10 22:16, Tom Lane wrote:
  I think this is likely to break apps that have worked for years.  I
  can't get excited about doing that in return for a 0-10% speedup
  that might only materialize on some platforms.  If the numbers were
  better, it'd be worth paying that price, but ...

 Ok. If we have to, we can keep that; it just requires more programming.
 After searching for a \n, we can peek at the previous byte to check if
 it's a backslash (and if it is, the one before that to see if it's a
 backslash too, and so forth until we find a non-backslash).

I seem to recall that this same problem was discussed before, and we
concluded that you couldn't reliably parse backwards to figure out
whether the newline was backslashed.  Although that might have been
in the context of data still in client encoding, where you have the
extra frammish that a backslash could be a non-first byte of a
character.  Anyway it'd be a good idea to recheck those old discussions
if you can find 'em.

Another approach that came to me was to parse forwards, and if we find
a backslash at the end of the line, then conclude that we had a
backslashed newline and slurp in another line's worth of data at that
time.  I'm not sure how much restructuring would be needed to make that
feasible, but it seems worth considering.
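
A sketch of that forward scan (illustrative only): track the escape
state from the left, the way the field parser already does, and if the
line ends with an escape pending, the newline was backslashed and
another line's worth of data needs to be slurped in:

#include <stdbool.h>
#include <stddef.h>

/*
 * buf[0..len) is a candidate line, not including its terminating
 * '\n'.  Returns true if that newline was escaped, i.e. the line
 * continues in the next chunk of input.
 */
static bool
line_continues(const char *buf, size_t len)
{
    bool        escaped = false;    /* is the next byte escaped? */
    size_t      i;

    for (i = 0; i < len; i++)
        escaped = (!escaped && buf[i] == '\\');

    return escaped;
}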

regards, tom lane
