Re: [PATCHES] COPY fast parse patch

2005-06-02 Thread Neil Conway
On Thu, 2005-06-02 at 00:53 -0400, Alon Goldshuv wrote:
 My thoughts were -- see how the responses are, and if people think
 that this is a better way to go, then replace the COPY parsing logic
 with the new one.

If the new approach to parsing can handle all the cases that the old
approach can handle (WRT CSV, binary, encoding conversion and the like)
and is faster to boot, I don't see a reason not to replace the current
approach with something derived from your patch.

-Neil





Re: [PATCHES] COPY fast parse patch

2005-06-02 Thread Andrew Dunstan
Luke Lonergan said:
 Andrew,

 I will be the first to admit that there are probably some very good
 possibilities for optimisation of this code. My impression though has
 been that in almost all cases it's fast enough anyway. I know that on
 some very modest hardware I have managed to load a 6m row TPC
 line-items table in just a few minutes. Before we start getting too
 hung up, I'd be interested to know just how much data people want to
 load and how fast they want it to be. If people have massive data
 loads that take hours, days or weeks then it's obviously worth
 improving if we can. I'm curious to know what size datasets people are
 really handling this way.

 Files of tens of gigabytes (x0+ GB) are common in data warehousing.
 The issue is often whether we can load our data within the time
 allotted for the batch window, usually a matter of an hour or two.

 Assuming that a TPC lineitem row is 140 bytes, 6M rows in 3 minutes is
 4.7 MB/s.  To load a 10GB file at that rate takes about 2/3 of an
 hour.  If one were to restore a 300GB database, it would take 18
 hours.  Maintenance operations are impractical after a few hours; 18
 is a non-starter.

 In practice, we're usually replacing an Oracle system with PostgreSQL,
 and the load speed difference between the two is currently embarrassing
 and makes the work impractical.



OK ... that seems fair enough. The next question is: where does the data
being loaded come from? pg_dump? How does load speed compare with using
COPY's binary mode?

cheers

andrew





Re: [PATCHES] COPY fast parse patch

2005-06-02 Thread Luke Lonergan
Andrew,

 OK ... that seems fair enough. The next question is: where does the data
 being loaded come from? pg_dump? How does load speed compare with using
 COPY's binary mode?

Oddly, our tests in the past have shown that binary is actually slower.

Luke





Re: [PATCHES] COPY fast parse patch

2005-06-01 Thread Bruce Momjian
Alon Goldshuv wrote:
 5) Data integrity and escaping improvements. All characters are treated
 as data (unless they form an escaped delimiter or EOL), and therefore
 data integrity is preserved. However, some people who are already used
 to the postgres COPY escaping rules may want to keep them. They could
 do so by still using the old COPY.

OK, if the COPY file contains:

\n\||\r\\a|\N

and the delimiter is '|', what does that represent?  What if there were
a literal '\' followed by a '|' in the data table being dumped?
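
For reference, here is how the current COPY text-format rules read that
line, assuming delimiter '|' (a minimal illustrative sketch in Python,
not the backend's C code; the helper names are made up):

    # Sketch of the current text-format rules: split on unescaped
    # delimiters first, then unescape each field.
    ESCAPES = {'n': '\n', 't': '\t', 'r': '\r', 'b': '\b',
               'f': '\f', 'v': '\v', '\\': '\\'}

    def copy_split(line, delim='|'):
        fields, cur, i = [], [], 0
        while i < len(line):
            if line[i] == '\\' and i + 1 < len(line):
                cur.append(line[i:i+2])   # keep the escape pair for now
                i += 2
            elif line[i] == delim:
                fields.append(''.join(cur)); cur = []
                i += 1
            else:
                cur.append(line[i]); i += 1
        fields.append(''.join(cur))
        return fields

    def unescape(field, delim='|'):
        if field == r'\N':
            return None                   # bare \N is the NULL marker
        out, i = [], 0
        while i < len(field):
            if field[i] == '\\' and i + 1 < len(field):
                nxt = field[i + 1]
                out.append(ESCAPES.get(nxt, nxt))  # \<delim> -> <delim>
                i += 2
            else:
                out.append(field[i]); i += 1
        return ''.join(out)

    line = r'\n\||\r\\a|\N'
    print([unescape(f) for f in copy_split(line)])
    # -> ['\n|', '\r\\a', None]  (newline+pipe, CR+backslash+a, NULL)

(Under the current rules a literal backslash followed by a literal '|'
dumps as \\\|, which round-trips unambiguously.)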

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [PATCHES] COPY fast parse patch

2005-06-01 Thread Neil Conway
On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
 1) The patch includes 2 parallel parsing code paths. One is the regular COPY
 path that we all know, and the other is the improved one that I wrote. This
 is only temporary, as there is a lot of code duplication

Right; I really dislike the idea of having two separate code paths for
COPY. When you say this approach is temporary, are you suggesting that
you intend to reimplement your changes as improvements/replacements of
the existing COPY code path rather than as a parallel code path?

 As a part of submitting this patch I also presented an argument for the use
 of a LOAD DATA command (in the NOLOGGING option thread). The points I made
 there are closely related to this message. There may be a valid argument
 that most of the points I raised could be implemented in the COPY code
 instead of a LOAD DATA command

I'm definitely not keen to see a new LOAD DATA command.

 But that may not be a good idea for some and will also be problematic
 for backwards compatibility.

In what way would the performance improvements to COPY be backward
incompatible with the existing COPY behaviour?

-Neil





Re: [PATCHES] COPY fast parse patch

2005-06-01 Thread Andrew Dunstan
Neil Conway said:
 On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
 1) The patch includes 2 parallel parsing code paths. One is the
 regular COPY path that we all know, and the other is the improved one
 that I wrote. This is only temporary, as there is a lot of code
 duplication

 Right; I really dislike the idea of having two separate code paths for
 COPY. When you say this approach is temporary, are you suggesting
 that you intend to reimplement your changes as
 improvements/replacements of the existing COPY code path rather than as
 a parallel code path?


It's not an all-or-nothing deal. When we put in CSV handling, we introduced
two new routines for attribute input/output and otherwise used the rest of
the COPY code. When I did a fix for the multiline problem, it was originally
done with a separate read-line function for CSV mode - Bruce didn't like
that, so I merged it back into the existing code. In retrospect, given this
discussion, that might not have been an optimal choice. But the point is
that you can break out at several levels.

Incidentally, there might be a good case for allowing the user to set the
line end explicitly, but you can't just hardwire it - we would get massive
Windows breakage. What's more, in CSV mode line end sequences can occur
within logical lines, and you need to take that into account. It's tricky
and easy to get badly wrong.
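
To illustrate, here is a minimal sketch (illustrative Python, nothing like
the actual COPY code) of a CSV-aware line reader: it has to track quote
state before it can decide that a line end really ends the record:

    import io

    def read_logical_line(f, quote='"'):
        buf, in_quotes = [], False
        while True:
            ch = f.read(1)
            if ch == '':                   # EOF
                return ''.join(buf) if buf else None
            if ch == '\n' and not in_quotes:
                return ''.join(buf)        # a real record boundary
            if ch == quote:
                in_quotes = not in_quotes  # a doubled "" toggles twice
            buf.append(ch)                 # (CR/LF variants omitted here)

    data = io.StringIO('1,"two\nlines",3\n4,five,6\n')
    print(repr(read_logical_line(data)))   # '1,"two\nlines",3'
    print(repr(read_logical_line(data)))   # '4,five,6'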

I will be the first to admit that there are probably some very good
possibilities for optimisation of this code. My impression though has been
that in almost all cases it's fast enough anyway. I know that on some very
modest hardware I have managed to load a 6m row TPC line-items table in just
a few minutes. Before we start getting too hung up, I'd be interested to
know just how much data people want to load and how fast they want it to be.
If people have massive data loads that take hours, days or weeks then it's
obviously worth improving if we can. I'm curious to know what size datasets
people are really handling this way.

cheers

andrew





Re: [PATCHES] COPY fast parse patch

2005-06-01 Thread Luke Lonergan
Andrew,

 I will be the first to admit that there are probably some very good
 possibilities for optimisation of this code. My impression though has been
 that in almost all cases it's fast enough anyway. I know that on some very
 modest hardware I have managed to load a 6m row TPC line-items table in just
 a few minutes. Before we start getting too hung up, I'd be interested to
 know just how much data people want to load and how fast they want it to be.
 If people have massive data loads that take hours, days or weeks then it's
 obviously worth improving if we can. I'm curious to know what size datasets
 people are really handling this way.

Files of tens of gigabytes (x0+ GB) are common in data warehousing.  The
issue is often whether we can load our data within the time allotted for
the batch window, usually a matter of an hour or two.

Assuming that a TPC lineitem row is 140 bytes, 6M rows in 3 minutes is 4.7
MB/s.  To load a 10GB file at that rate takes about 2/3 of an hour.  If one
were to restore a 300GB database, it would take 18 hours.  Maintenance
operations are impractical after a few hours; 18 is a non-starter.
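
Checking that arithmetic (illustrative Python, numbers from above):

    rows, bytes_per_row, secs = 6e6, 140, 3 * 60
    rate = rows * bytes_per_row / secs          # bytes/sec
    print(rate / 1e6)                           # ~4.7 MB/s
    print(10e9 / rate / 3600)                   # 10GB file: ~0.6 hours
    print(300e9 / rate / 3600)                  # 300GB restore: ~17.9 hours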

In practice, we're usually replacing an Oracle system with PostgreSQL, and
the load speed difference between the two is currently embarrassing and
makes the work impractical.

- Luke





Re: [PATCHES] COPY fast parse patch

2005-06-01 Thread Alon Goldshuv
Neil,

Right; I really dislike the idea of having two separate code paths for
COPY. When you say this approach is temporary, are you suggesting that
you intend to reimplement your changes as improvements/replacements of
the existing COPY code path rather than as a parallel code path?


My thoughts were -- see how the responses are, and if people think that this
is a better way to go, then replace the COPY parsing logic with the new one.
The whole escaping discussion that is going on is something else; escapes
could be implemented either way. The important thing I am trying to show is
that there is a much faster way to parse the data than doing a char-by-char
fetch-examine-load.
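
The idea, roughly (illustrative Python; the patch itself works on C
buffers, and these helper names are made up):

    # Instead of fetching and examining one character at a time, scan for
    # the next "special" byte (delimiter or escape) and copy everything
    # before it in a single move.
    def parse_line_bulk(line, delim='|', escape='\\'):
        fields, cur, i = [], [], 0
        while i < len(line):
            hits = [k for k in (line.find(delim, i), line.find(escape, i))
                    if k != -1]
            j = min(hits) if hits else len(line)
            cur.append(line[i:j])          # bulk copy, no per-char loop
            if j == len(line):
                break
            if line[j] == escape:
                cur.append(line[j+1:j+2])  # escaped char taken literally
                i = j + 2
            else:                          # unescaped delimiter
                fields.append(''.join(cur)); cur = []
                i = j + 1
        fields.append(''.join(cur))
        return fields

    print(parse_line_bulk(r'abc\|def|ghi'))   # ['abc|def', 'ghi']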


 As a part of submitting this patch I also presented an argument for the use
 of a LOAD DATA command (in the NOLOGGING option thread). The points I made
 there are closely related to this message. There may be a valid argument
 that most of the points I raised could be implemented in the COPY code
 instead of a LOAD DATA command

I'm definitely not keen to see a new LOAD DATA command.


It seems that most people don't want one :-) I can see valid arguments both
for having it and for not having it.



 But that may not be a good idea for some and will also be problematic
 for backwards compatibility.

In what way would the performance improvements to COPY be backward
incompatible with the existing COPY behaviour?


That comment was with respect to the escape logic. You can regard it as
irrelevant for now, as long as the escape discussion goes on in parallel.

Alon.

