Re: [PATCHES] COPY fast parse patch
On Thu, 2005-06-02 at 00:53 -0400, Alon Goldshuv wrote:
> My thoughts were -- see how the responses are, and if people think that
> this is a better way to go, then replace the COPY parsing logic with
> the new one.

If the new approach to parsing can handle all the cases that the old
approach can handle (WRT CSV, binary, encoding conversion and the like)
and is faster to boot, I don't see a reason not to replace the current
approach with something derived from your patch.

-Neil

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [PATCHES] COPY fast parse patch
Luke Lonergan said:
> Andrew,
>
>> I will be the first to admit that there are probably some very good
>> possibilities for optimisation of this code. My impression though has
>> been that in almost all cases it's fast enough anyway. I know that on
>> some very modest hardware I have managed to load a 6m row TPC
>> line-items table in just a few minutes. Before we start getting too
>> hung up, I'd be interested to know just how much data people want to
>> load and how fast they want it to be. If people have massive data
>> loads that take hours, days or weeks then it's obviously worth
>> improving if we can. I'm curious to know what size datasets people
>> are really handling this way.
>
> x0+ GB files are common in data warehousing. The issue is often
> whether we can load our data within the time allotted for the batch
> window, usually a matter of an hour or two.
>
> Assuming that TPC lineitem is 140 bytes/row, 6M rows in 3 minutes is
> 4.7 MB/s. To load a 10GB file at that rate takes about 2/3 hour. If
> one were to restore a 300GB database, it would take 18 hours.
> Maintenance operations are impractical after a few hours; 18 is a
> non-starter.
>
> In practice, we're usually replacing an Oracle system with PostgreSQL,
> and the load speed difference between the two is currently
> embarrassing and makes the work impractical.

OK ... that seems fair enough. The next question is: where does the data
being loaded come from? pg_dump? How does load speed compare with using
COPY's binary mode?

cheers

andrew
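The load-rate figures Luke quotes can be checked with a quick back-of-envelope script. The 140 bytes/row and "6M rows in 3 minutes" inputs are taken straight from the message; everything else is derived from them.

```python
# Back-of-envelope check of the load-rate figures quoted above.
row_bytes = 140
rows = 6_000_000
seconds = 3 * 60

# Sustained load rate implied by the 6M-row test.
mb_per_s = row_bytes * rows / seconds / 1e6   # ~4.7 MB/s

def hours_to_load(gigabytes, rate_mb_s=mb_per_s):
    """Hours needed to load a file of the given size at the given rate."""
    return gigabytes * 1e3 / rate_mb_s / 3600

print(round(mb_per_s, 1))           # 4.7
print(round(hours_to_load(10), 2))  # 0.6  (about 2/3 hour)
print(round(hours_to_load(300)))    # 18
```

The numbers in the message check out: ~4.7 MB/s, roughly two-thirds of an hour for 10GB, and about 18 hours for a 300GB restore.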
Re: [PATCHES] COPY fast parse patch
Andrew,

> OK ... that seems fair enough. The next question is: where does the
> data being loaded come from? pg_dump? How does load speed compare with
> using COPY's binary mode?

Oddly, our tests in the past have shown that binary is actually slower.

- Luke
Re: [PATCHES] COPY fast parse patch
Alon Goldshuv wrote:
> 5) Data integrity and escaping improvements. Treats all characters as
> data (unless it's an escaped delimiter or EOL), and therefore data
> integrity is preserved. However, some people who are already used to
> the postgres COPY escaping may want to keep it. They could do so by
> still using the old COPY.

OK, if the COPY file contains:

    \n\||\r\\a|\N

and the delimiter is '|', what does that represent? What if there were a
literal '\' followed by a '|' in the data table being dumped?

-- 
  Bruce Momjian                        http://candle.pha.pa.us
  pgman@candle.pha.pa.us               (610) 359-1001
  + If your life is a hard drive,      13 Roberts Road
  + Christ can be your backup.         Newtown Square, Pennsylvania 19073
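Under the existing COPY text-format escaping rules, Bruce's example is unambiguous: a backslash before the delimiter (or before another backslash) makes it literal data, and a field consisting of exactly `\N` is NULL. The sketch below is an illustrative simplification, not the backend code; `parse_copy_line` is a hypothetical name.

```python
def parse_copy_line(line, delim='|'):
    """Illustrative parser for one line of COPY text format:
    \\n, \\r, \\t become control characters; a backslash before the
    delimiter or another backslash yields that character as literal
    data; a field that is exactly \\N is NULL."""
    fields, buf, raw = [], [], []
    i = 0
    while i < len(line):
        c = line[i]
        if c == '\\' and i + 1 < len(line):
            nxt = line[i + 1]
            # Map known escapes; any other escaped char is taken literally.
            buf.append({'n': '\n', 'r': '\r', 't': '\t'}.get(nxt, nxt))
            raw.append(c + nxt)
            i += 2
        elif c == delim:
            fields.append(None if raw == ['\\N'] else ''.join(buf))
            buf, raw = [], []
            i += 1
        else:
            buf.append(c)
            raw.append(c)
            i += 1
    fields.append(None if raw == ['\\N'] else ''.join(buf))
    return fields

# Bruce's example: newline + literal '|', then CR + backslash + 'a',
# then NULL.
print(parse_copy_line(r'\n\||\r\\a|\N'))  # ['\n|', '\r\\a', None]
```

A literal '\' followed by '|' in the table data is dumped as `\\\|`, which round-trips unambiguously under these rules.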
Re: [PATCHES] COPY fast parse patch
On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
> 1) The patch includes 2 parallel parsing code paths. One is the
> regular COPY path that we all know, and the other is the improved one
> that I wrote. This is only temporary, as there is a lot of code
> duplication.

Right; I really dislike the idea of having two separate code paths for
COPY. When you say this approach is temporary, are you suggesting that
you intend to reimplement your changes as improvements/replacements of
the existing COPY code path rather than as a parallel code path?

> As a part of submitting this patch I also presented an argument for
> the use of a LOAD DATA command (in the NOLOGGING option thread). The
> points I made there are closely related to this message. There may be
> a valid argument that most of the points I raised could be implemented
> in the COPY code instead of a LOAD DATA command.

I'm definitely not keen to see a new LOAD DATA command.

> But that may not be a good idea for some and will also be problematic
> for backwards compatibility.

In what way would the performance improvements to COPY be backward
incompatible with the existing COPY behaviour?

-Neil
Re: [PATCHES] COPY fast parse patch
Neil Conway said:
> On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
>> 1) The patch includes 2 parallel parsing code paths. One is the
>> regular COPY path that we all know, and the other is the improved one
>> that I wrote. This is only temporary, as there is a lot of code
>> duplication.
>
> Right; I really dislike the idea of having two separate code paths for
> COPY. When you say this approach is temporary, are you suggesting that
> you intend to reimplement your changes as improvements/replacements of
> the existing COPY code path rather than as a parallel code path?

It's not an all or nothing deal. When we put in CSV handling, we
introduced two new routines for attribute input/output and otherwise
used the rest of the COPY code. When I did a fix for the multiline
problem, it was originally done with a separate read-line function for
CSV mode - Bruce didn't like that, so I merged it back into the existing
code. In retrospect, given this discussion, that might not have been an
optimal choice. But the point is that you can break out at several
levels.

Incidentally, there might be a good case for allowing the user to set
the line end explicitly, but you can't just hardwire it - we will get
massive Windows breakage. What is more, in CSV mode line-end sequences
can occur within logical lines. You need to take that into account. It's
tricky and easy to get badly wrong.

I will be the first to admit that there are probably some very good
possibilities for optimisation of this code. My impression though has
been that in almost all cases it's fast enough anyway. I know that on
some very modest hardware I have managed to load a 6m row TPC line-items
table in just a few minutes. Before we start getting too hung up, I'd be
interested to know just how much data people want to load and how fast
they want it to be. If people have massive data loads that take hours,
days or weeks then it's obviously worth improving if we can.

I'm curious to know what size datasets people are really handling this
way.

cheers

andrew
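The CSV multiline problem Andrew describes - line-end sequences occurring inside logical lines - can be illustrated with a simplified reader. This is only a sketch of the idea, not the backend's actual read-line logic, and `read_logical_lines` is a hypothetical name: a newline inside an open quoted field must not terminate the record.

```python
import io

def read_logical_lines(stream, quote='"'):
    """Yield logical CSV records: a physical newline inside an open
    quoted field does not end the record.  Doubled quotes ("") toggle
    the quote state twice, leaving it unchanged, as required."""
    record, in_quotes = [], False
    for physical in stream:
        for c in physical:
            if c == quote:
                in_quotes = not in_quotes
        record.append(physical)
        if not in_quotes:
            yield ''.join(record).rstrip('\r\n')
            record = []
    if record:  # unterminated final record
        yield ''.join(record).rstrip('\r\n')

# Two physical lines, one logical record, followed by a plain record.
print(list(read_logical_lines(io.StringIO('a,"x\ny",b\nc,d\n'))))
```

Note that a naive line-by-line reader would split the first record in the example at the embedded newline, which is exactly the bug Andrew's multiline fix addressed.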
Re: [PATCHES] COPY fast parse patch
Andrew,

> I will be the first to admit that there are probably some very good
> possibilities for optimisation of this code. My impression though has
> been that in almost all cases it's fast enough anyway. I know that on
> some very modest hardware I have managed to load a 6m row TPC
> line-items table in just a few minutes. Before we start getting too
> hung up, I'd be interested to know just how much data people want to
> load and how fast they want it to be. If people have massive data
> loads that take hours, days or weeks then it's obviously worth
> improving if we can. I'm curious to know what size datasets people are
> really handling this way.

x0+ GB files are common in data warehousing. The issue is often whether
we can load our data within the time allotted for the batch window,
usually a matter of an hour or two.

Assuming that TPC lineitem is 140 bytes/row, 6M rows in 3 minutes is
4.7 MB/s. To load a 10GB file at that rate takes about 2/3 hour. If one
were to restore a 300GB database, it would take 18 hours. Maintenance
operations are impractical after a few hours; 18 is a non-starter.

In practice, we're usually replacing an Oracle system with PostgreSQL,
and the load speed difference between the two is currently embarrassing
and makes the work impractical.

- Luke
Re: [PATCHES] COPY fast parse patch
Neil,

> Right; I really dislike the idea of having two separate code paths for
> COPY. When you say this approach is temporary, are you suggesting that
> you intend to reimplement your changes as improvements/replacements of
> the existing COPY code path rather than as a parallel code path?

My thoughts were -- see how the responses are, and if people think that
this is a better way to go, then replace the COPY parsing logic with the
new one. The whole escaping discussion that goes on is something else;
escapes could be implemented either way. The important thing I am trying
to show is that there is a much faster way to parse the data than doing
a char-by-char fetch-examine-load.

>> As a part of submitting this patch I also presented an argument for
>> the use of a LOAD DATA command (in the NOLOGGING option thread). The
>> points I made there are closely related to this message. There may be
>> a valid argument that most of the points I raised could be implemented
>> in the COPY code instead of a LOAD DATA command.
>
> I'm definitely not keen to see a new LOAD DATA command.

It seems that most people don't :-) I can see valid arguments both for
having it and for not having it.

>> But that may not be a good idea for some and will also be problematic
>> for backwards compatibility.
>
> In what way would the performance improvements to COPY be backward
> incompatible with the existing COPY behaviour?

That comment was in respect to the escape logic. You can regard it as
irrelevant for now as long as the escape discussion goes on in parallel.

Alon.
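The contrast Alon draws - char-by-char fetch-examine-load versus scanning ahead for the next delimiter and copying whole fields - can be sketched as below. This deliberately ignores escaping and quoting to isolate the scanning strategy; both function names are hypothetical, and `str.find` stands in for what would be `memchr` in the C implementation.

```python
def split_fields_charwise(line, delim='|'):
    # The char-by-char fetch-examine-load style: every character is
    # fetched, examined, and copied individually.
    fields, buf = [], []
    for c in line:
        if c == delim:
            fields.append(''.join(buf))
            buf = []
        else:
            buf.append(c)
    fields.append(''.join(buf))
    return fields

def split_fields_blockwise(line, delim='|'):
    # Jump straight to the next delimiter (str.find here plays the
    # role memchr would in C) and copy whole fields at a time.
    fields, start = [], 0
    while True:
        pos = line.find(delim, start)
        if pos == -1:
            fields.append(line[start:])
            return fields
        fields.append(line[start:pos])
        start = pos + 1

print(split_fields_blockwise('a|b|c'))  # ['a', 'b', 'c']
```

Both produce identical results; the win in C comes from the block-wise version letting an optimized library scan (and bulk copies) replace a per-byte loop with a branch on every character.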