Re: [PATCHES] Libpq COPY optimization patch
Libpq COPY speedup patch attached. No input buffer logic is involved; it just removes the expensive PQconsumeInput call made on every PQputCopyData call and leaves parseInput as is, as discussed before.

Alon.

(attachment: libpq_copy.patch)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [EMAIL PROTECTED] so that your
       message can get through to the mailing list cleanly
[PATCHES] Libpq COPY optimization patch
Here is a patch against today's code (1/24). As discussed in -hackers, consumeInput/parse is no longer called every single time; it is now called only when the data is actually sent to the server.

Alon.

(attachment: pq_put_copy_data.patch)
Re: [PATCHES] Libpq COPY optimization patch
Tom,

> > Here is a patch against today's code (1/24). As discussed in -hackers,
> > consumeInput/parse is no longer called every single time; it is now
> > called only when the data is sent to the server.
>
> This appears to be the exact same patch you sent before. Did you test my
> suggestion of simply removing the PQconsumeInput call? I see no reason
> to add it inside the loop.

My mistake; I'll make the correction. I guess that although parseInput is cheap, we could still use a conditional to see when data was sent, and only then call it (without PQconsumeInput), instead of calling it every single time PQputCopyData is called. Any objection to that?

Alon.
Re: [PATCHES] Libpq COPY optimization patch
> You mean something like
>
>     if (input-buffer-not-empty)
>         parseInput();
>
> ? This still bothers me a bit, since it's a mixing of logic levels:
> PQputCopyData is an output routine; it shouldn't be getting its fingers
> dirty with input buffer contents. I'm willing to tolerate this if it can
> be demonstrated that it provides a useful performance gain compared to
> the unconditional parseInput call, but let's see some numbers.

Yes, I understand. We'll see what the performance gain is like and whether it's worth it; I'll report back.

Alon.
Re: [PATCHES] COPY FROM performance improvements
I did some performance checks after the recent code commit. The good news is that the parsing speed of COPY is now MUCH faster, which is great. It is about 5 times faster: about 100MB/sec on my machine (previously 20MB/sec at best, usually less).

The better news is that my original patch's parsing speed reaches 120MB/sec, about 20MB/sec faster than the code that's now in CVS. This can be significant in the long run and for large data sets. Maybe we can improve the current code a bit more to reach this number.

I performed those measurements by executing *only the parsing logic* of the COPY pipeline. All data conversion (FunctionCall3(string, ...)), tuple handling (heap_formtuple etc...) and insertion were manually disabled. So the only code measured is reading from disk and parsing to the attribute level.

Cheers,
Alon.

On 8/7/05 1:21 AM, Luke Lonergan wrote:

> Tom,
>
> On 8/6/05 9:08 PM, Tom Lane wrote:
>
> > I had some difficulty in generating test cases that weren't largely
> > I/O-bound, but AFAICT the patch as applied is about the same speed as
> > what you submitted.
>
> You achieve the important objective of knocking the parsing stage down a
> lot, but your parsing code is actually about 20% slower than Alon's.
>
> > I would like to see the exact test case you are using to make this
> > claim; the tests I did suggested my code is the same speed or faster.
>
> I showed mine - you show yours :-)
>
> > Apparently our e-mail crossed. As best I can tell, my version of
> > CopyReadAttributes is significantly quicker than Alon's, approximately
> > balancing out the fact that my version of CopyReadLine is slower. I did
> > the latter first, and would now be tempted to rewrite it in the same
> > style as CopyReadAttributes, ie one pass of memory-to-memory copy
> > using pointers rather than buffer indexes.
>
> See previous timings - looks like Alon's parsing is substantially
> faster.
> However, I'd like him to confirm by running with the shunt placed at
> different stages, in this case between parse and attribute conversion
> (not attribute parse).
>
> > BTW, late today I figured out a way to get fairly reproducible
> > non-I/O-bound numbers about COPY FROM: use a trigger that suppresses
> > the actual inserts, thus:
> >
> >     create table foo ...
> >     create function noway() returns trigger as
> >         'begin return null; end' language plpgsql;
> >     create trigger noway before insert on foo
> >         for each row execute procedure noway();
> >
> > then repeat:
> >
> >     copy foo from '/tmp/foo.data';
>
> Cool! That's a better way than hacking code and inserting shunts. Alon
> will likely hit this tomorrow.
>
> - Luke
Re: [PERFORM] [PATCHES] COPY FROM performance improvements
Tom,

Thanks for pointing it out. I made the small required modifications to match copy.c version 1.247 and sent it to the -patches list. The new patch is V16.

Alon.

On 8/1/05 7:51 PM, Tom Lane wrote:

> Alon Goldshuv writes:
>
> > > This patch appears to reverse out the most recent committed changes
> > > in copy.c.
> >
> > Which changes do you refer to? I thought I accommodated all the recent
> > changes (I recall some changes to the tupletable/tupleslot interface,
> > HEADER in CSV, and hex escapes, and maybe one or two more). What did I
> > miss?
>
> The latest touch of copy.c, namely this patch:
>
>     2005-07-10 17:13  tgl
>
>         * doc/src/sgml/ref/create_type.sgml, src/backend/commands/copy.c,
>         src/backend/commands/typecmds.c, src/backend/tcop/fastpath.c,
>         src/backend/tcop/postgres.c, src/backend/utils/adt/arrayfuncs.c,
>         src/backend/utils/adt/date.c, src/backend/utils/adt/numeric.c,
>         src/backend/utils/adt/rowtypes.c,
>         src/backend/utils/adt/timestamp.c,
>         src/backend/utils/adt/varbit.c, src/backend/utils/adt/varchar.c,
>         src/backend/utils/adt/varlena.c, src/backend/utils/mb/mbutils.c,
>         src/include/catalog/catversion.h, src/include/catalog/pg_proc.h,
>         src/test/regress/expected/type_sanity.out,
>         src/test/regress/sql/type_sanity.sql:
>
>         Change typreceive function API so that receive functions get the
>         same optional arguments as text input functions, ie, typioparam
>         OID and atttypmod. Make all the datatypes that use typmod
>         enforce it the same way in typreceive as they do in typinput.
>         This fixes a problem with failure to enforce length restrictions
>         during COPY FROM BINARY.
>
> It was rather obvious, given that the first chunk of the patch backed up
> the file's CVS version stamp from 1.247 to 1.246 :-(
>
> regards, tom lane
Re: [PATCHES] COPY FROM performance improvements
Hi Mark,

I improved the data *parsing* capabilities of COPY, and didn't touch the data conversion or data insertion parts of the code. The parsing improvement will vary largely depending on the ratio of parsing to converting and inserting. Therefore, the speed increase really depends on the nature of your data.

A 100GB file with:

- long data rows (lots of parsing)
- a small number of columns (few attribute conversions per row)
- fewer rows (fewer tuple insertions)

will show the best performance improvements. However, the same 100GB file with:

- short data rows (minimal parsing)
- a large number of columns (many attribute conversions per row), and/or
- more rows (more tuple insertions)

will show improvements, but not as significant. In general I'd estimate a 40%-95% improvement in load speed for the first case and 10%-40% for the second, but that also depends on the hardware, disk speed, etc... This is for the TEXT format. CSV may be faster, but not by as much as I specified here. BINARY will stay the same as before.

HTH,
Alon.

On 7/19/05 12:54 PM, Mark Wong wrote:

> On Thu, 14 Jul 2005 17:22:18 -0700, Alon Goldshuv wrote:
>
> > I revisited my patch and removed the code duplications that were
> > there, and added support for CSV with buffered input, so CSV now runs
> > faster too (although it is not as optimized as the TEXT format
> > parsing). So now TEXT, CSV and BINARY are all parsed in CopyFrom(),
> > like in the original file.
>
> Hi Alon,
>
> I'm curious, what kind of system are you testing this on? I'm trying to
> load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
> interested in the results you would expect.
>
> Mark
Re: [PATCHES] COPY FROM performance improvements
Mark,

Thanks for the info. Yes, isolating indexes out of the picture is a good idea for this purpose.

I can't really give a guess as to how fast the load rate should be. I don't know how your system is configured or all its hardware characteristics (and even if I knew that info, I may not be able to guess...). I am pretty confident that the load will be faster than before; I'll risk that ;-) Looking at your TPC-H sizes and metadata, I'll estimate that partsupp, customer and orders will have the most significant increase in load rate. You could start with those. I guess the only way to really know is to try... Load several times with the existing PG COPY, then load several times with the patched COPY, and compare. I'll be curious to hear your results.

Thx,
Alon.

On 7/19/05 2:37 PM, Mark Wong wrote:

> Hi Alon,
>
> Yeah, that helps. I just need to break up my scripts a little to just
> load the data and not build indexes. Is the following information good
> enough to give a guess about the data I'm loading, if you don't mind?
> ;) Here's a link to my script to create tables:
>
> http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432path=scripts/pgsql/create_tables.sh.in
>
> File sizes:
>
>     -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
>     -rw-r--r-- 1 markw 50  74G Jul 8 15:03 lineitem.tbl
>     -rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
>     -rw-r--r-- 1 markw 50  17G Jul 8 15:03 orders.tbl
>     -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
>     -rw-r--r-- 1 markw 50  12G Jul 8 15:03 partsupp.tbl
>     -rw-r--r-- 1 markw 50  391 Jul 8 15:03 region.tbl
>     -rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
>
> Number of rows:
>
>     # wc -l *.tbl
>     1500 customer.tbl
>     600037902 lineitem.tbl
>     25 nation.tbl
>     15000 orders.tbl
>     2000 part.tbl
>     8000 partsupp.tbl
>     5 region.tbl
>     100 supplier.tbl
>
> Thanks,
> Mark
Re: [PATCHES] COPY fast parse patch
Neil,

> Right; I really dislike the idea of having two separate code paths for
> COPY. When you say this approach is temporary, are you suggesting that
> you intend to reimplement your changes as improvements/replacements of
> the existing COPY code path, rather than as a parallel code path?

My thoughts were: see how the responses are, and if people think this is a better way to go, then replace the COPY parsing logic with the new one. The whole escaping discussion that goes on is something else; escapes could be implemented either way. The important thing I am trying to show is that there is a much faster way to parse the data than doing a char-by-char fetch-examine-load.

> > As a part of submitting this patch I also presented an argument for a
> > use of a LOAD DATA command (in the NOLOGGING option thread). The
> > points I made there are closely related to this message. There may be
> > a valid argument that most of the points I raised could be implemented
> > in the COPY code instead of a LOAD DATA command
>
> I'm definitely not keen to see a new LOAD DATA command.

It seems that most people aren't :-) I can see valid arguments both for having it and for not having it.

> > But that may not be a good idea for some and will also be problematic
> > for backwards compatibility.
>
> In what way would the performance improvements to COPY be backward
> incompatible with the existing COPY behaviour?

That comment was in respect to the escape logic. You can regard it as irrelevant for now, as long as the escape discussion goes on in parallel.

Alon.

> -Neil