Re: [PATCHES] Libpq COPY optimization patch

2006-01-25 Thread Alon Goldshuv
Libpq COPY speedup patch attached. No input buffer logic involved: it just
removes the expensive PQconsumeInput call made on every PQputCopyData call,
leaving parseInput as is, as discussed before.
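
For readers skimming the archive, the change amounts to roughly the
following (a before/after sketch of the hot path inside PQputCopyData();
PQconsumeInput, parseInput and PGconn are real libpq internals, but the
surrounding code is elided -- see the attached patch for the actual diff):

    /* Before: every PQputCopyData() attempted a kernel read. */
    if (!PQconsumeInput(conn))   /* recv() on each call -- expensive */
        return -1;
    parseInput(conn);            /* absorb any async NOTICE/error */

    /* After: just parse what is already buffered; no per-call recv(). */
    parseInput(conn);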

Alon.



libpq_copy.patch
Description: Binary data



[PATCHES] Libpq COPY optimization patch

2006-01-24 Thread Alon Goldshuv
Here is a patch against today's (1/24) code. As discussed in -hackers,
consumeInput/parseInput are no longer called on every single PQputCopyData
call; they are called only when the data is actually sent to the server.

Alon.



pq_put_copy_data.patch
Description: Binary data



Re: [PATCHES] Libpq COPY optimization patch

2006-01-24 Thread Alon Goldshuv
Tom,

>> Here is a patch against today's (1/24) code. As discussed in -hackers,
>> consumeInput/parseInput are no longer called on every single PQputCopyData
>> call; they are called only when the data is actually sent to the server.
>
> This appears to be the exact same patch you sent before.  Did you
> test my suggestion of simply removing the PQconsumeInput call?
> I see no reason to add it inside the loop.

My mistake. I'll make the correction.

I guess that although parseInput is cheap, we could still use a conditional
to check whether data was actually sent, and only then call it (without
PQconsumeInput), instead of calling it every single time PQputCopyData is
called. Any objection to that?

Alon.





Re: [PATCHES] Libpq COPY optimization patch

2006-01-24 Thread Alon Goldshuv
> You mean something like
>
>     if (input-buffer-not-empty)
>         parseInput();
>
> ?  This still bothers me a bit since it's a mixing of logic levels;
> PQputCopyData is an output routine, it shouldn't be getting its fingers
> dirty with input buffer contents.  I'm willing to tolerate this if it
> can be demonstrated that it provides a useful performance gain compared
> to the unconditional parseInput call, but let's see some numbers.

Yes, I understand. We'll see what the performance gain is like and whether
it's worth it; I'll report back.
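
For concreteness, the conditional under discussion would look roughly like
this inside PQputCopyData (a minimal sketch only; inStart and inEnd are the
PGconn fields that delimit unparsed input in libpq):

    /* Only touch the input side when something is actually buffered. */
    if (conn->inEnd > conn->inStart)
        parseInput(conn);    /* absorb any NOTICE/error already read */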

Alon.





Re: [PATCHES] COPY FROM performance improvements

2005-08-09 Thread Alon Goldshuv
I did some performance checks after the recent code commit.

The good news is that the parsing speed of COPY is now MUCH faster, which is
great. It is about 5 times faster - about 100MB/sec on my machine
(previously 20MB/sec at best, usually less).

The better news is that my original patch's parsing speed reaches 120MB/sec,
about 20MB/sec faster than the code that's now in CVS. This can be
significant in the long run and for large data sets. Maybe we can improve
the current code a bit more to reach this number.

I performed these measurements by executing *only the parsing logic* of the
COPY pipeline. All data conversion (FunctionCall3(string, ...)), tuple
handling (heap_formtuple etc.) and insertion were manually disabled. So the
only code measured is reading from disk and parsing to the attribute
level.
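
(The "shunt" here was just a short-circuit placed in the CopyFrom loop.
Conceptually it amounts to something like the following reconstruction;
the symbol name is made up, not the actual hack:

    /*
     * After parsing the line into attributes, skip the expensive
     * stages (input-function calls, tuple formation, heap insert)
     * so that only read + parse time is measured.
     */
    #ifdef MEASURE_PARSE_ONLY
        continue;
    #endif
)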

Cheers,
Alon.





On 8/7/05 1:21 AM, Luke Lonergan [EMAIL PROTECTED] wrote:

> Tom,
>
> On 8/6/05 9:08 PM, Tom Lane [EMAIL PROTECTED] wrote:
>
>> Luke Lonergan [EMAIL PROTECTED] writes:
>>>> I had some difficulty in generating test cases that weren't largely
>>>> I/O-bound, but AFAICT the patch as applied is about the same speed
>>>> as what you submitted.
>>>
>>> You achieve the important objective of knocking the parsing stage down a
>>> lot, but your parsing code is actually about 20% slower than Alon's.
>>
>> I would like to see the exact test case you are using to make this
>> claim; the tests I did suggested my code is the same speed or faster.
>
> I showed mine - you show yours :-)  Apparently our e-mail crossed.
>
>> As best I can tell, my version of CopyReadAttributes is significantly
>> quicker than Alon's, approximately balancing out the fact that my
>> version of CopyReadLine is slower.  I did the latter first, and would
>> now be tempted to rewrite it in the same style as CopyReadAttributes,
>> ie one pass of memory-to-memory copy using pointers rather than buffer
>> indexes.
>
> See previous timings - looks like Alon's parsing is substantially faster.
> However, I'd like him to confirm by running with the shunt placed at
> different stages, in this case between parse and attribute conversion (not
> attribute parse).
>
>> BTW, late today I figured out a way to get fairly reproducible
>> non-I/O-bound numbers about COPY FROM: use a trigger that suppresses
>> the actual inserts, thus:
>>
>>     create table foo ...
>>     create function noway() returns trigger as
>>     'begin return null; end' language plpgsql;
>>     create trigger noway before insert on foo
>>       for each row execute procedure noway();
>>
>> then repeat:
>>
>>     copy foo from '/tmp/foo.data';
>
> Cool!  That's a better way than hacking code and inserting shunts.
>
> Alon will likely hit this tomorrow.
>
> - Luke
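
For reference, the pointer-based, single-pass style Tom describes above
would look roughly like this (a simplified illustration, not copy.c's
actual code; line_buf, attr_buf and delim are assumed names):

    const char *src = line_buf;     /* raw line text, NUL-terminated */
    char       *dst = attr_buf;     /* de-escaped attribute output */

    /* One forward pass, advancing pointers instead of buffer indexes. */
    while (*src && *src != delim)
    {
        if (*src == '\\' && src[1] != '\0')
            src++;                  /* skip the escape, keep the next byte */
        *dst++ = *src++;
    }
    *dst = '\0';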
 





Re: [PERFORM] [PATCHES] COPY FROM performance improvements

2005-08-02 Thread Alon Goldshuv
Tom,

Thanks for pointing it out. I made the small required modifications to match
copy.c version 1.247 and sent it to the -patches list. The new patch is V16.

Alon.


On 8/1/05 7:51 PM, Tom Lane [EMAIL PROTECTED] wrote:

> Alon Goldshuv [EMAIL PROTECTED] writes:
>>> This patch appears to reverse out the most recent committed changes in
>>> copy.c.
>>
>> Which changes do you refer to? I thought I accommodated all the recent
>> changes (I recall some changes to the tupletable/tupleslot interface,
>> HEADER in CVS, and hex escapes and maybe one or 2 more). What did I miss?
>
> The latest touch of copy.c, namely this patch:
>
> 2005-07-10 17:13  tgl
>
>     * doc/src/sgml/ref/create_type.sgml, src/backend/commands/copy.c,
>     src/backend/commands/typecmds.c, src/backend/tcop/fastpath.c,
>     src/backend/tcop/postgres.c, src/backend/utils/adt/arrayfuncs.c,
>     src/backend/utils/adt/date.c, src/backend/utils/adt/numeric.c,
>     src/backend/utils/adt/rowtypes.c,
>     src/backend/utils/adt/timestamp.c, src/backend/utils/adt/varbit.c,
>     src/backend/utils/adt/varchar.c, src/backend/utils/adt/varlena.c,
>     src/backend/utils/mb/mbutils.c, src/include/catalog/catversion.h,
>     src/include/catalog/pg_proc.h,
>     src/test/regress/expected/type_sanity.out,
>     src/test/regress/sql/type_sanity.sql: Change typreceive function
>     API so that receive functions get the same optional arguments as
>     text input functions, ie, typioparam OID and atttypmod.  Make all
>     the datatypes that use typmod enforce it the same way in typreceive
>     as they do in typinput.  This fixes a problem with failure to
>     enforce length restrictions during COPY FROM BINARY.
>
> It was rather obvious, given that the first chunk of the patch backed up
> the file's CVS version stamp from 1.247 to 1.246 :-(
>
> 			regards, tom lane





Re: [PATCHES] COPY FROM performance improvements

2005-07-19 Thread Alon Goldshuv
Hi Mark,

I improved the data *parsing* capabilities of COPY, and didn't touch the
data conversion or data insertion parts of the code. The parsing improvement
will vary greatly depending on the ratio of parsing to converting and
inserting.

Therefore, the speed increase really depends on the nature of your data.

A 100GB file with:
- long data rows (lots of parsing)
- a small number of columns (few attribute conversions per row)
- fewer rows (fewer tuple insertions)

will show the best performance improvements.

However, the same 100GB file with:
- short data rows (minimal parsing)
- a large number of columns (many attribute conversions per row)
- and/or more rows (more tuple insertions)

will show improvements, but not as significant.

In general I'd estimate a 40%-95% improvement in load speed for the 1st case
and 10%-40% for the 2nd, but that also depends on the hardware, disk speed,
etc. This is for TEXT format. As for CSV, it may be faster, but not by as
much as I specified here. BINARY will stay the same as before.

HTH
Alon.






On 7/19/05 12:54 PM, Mark Wong [EMAIL PROTECTED] wrote:

> On Thu, 14 Jul 2005 17:22:18 -0700
> Alon Goldshuv [EMAIL PROTECTED] wrote:
>
>> I revisited my patch and removed the code duplications that were there, and
>> added support for CSV with buffered input, so CSV now runs faster too
>> (although it is not as optimized as the TEXT format parsing). So now
>> TEXT, CSV and BINARY are all parsed in CopyFrom(), like in the original
>> file.
>
> Hi Alon,
>
> I'm curious, what kind of system are you testing this on?  I'm trying to
> load 100GB of data in our dbt3 workload on a 4-way itanium2.  I'm
> interested in the results you would expect.
>
> Mark
 





Re: [PATCHES] COPY FROM performance improvements

2005-07-19 Thread Alon Goldshuv
Mark,

Thanks for the info.

Yes, isolating indexes out of the picture is a good idea for this purpose.

I can't really guess how fast the load rate will be. I don't know how your
system is configured, or all the hardware characteristics (and even if I
knew, I may not be able to guess...). I am pretty confident that the load
will be faster than before; I'll risk that ;-)
Looking at your TPC-H sizes and metadata, I estimate that partsupp, customer
and orders will show the most significant increase in load rate. You could
start with those.

I guess the only way to really know is to try... Load several times with the
existing PG-COPY and then load several times with the patched COPY and
compare. I'll be curious to hear your results.

Thx,
Alon.

 


On 7/19/05 2:37 PM, Mark Wong [EMAIL PROTECTED] wrote:

> Hi Alon,
>
> Yeah, that helps.  I just need to break up my scripts a little to just
> load the data and not build indexes.
>
> Is the following information good enough to give a guess about the data
> I'm loading, if you don't mind? ;)  Here's a link to my script to create
> tables:
> http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
>
> File sizes:
> -rw-r--r--  1 markw 50 2.3G Jul  8 15:03 customer.tbl
> -rw-r--r--  1 markw 50  74G Jul  8 15:03 lineitem.tbl
> -rw-r--r--  1 markw 50 2.1K Jul  8 15:03 nation.tbl
> -rw-r--r--  1 markw 50  17G Jul  8 15:03 orders.tbl
> -rw-r--r--  1 markw 50 2.3G Jul  8 15:03 part.tbl
> -rw-r--r--  1 markw 50  12G Jul  8 15:03 partsupp.tbl
> -rw-r--r--  1 markw 50  391 Jul  8 15:03 region.tbl
> -rw-r--r--  1 markw 50 136M Jul  8 15:03 supplier.tbl
>
> Number of rows:
> # wc -l *.tbl
>   15000000 customer.tbl
>  600037902 lineitem.tbl
>         25 nation.tbl
>  150000000 orders.tbl
>   20000000 part.tbl
>   80000000 partsupp.tbl
>          5 region.tbl
>    1000000 supplier.tbl
>
> Thanks,
> Mark
 
> On Tue, 19 Jul 2005 14:05:56 -0700
> Alon Goldshuv [EMAIL PROTECTED] wrote:
>
>> [snip -- Alon's reply and Mark's original message, quoted in full,
>> appear earlier in this thread]
 





Re: [PATCHES] COPY fast parse patch

2005-06-01 Thread Alon Goldshuv
Neil,

> Right; I really dislike the idea of having two separate code paths for
> COPY. When you say this approach is temporary, are you suggesting that
> you intend to reimplement your changes as improvements/replacements of
> the existing COPY code path rather than as a parallel code path?


My thoughts were: see what the responses are like, and if people think this
is a better way to go, then replace the COPY parsing logic with the new one.
The whole escaping discussion that is going on is something else (escapes
could be implemented either way), but the important thing I am trying to
show is that there is a much faster way to parse the data than doing a
char-by-char fetch-examine-load.
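
To illustrate the distinction (schematic only, not the patch's code;
CopyGetChar is a stand-in name for the old per-character getter):

    /* Old style: fetch and examine one byte at a time. */
    for (;;)
    {
        int c = CopyGetChar(cstate);    /* hypothetical per-char getter */
        if (c == EOF || c == '\n')
            break;
        /* ...examine c for delimiter/escape, append to attribute... */
    }

    /* Buffered style: locate the next line boundary in one call. */
    char *eol = memchr(buf, '\n', bytes_in_buf);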


>> As a part of submitting this patch I also presented an argument for a use
>> of a LOAD DATA command (in the NOLOGGING option thread). The points I made
>> there are closely related to this message. There may be a valid argument
>> that most of the points I raised could be implemented in the COPY code
>> instead of a LOAD DATA command.
>
> I'm definitely not keen to see a new LOAD DATA command.


It seems that most people don't :-) I can see valid arguments for both
having it and not having it.



>> But that may not be a good idea for some and will also be problematic
>> for backwards compatibility.
>
> In what way would the performance improvements to COPY be backward
> incompatible with the existing COPY behaviour?


That comment was with respect to the escape logic. You can regard it as
irrelevant for now, as long as the escape discussion goes on in parallel.

Alon.


> -Neil





