Andrew Dunstan wrote:
Heikki Linnakangas wrote:
It looks like strpbrk() performs poorly:
Yes, not surprising. I just looked at the implementation in glibc, which
I assume you are using, and it seemed rather basic. The one in NetBSD's
libc looks much more efficient.
See
http://sources.redh
Heikki Linnakangas wrote:
Andrew Dunstan wrote:
Another question that occurred to me - did you try using strpbrk() to
look for the next interesting character rather than your homegrown
searcher gadget? If so, how did that perform?
It looks like strpbrk() performs poorly:
Yes, not surpris
Andrew Dunstan wrote:
Another question that occurred to me - did you try using strpbrk() to
look for the next interesting character rather than your homegrown
searcher gadget? If so, how did that perform?
It looks like strpbrk() performs poorly:
unpatched:
testname | min duration
--
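For readers unfamiliar with the call under discussion, here is a minimal sketch of how strpbrk() could locate the next "interesting" character. The helper name next_interesting and the character set (LF, CR, backslash) are assumptions made for illustration; this is not the code that was benchmarked in the thread. Note that strpbrk() needs a NUL-terminated string, which a raw input buffer may not be.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: offset of the first LF, CR, or backslash in the
 * NUL-terminated string s, or -1 if none of them occurs.
 */
static ptrdiff_t
next_interesting(const char *s)
{
    const char *hit;

    assert(s != NULL);

    /* strpbrk returns a pointer to the first byte of s found in the set */
    hit = strpbrk(s, "\n\r\\");
    return hit ? hit - s : -1;
}
```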
Andrew Dunstan wrote:
Heikki Linnakangas wrote:
Andrew Dunstan wrote:
I'm still a bit worried about applying it unless it gets some
adaptive behaviour or something so that we don't cause any serious
performance regressions in some cases.
I'll try to come up with something. At the most cons
Heikki Linnakangas wrote:
Andrew Dunstan wrote:
I'm still a bit worried about applying it unless it gets some
adaptive behaviour or something so that we don't cause any serious
performance regressions in some cases.
I'll try to come up with something. At the most conservative end, we
could
Greg Smith wrote:
On Thu, 6 Mar 2008, Heikki Linnakangas wrote:
At the most conservative end, we could fall back to the current
method on the first escape, quote or backslash character.
I would just count the number of escaped/quote characters on each
line, and then at the end of the line
On Thu, 6 Mar 2008, Heikki Linnakangas wrote:
At the most conservative end, we could fall back to the current method
on the first escape, quote or backslash character.
I would just count the number of escaped/quote characters on each line,
and then at the end of the line switch modes between
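The per-line counting idea might be sketched roughly as follows. Everything here is an assumption for illustration: the names ScanMode and next_scan_mode are hypothetical, and the 1-in-8 threshold is borrowed from the breakeven figure mentioned elsewhere in the thread rather than from this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scan modes: memchr-based fast path vs. byte-at-a-time loop */
typedef enum
{
    SCAN_MEMCHR,
    SCAN_BYTEWISE
} ScanMode;

/*
 * Count the quote/escape characters seen on the line just read and pick
 * the mode for the next line. Assumed threshold: more than 1 special
 * character per 8 bytes favours the byte-at-a-time loop.
 */
static ScanMode
next_scan_mode(const char *line, size_t len, char quote, char escape)
{
    size_t specials = 0;
    size_t i;

    assert(line != NULL);

    for (i = 0; i < len; i++)
    {
        if (line[i] == quote || line[i] == escape)
            specials++;
    }

    return (specials * 8 > len) ? SCAN_BYTEWISE : SCAN_MEMCHR;
}
```

A caller would invoke this once per completed line and use the result for the following line, so a file with only occasional heavily quoted lines pays the slower path only near those lines.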
Andrew Dunstan wrote:
Heikki Linnakangas wrote:
Andrew Dunstan wrote:
I'm still a bit worried about applying it unless it gets some
adaptive behaviour or something so that we don't cause any serious
performance regressions in some cases.
I'll try to come up with something. At the most conser
Heikki Linnakangas wrote:
Andrew Dunstan wrote:
I'm still a bit worried about applying it unless it gets some
adaptive behaviour or something so that we don't cause any serious
performance regressions in some cases.
I'll try to come up with something. At the most conservative end, we
could
Andrew Dunstan wrote:
I'm still a bit worried about applying it unless it gets some adaptive
behaviour or something so that we don't cause any serious performance
regressions in some cases.
I'll try to come up with something. At the most conservative end, we
could fall back to the current met
Tom Lane wrote:
BTW, I notice that the code allows CSV escape and quote characters that
have the high bit set (in single-byte server encodings that is). Is
this a good idea? It seems like such are extremely unlikely to be the
same in two different encodings. Maybe we should restrict to the ASC
Heikki Linnakangas wrote:
Andrew Dunstan wrote:
Heikki Linnakangas wrote:
Another update attached: It occurred to me that the memchr approach is
only safe for server encodings, where the non-first bytes of a
multi-byte character always have the hi-bit set.
We currently make the following
Tom Lane wrote:
BTW, I notice that the code allows CSV escape and quote characters that
have the high bit set (in single-byte server encodings that is). Is
this a good idea? It seems like such are extremely unlikely to be the
same in two different encodings. Maybe we should restrict to the A
"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:
> Andrew Dunstan wrote:
>> We currently make the following assumption in the code:
>>
>> * These four characters, and the CSV escape and quote characters, are
>> * assumed the same in frontend and backend encodings.
>>
>> The four characters are th
Andrew Dunstan wrote:
Heikki Linnakangas wrote:
Another update attached: It occurred to me that the memchr approach is
only safe for server encodings, where the non-first bytes of a
multi-byte character always have the hi-bit set.
We currently make the following assumption in the code:
Heikki Linnakangas wrote:
Heikki Linnakangas wrote:
Heikki Linnakangas wrote:
Attached is a patch that modifies CopyReadLineText so that it uses
memchr to speed up the scan. The nice thing about memchr is that we
can take advantage of any clever optimizations that might be in libc
or compil
Heikki Linnakangas wrote:
So the overhead of using memchr slows us down if there's a lot of
escape or quote characters. The breakeven point seems to be about 1 in
8 characters. I'm not sure if that's a good tradeoff or not...
How about we test the first buffer read in from the file and
Heikki Linnakangas wrote:
Heikki Linnakangas wrote:
Attached is a patch that modifies CopyReadLineText so that it uses
memchr to speed up the scan. The nice thing about memchr is that we
can take advantage of any clever optimizations that might be in libc
or compiler.
Here's an updated versi
Heikki Linnakangas wrote:
I still need to test the worst-case performance, with input that has a
lot of escapes.
Ok, I've done some more performance testing with this. I tested COPY
FROM with a table with a single "text" column. There were a million rows
in the table, with a 1000 character lon
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
---
He
Heikki Linnakangas wrote:
Attached is a patch that modifies CopyReadLineText so that it uses
memchr to speed up the scan. The nice thing about memchr is that we can
take advantage of any clever optimizations that might be in libc or
compiler.
Here's an updated version of the patch. The princi
> Subject: [PATCHES] CopyReadLineText optimization
>
> The purpose of CopyReadLineText is to scan the input buffer,
> and find the next newline, taking into account any escape
> characters. It currently operates in a loop, one byte at a
> time, searching for LF, CR, o
The purpose of CopyReadLineText is to scan the input buffer, and find
the next newline, taking into account any escape characters. It
currently operates in a loop, one byte at a time, searching for LF, CR,
or a backslash. That's a bit slow: I've been running oprofile on COPY,
and I've seen Copy
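To make the idea concrete, here is a minimal sketch of a memchr-based scan for the end of a line. It is not the patch itself: the function name find_line_end is hypothetical, and the sketch is simplified to text-mode input where only LF ends a line and a backslash escapes the byte that follows it (CR handling, the end-of-data marker, and multi-byte encodings are ignored).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: offset of the first unescaped '\n' in buf[0..len),
 * or -1 if the buffer holds no complete line yet.
 */
static ptrdiff_t
find_line_end(const char *buf, size_t len)
{
    const char *end = buf + len;
    const char *p = buf;
    const char *nl = NULL;

    assert(buf != NULL);

    while (p < end)
    {
        const char *bs;

        /* (Re)locate the next newline only when the cached one is behind p */
        if (nl == NULL || nl < p)
        {
            nl = memchr(p, '\n', (size_t) (end - p));
            if (nl == NULL)
                return -1;      /* no newline left: caller must read more */
        }

        /* Is there a backslash between p and that newline? */
        bs = memchr(p, '\\', (size_t) (nl - p));
        if (bs == NULL)
            return nl - buf;    /* newline is unescaped: end of line */

        /* Skip the backslash and the byte it escapes, then keep scanning */
        p = bs + 2;
    }
    return -1;
}
```

On input with few backslashes, almost all of the work happens inside memchr, which is where any libc or compiler optimization would pay off; on input dense with backslashes, the loop makes many short memchr calls, which is the overhead the later benchmarks in this thread are concerned with.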
23 matches