Re: [PATCHES] COPY FROM performance improvements

2005-08-06 Thread Luke Lonergan
Tom,

On 8/6/05 9:08 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:

> "Luke Lonergan" <[EMAIL PROTECTED]> writes:
>>> I had some difficulty in generating test cases that weren't largely
>>> I/O-bound, but AFAICT the patch as applied is about the same speed
>>> as what you submitted.
> 
>> You achieve the important objective of knocking the parsing stage down a
>> lot, but your parsing code is actually about 20% slower than Alon's.
> 
> I would like to see the exact test case you are using to make this
> claim; the tests I did suggested my code is the same speed or faster.

I showed mine - you show yours :-)  Apparently our e-mail crossed.
  
> As best I can tell, my version of CopyReadAttributes is significantly
> quicker than Alon's, approximately balancing out the fact that my
> version of CopyReadLine is slower.  I did the latter first, and would
> now be tempted to rewrite it in the same style as CopyReadAttributes,
> ie one pass of memory-to-memory copy using pointers rather than buffer
> indexes.

See previous timings - looks like Alon's parsing is substantially faster.
However, I'd like him to confirm by running with the "shunt" placed at
different stages, in this case between parse and attribute conversion (not
attribute parse).
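
(To be concrete, by a "shunt" I mean something like the following --
an illustrative sketch, not our actual test harness; the stage
functions are made-up stand-ins.  The build with the shunt defined
skips everything past the chosen stage, so the difference in total
run time isolates the cost of the later stages:)

/* Illustrative timing "shunt": compile once normally and once with
 * -DSHUNT_AFTER_PARSE; the difference in total run time is the cost
 * of the stages the shunt skips. */
#include <stdbool.h>

static int  nlines = 3;                  /* stand-in for real input */
static bool read_line(void)     { return nlines-- > 0; }
static void parse_attrs(void)   { /* split line into fields */ }
static void convert_attrs(void) { /* datatype input conversion */ }
static void insert_row(void)    { /* form and store the tuple */ }

int
main(void)
{
    while (read_line())
    {
        parse_attrs();
#ifdef SHUNT_AFTER_PARSE
        continue;               /* shunt: later stages never run */
#endif
        convert_attrs();
        insert_row();
    }
    return 0;
}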
 
> BTW, late today I figured out a way to get fairly reproducible
> non-I/O-bound numbers about COPY FROM: use a trigger that suppresses
> the actual inserts, thus:
> 
> create table foo ...
> create function noway() returns trigger as
> 'begin return null; end' language plpgsql;
> create trigger noway before insert on foo
>   for each row execute procedure noway();
> then repeat:
> copy foo from '/tmp/foo.data';

Cool!  That's a better way than hacking code and inserting shunts.
 
Alon will likely hit this tomorrow.

- Luke



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] COPY FROM performance improvements

2005-08-06 Thread Tom Lane
"Luke Lonergan" <[EMAIL PROTECTED]> writes:
>> I had some difficulty in generating test cases that weren't largely
>> I/O-bound, but AFAICT the patch as applied is about the same speed
>> as what you submitted.

> You achieve the important objective of knocking the parsing stage down a
> lot, but your parsing code is actually about 20% slower than Alon's.

I would like to see the exact test case you are using to make this
claim; the tests I did suggested my code is the same speed or faster.
The particular test case I was using was the "tenk1" data from the
regression database, duplicated out to about 600K rows so as to run
long enough to measure with some degree of repeatability.

As best I can tell, my version of CopyReadAttributes is significantly
quicker than Alon's, approximately balancing out the fact that my
version of CopyReadLine is slower.  I did the latter first, and would
now be tempted to rewrite it in the same style as CopyReadAttributes,
ie one pass of memory-to-memory copy using pointers rather than buffer
indexes.
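
(For illustration, here is a minimal sketch of what I mean by the
pointer-style, one-pass approach -- this is not the actual copy.c
code, and the names and escape handling are simplified:)

/* One pass over the line: copy bytes out with pointers while noting
 * where each field starts, instead of re-scanning via buffer indexes.
 * Escapes are handled only crudely here (a backslash quotes the next
 * byte literally); real COPY does much more. */
#include <stdio.h>

#define MAXFIELDS 64

static int
split_attrs(const char *line, char delim, char *out, const char *fields[])
{
    const char *src = line;
    char       *dst = out;
    int         nfields = 0;

    fields[nfields++] = dst;
    while (*src)
    {
        if (*src == delim)
        {
            *dst++ = '\0';              /* terminate current field */
            fields[nfields++] = dst;    /* next field starts here */
            src++;
        }
        else if (*src == '\\' && src[1] != '\0')
        {
            src++;                      /* skip the backslash ... */
            *dst++ = *src++;            /* ... copy quoted byte as-is */
        }
        else
            *dst++ = *src++;
    }
    *dst = '\0';
    return nfields;
}

int
main(void)
{
    const char *fields[MAXFIELDS];
    char        out[256];
    int         n = split_attrs("1\thello\tworld", '\t', out, fields);

    for (int i = 0; i < n; i++)
        printf("field %d: \"%s\"\n", i, fields[i]);
    return 0;
}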

BTW, late today I figured out a way to get fairly reproducible
non-I/O-bound numbers about COPY FROM: use a trigger that suppresses
the actual inserts, thus:

create table foo ...
create function noway() returns trigger as
'begin return null; end' language plpgsql;
create trigger noway before insert on foo
  for each row execute procedure noway();
then repeat:
copy foo from '/tmp/foo.data';

If the source file is not too large to fit in kernel disk cache, then
after the first iteration there is no I/O at all.  I got numbers
that were reproducible within less than 1%, as opposed to 5% or more
variation when the thing was partially I/O bound.  Pretty useless in the
real world, of course, but great for timing COPY's data-pushing.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] COPY FROM performance improvements

2005-08-06 Thread Luke Lonergan
Tom,

My direct e-mails to you are apparently blocked, so I'll send this to the
list.

I've attached the case we use for load performance testing, with the data
generator modified to produce a single row version of the dataset.

I do believe that you/we will need to invert the processing loop to get the
maximum parsing speed.  We will be implementing much higher loading speeds,
which will require that inversion if we're to compete with Oracle, Netezza,
and Teradata, so we'll have to work this out in the best interests of our
users.

- Luke



IVP.tgz
Description: Binary data

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] COPY FROM performance improvements

2005-08-06 Thread Luke Lonergan
Tom,

The previous timings were for a table with 15 columns of mixed type.  We
also test with 1 column to make the parsing overhead more apparent.  In the
case of 1 text column with 145MB of input data:

Your patch:
  Time: 6612.599 ms

Alon's patch:
  Time: 6119.244 ms


Alon's patch is 7.5% faster here, where it was only 3% faster in the
15-column case.  This is consistent with a large difference in parsing
speed between your approach and Alon's.

I'm pretty sure that the "mistake" you refer to is responsible for the speed
improvement, and was deliberately chosen to minimize memory copies, etc.
Given that we're looking ahead to getting much higher speeds, approaching
current high performance disk speeds, we've been looking more closely at the
parsing speed.  It comes down to a tradeoff between elegant code and speed.

We'll prove it in lab tests soon, where we measure the parsing rate
directly, but these experiments show it clearly, though indirectly.

- Luke



On 8/6/05 2:04 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:

> "Alon Goldshuv" <[EMAIL PROTECTED]> writes:
>> New patch attached. It includes very minor changes. These are changes that
>> were committed to CVS 3 weeks ago (copy.c 1.247) which I missed in the
>> previous patch.
> 
> I've applied this with (rather extensive) revisions.  I didn't like what
> you had done with the control structure --- loading the input buffer
> only at the outermost loop level was a bad design choice IMHO.  You had
> sprinkled the code with an unreasonable number of special cases in order
> to try to cope with the effects of that mistake, but there were lots
> of problems still left.  Some of the bugs I noticed:
> 
> * Broke old-protocol COPY, since that has no provision for stopping at
> the EOF marker except by parsing the data carefully to start with.  The
> backend would just hang up unless the total data size chanced to be a
> multiple of 64K.
> 
> * Subtle change in interpretation of \. EOF marker (the existing code
> will recognize it even when not at start of line).
> 
> * Seems to have thrown away detection of newline format discrepancies.
> 
> * Fails for zero-column tables.
> 
> * Broke display of column values during error context callback (would
> always show the first column contents no matter which one is being
> complained of).
> 
> * DetectLineEnd mistakenly assumes CR mode if very last character of first
> bufferload is CR; need to reserve judgment until next char is available.
> 
> * DetectLineEnd fails to account for backslashed control characters,
> so it will e.g. accept \ followed by \n as determining the newline
> style.
> 
> * Fails to apply encoding conversion if first line exceeds copy buf
> size, because when DetectLineEnd fails the quick-exit path doesn't do
> it.
> 
> * There seem to be several bugs associated with the fact that input_buf[]
> always has 64K of data in it even when EOF has been reached on the
> input.  One example:
> echo -n 123 >zzz1
> psql> create temp table t1(f1 text);
> psql> copy t1 from '/home/tgl/zzz1';
> psql> select * from t1;
> hmm ... where'd that 64K of whitespace come from?
> 
> I rewrote the patch in a way that retained most of the speedups without
> changing the basic control structure (except for replacing multiple
> CopyReadAttribute calls with one CopyReadAttributes call per line).
> 
> I had some difficulty in generating test cases that weren't largely
> I/O-bound, but AFAICT the patch as applied is about the same speed
> as what you submitted.
> 
> regards, tom lane
> 
> ---(end of broadcast)---
> TIP 2: Don't 'kill -9' the postmaster
> 



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] COPY FROM performance improvements

2005-08-06 Thread Luke Lonergan
Tom,

Thanks for finding the bugs and reworking things.

> I had some difficulty in generating test cases that weren't largely
> I/O-bound, but AFAICT the patch as applied is about the same speed
> as what you submitted.

You achieve the important objective of knocking the parsing stage down a
lot, but your parsing code is actually about 20% slower than Alon's.

Before your patch:
  Time: 14205.606 ms

With your patch:
  Time: 10565.374 ms

With Alon's patch:
  Time: 10289.845 ms

The parsing part of the code in your version is slower, but as a percentage
of the total it's hidden. The loss of 0.3 seconds on 143MB means:

- If parsing takes a total of 0.9 seconds, the parsing rate is 160MB/s
(143/0.9)

- If we add another 0.3 seconds to parsing to bring it to 1.2, then the
parsing rate becomes 120MB/s

When we improve the next stages of the processing (attribute conversion,
writing to disk), this will stand out a lot more.  Our objective is to get
the COPY rate *much* faster than the current poky rate of 14MB/s (after this
patch).

- Luke

On 8/6/05 2:04 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:

> "Alon Goldshuv" <[EMAIL PROTECTED]> writes:
>> New patch attached. It includes very minor changes. These are changes that
>> were committed to CVS 3 weeks ago (copy.c 1.247) which I missed in the
>> previous patch.
> 
> I've applied this with (rather extensive) revisions.  I didn't like what
> you had done with the control structure --- loading the input buffer
> only at the outermost loop level was a bad design choice IMHO.  You had
> sprinkled the code with an unreasonable number of special cases in order
> to try to cope with the effects of that mistake, but there were lots
> of problems still left.  Some of the bugs I noticed:
> 
> * Broke old-protocol COPY, since that has no provision for stopping at
> the EOF marker except by parsing the data carefully to start with.  The
> backend would just hang up unless the total data size chanced to be a
> multiple of 64K.
> 
> * Subtle change in interpretation of \. EOF marker (the existing code
> will recognize it even when not at start of line).
> 
> * Seems to have thrown away detection of newline format discrepancies.
> 
> * Fails for zero-column tables.
> 
> * Broke display of column values during error context callback (would
> always show the first column contents no matter which one is being
> complained of).
> 
> * DetectLineEnd mistakenly assumes CR mode if very last character of first
> bufferload is CR; need to reserve judgment until next char is available.
> 
> * DetectLineEnd fails to account for backslashed control characters,
> so it will e.g. accept \ followed by \n as determining the newline
> style.
> 
> * Fails to apply encoding conversion if first line exceeds copy buf
> size, because when DetectLineEnd fails the quick-exit path doesn't do
> it.
> 
> * There seem to be several bugs associated with the fact that input_buf[]
> always has 64K of data in it even when EOF has been reached on the
> input.  One example:
> echo -n 123 >zzz1
> psql> create temp table t1(f1 text);
> psql> copy t1 from '/home/tgl/zzz1';
> psql> select * from t1;
> hmm ... where'd that 64K of whitespace come from?
> 
> I rewrote the patch in a way that retained most of the speedups without
> changing the basic control structure (except for replacing multiple
> CopyReadAttribute calls with one CopyReadAttributes call per line).
> 
> I had some difficulty in generating test cases that weren't largely
> I/O-bound, but AFAICT the patch as applied is about the same speed
> as what you submitted.
> 
> regards, tom lane
> 
> ---(end of broadcast)---
> TIP 2: Don't 'kill -9' the postmaster
> 



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] COPY FROM performance improvements

2005-08-06 Thread Tom Lane
"Alon Goldshuv" <[EMAIL PROTECTED]> writes:
> New patch attached. It includes very minor changes. These are changes that
> were committed to CVS 3 weeks ago (copy.c 1.247) which I missed in the
> previous patch.

I've applied this with (rather extensive) revisions.  I didn't like what
you had done with the control structure --- loading the input buffer
only at the outermost loop level was a bad design choice IMHO.  You had
sprinkled the code with an unreasonable number of special cases in order
to try to cope with the effects of that mistake, but there were lots
of problems still left.  Some of the bugs I noticed:

* Broke old-protocol COPY, since that has no provision for stopping at
the EOF marker except by parsing the data carefully to start with.  The
backend would just hang up unless the total data size chanced to be a
multiple of 64K.

* Subtle change in interpretation of \. EOF marker (the existing code
will recognize it even when not at start of line).

* Seems to have thrown away detection of newline format discrepancies.

* Fails for zero-column tables.

* Broke display of column values during error context callback (would
always show the first column contents no matter which one is being
complained of).

* DetectLineEnd mistakenly assumes CR mode if very last character of first
bufferload is CR; need to reserve judgment until next char is available.

* DetectLineEnd fails to account for backslashed control characters,
so it will e.g. accept \ followed by \n as determining the newline
style.

* Fails to apply encoding conversion if first line exceeds copy buf
size, because when DetectLineEnd fails the quick-exit path doesn't do
it.

* There seem to be several bugs associated with the fact that input_buf[]
always has 64K of data in it even when EOF has been reached on the
input.  One example:
echo -n 123 >zzz1
psql> create temp table t1(f1 text);
psql> copy t1 from '/home/tgl/zzz1';
psql> select * from t1;
hmm ... where'd that 64K of whitespace come from?

I rewrote the patch in a way that retained most of the speedups without
changing the basic control structure (except for replacing multiple
CopyReadAttribute calls with one CopyReadAttributes call per line).

I had some difficulty in generating test cases that weren't largely
I/O-bound, but AFAICT the patch as applied is about the same speed
as what you submitted.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-08-06 Thread Mark Wong
Here are comments that Daniel McNeil made earlier, which I neglected
to forward until now.  I've cc'ed him and Mark Havercamp, whom some of
you got to meet the other day.

Mark

-

With O_DIRECT on Linux, when the write() returns, the I/O has been
transferred to the disk.

Normally, this I/O will be DMAed directly from user space to the
device.  The current exception is an O_DIRECT write to a hole in a
file.  (If a program does a truncate() or lseek()/write() that makes a
file larger, the file system does not allocate space between the old
end of file and the new end of file.)  An O_DIRECT write to a hole like
this requires the file system to allocate space, and there is a race
between the O_DIRECT write doing the allocation and then writing to
initialize the newly allocated data, and any other process that
attempts a buffered (page cache) read of the same area of the file --
the read could see data from the allocated region before the O_DIRECT
write() had initialized it.  The fix in Linux is for the O_DIRECT
write() to fall back to buffered I/O for the write() and then flush the
data from the page cache to the disk.
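
(For reference, a hole like that can be made with something as simple
as the following -- an illustrative sketch only:)

/* ftruncate() extends the file without allocating blocks for the
 * gap, leaving a hole between the old and new end of file. */
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("sparse.dat", O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        return 1;
    write(fd, "x", 1);          /* one byte of real data */
    ftruncate(fd, 1 << 20);     /* extend to 1MB; the rest is a hole */
    close(fd);
    return 0;
}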

A write() with O_DIRECT only means the data has been transferred to
the disk.  Depending on the file system and mount options, it does not
mean that the metadata for the file has been written to disk (see the
fsync man page).  fsync() will guarantee that both the data and the
metadata have been written to disk.

Lastly, if a disk has a write-back cache, an O_DIRECT write() does not
guarantee that the disk has put the data on the physical media.
I think some of the journaling file systems now support I/O barriers
on commit, which will flush the disk's write-back cache.  (I'm still
looking at the kernel code to see how this is done.)

Conclusion:

O_DIRECT + fsync() can make sense.  It avoids copying the data
to the page cache before it is written, and it also guarantees
that the file's metadata is written to disk.  It also
prevents the page cache from filling up with write data that
will never be read (I assume it is only read if a recovery
is necessary -- which should be rare).  It can also
help disks with write-back caches when using a journaling
file system that uses I/O barriers.  You would want to use
large writes, since the kernel page cache won't be batching up
multiple pages for you.
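
(A minimal sketch of the O_DIRECT + fsync() combination, assuming
Linux -- illustrative only, with most error handling omitted and a
made-up file name:)

/* O_DIRECT requires the buffer, file offset, and transfer size to be
 * aligned, typically to the logical block size (512 or 4096 bytes). */
#define _GNU_SOURCE              /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096

int
main(void)
{
    void *buf;
    int   fd;

    /* plain malloc() won't do: O_DIRECT needs an aligned buffer */
    if (posix_memalign(&buf, BLKSZ, BLKSZ) != 0)
        return 1;
    memset(buf, 'x', BLKSZ);

    fd = open("walfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return 1;

    if (write(fd, buf, BLKSZ) != BLKSZ)  /* data goes straight to disk */
        return 1;
    if (fsync(fd) != 0)                  /* metadata still needs fsync */
        return 1;

    close(fd);
    free(buf);
    return 0;
}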

I need to look at the kernel code more to comment on O_DIRECT with
O_SYNC.

Questions:

Does the database transaction logger preallocate the log file?

Does the logger care about the order in which each write hits the disk?

Now someone else can comment on my comments.

Daniel

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] default tablespace for roles

2005-08-06 Thread Tom Lane
Laszlo Hornyak <[EMAIL PROTECTED]> writes:
> It would be nice to have tablespaces for each user. This is a small
> patch that does the job.

Isn't this entirely redundant with the existing default_tablespace GUC
variable, ie,

ALTER USER foo SET default_tablespace = whatever;

The patch's behavior of overriding session-local settings of
default_tablespace seems quite undesirable in any case.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[PATCHES] default tablespace for roles

2005-08-06 Thread Laszlo Hornyak

Hi!

It would be nice to have tablespaces for each user.  This is a small
patch that does the job.
- grammar file: "alter|create user ... with default tablespace 'tblspc'"
added
- new column in pg_authid: roltblspc, an Oid referring to pg_tablespace
- at alter/create role, the necessary default tablespace modifications,
plus a check that the tablespace exists
- pg_dumpall outputs a script that alters users with their default
tablespaces after dumping users and tablespaces


(for 8.2, if you like it)
Regards,
Laszlo



Index: src/include/catalog/pg_authid.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_authid.h,v
retrieving revision 1.3
diff -r1.3 pg_authid.h
51a52
>     Oid         roltblspc;      /* default tablespace oid */
74c75
< #define Natts_pg_authid                 11
---
> #define Natts_pg_authid                 12
82,85c83,87
< #define Anum_pg_authid_rolconnlimit     8
< #define Anum_pg_authid_rolpassword      9
< #define Anum_pg_authid_rolvaliduntil    10
< #define Anum_pg_authid_rolconfig        11
---
> #define Anum_pg_authid_roltblspc        8
> #define Anum_pg_authid_rolconnlimit     9
> #define Anum_pg_authid_rolpassword      10
> #define Anum_pg_authid_rolvaliduntil    11
> #define Anum_pg_authid_rolconfig        12
94c96
< DATA(insert OID = 10 ( "POSTGRES" t t t t t t -1 _null_ _null_ _null_ ));
---
> DATA(insert OID = 10 ( "POSTGRES" t t t t t t 1663 -1 _null_ _null_ _null_ ));

Index: src/bin/pg_dump/pg_dumpall.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/bin/pg_dump/pg_dumpall.c,v
retrieving revision 1.66
diff -r1.66 pg_dumpall.c
718a719,735
>
>     res = executeQuery(conn, "SELECT rolname, spcname from pg_catalog.pg_authid inner join pg_catalog.pg_tablespace on pg_tablespace.oid = pg_authid.roltblspc");
>     if (PQntuples(res) > 0)
>         printf("--\n-- User default tablespaces\n--\n\n");
>
>     for (i = 0; i < PQntuples(res); i++)
>     {
>         char       *rolname = PQgetvalue(res, i, 0);
>         char       *spcname = PQgetvalue(res, i, 1);
>
>         printf("ALTER ROLE %s WITH DEFAULT TABLESPACE '%s';\n", rolname, spcname);
>
>     }
>
>     PQclear(res);
>     printf("\n\n");
>

Index: src/backend/parser/gram.y
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/parser/gram.y,v
retrieving revision 2.507
diff -r2.507 gram.y
690a691,694
>             | DEFAULT TABLESPACE Sconst
>                 {
>                     $$ = makeDefElem("roltblspc", (Node *)makeString($3));
>                 }

Index: src/backend/commands/user.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/user.c,v
retrieving revision 1.160
diff -r1.160 user.c
93a94
>     char       *tblspc = NULL;
104a106
>     DefElem    *dtblspc = NULL;
123a126,133
>         if (strcmp(defel->defname, "roltblspc") == 0) {
>             if (dtblspc)
>                 ereport(ERROR,
>                         (errcode(ERRCODE_SYNTAX_ERROR),
>                          errmsg("conflicting or redundant options")));
>             dtblspc = defel;
>         } else
>
227a238,239
>     if (dtblspc)
>         tblspc = strVal(dtblspc->arg);
307a320,328
>     if (tblspc) {
>         Oid         userTblSpc;
>         userTblSpc = get_tablespace_oid(tblspc);
>         if (!OidIsValid(userTblSpc))
>             elog(ERROR, "Tablespace %s does not exist", tblspc);
>         new_record[Anum_pg_authid_roltblspc - 1] = DatumGetObjectId(userTblSpc);
>     } else {
>         new_record_nulls[Anum_pg_authid_roltblspc - 1] = 'n';
>     }
419a441
>     char       *tblspc = NULL;
429a452
>     DefElem    *dtblspc = NULL;
435a459,466
>         if (strcmp(defel->defname, "roltblspc") == 0) {
>             if (dtblspc)
>                 ereport(ERROR,
>                         (errcode(ERRCODE_SYNTAX_ERROR),
>                          errmsg("conflicting or redundant options")));
>             dtblspc = defel;
>         } else
>
538c569,571
<
---
>     if (dtblspc) {
>         tblspc = strVal(dtblspc->arg);
>     }
653a687,692
>     if (tblspc)
>     {
>         new_record[Anum_pg_authid_roltblspc - 1] = get_tablespace_oid(tblspc);
>         new_record_repl[Anum_pg_authid_roltblspc - 1] = 'r';
>     }
>

Index: src/backend/commands/tablespace.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/tablespace.c,v
retrieving revision 1.26
diff -r1.26 tablespace.c
56a57
> #include "catalog/pg_authid.h"
888a890,902
>     Oid         userDefTblSpc;
>     HeapTuple   authId;
>     Oid         sessionUser;
>
>     /* Get the session user's default tablespace */
>     sessionUser = GetSessionUserId();
>     authId = SearchSysCache(AUTHOID, ObjectIdGetDatum(sessionUser), 0, 0, 0);
>     userDefTblSpc = ((Form_pg_authid) GETSTRUCT(