[PATCHES] to_date() validation

2008-08-29 Thread Brendan Jurd
Hi all,

Well, it's been a long time coming, but I've finally got a patch to
improve the error handling of to_date() and to_timestamp(), as
previously discussed on -hackers. [1][2]

The patch is against HEAD, and compiles cleanly and passes all
regression tests on gentoo amd64.

I did some testing to see whether the extra error checks had an
adverse effect on performance.  It turns out that my patch actually
performs materially *better*, somehow.  About 14.5% better over 1M
calls to to_date in a single statement.  Go figure (C compiler
optimisations are completely arcane to me).

What the patch does


With this patch I've tried to make DCH_from_char a bit smarter with
regard to parsing and validation of the user input.  The goal here is
that if the user makes a mistake and enters something invalid,
to_date() will throw an meaningful error instead of just acting like
everything is okay and then producing an absurd answer.

I've modified the code to error out usefully in a few different scenarios:

 * Non-integer value for an integer field
 * Out-of-range value for an integer field
 * Not enough characters in the input string to provide for all the
format fields
 * Two conflicting values given for a field
 * Illegal mixture of date conventions (ISO week date and Gregorian)

I've included new regression tests to check that these errors are
occurring when appropriate.

How I went about it


In order to detect an illegal mixture of Gregorian and ISO week date
conventions, the code needs to know which fomatting fields are
Gregorian, which are ISO WD and which are date-convention-neutral.  To
get this working I added a new enum called FromCharDateMode, and added
a field to both KeyWord and TmFromChar to hold the date mode.  In
KeyWord, it tells you which convention the keyword belongs to ( is
Gregorian, IYYY is ISO WD and HH24 has nothing to do with dates).  In
TmFromChar it keeps track of which convention is currently in use.

Because TmFromChar now tracks the date mode independently, it is not
necessary to maintain separate fields for the ISO WD values, so
'iyear' and 'iw' are removed.

In order to get some consistent treatment on pulling integers out of a
string and stuffing them into a TmFromChar, I had to refactor
DCH_from_char somewhat.  The function is basically a giant switch
statement, with branches for each of the formatting fields.  In HEAD
these branches are highly duplicative.  The integer fields all perform
a sscanf() on the source string, putting the result straight into the
TmFromChar struct, and then skipping over the number of characters
consumed by sscanf().  I moved this logic into functions:

 * from_char_parse_int() uses strtol() rather than sscanf() to parse
an integer from the string, and produces meaningful errors if
something is wrong with the input,
 * from_char_set_int() saves an integer into a TmFromChar field, but
throws an error if the field has already been set to some other
non-zero value.

What the patch doesn't do


It doesn't consider to_number() at all.  I don't have any interest in
the number formatting stuff.

There is plenty of room to educate do_to_timestamp() further about
dates; it's still very naive.  Currently you can do something like
to_date('2008-50', '-MM') and it won't even bat an eyelid.  It'll
just advance the date by 50 months.  I suppose you could consider that
a bug or a feature, depending on your point of view.  Personally, I
think of it as a bug.  If you're giving a month value outside of 1-12
to to_date(), chances are good that it's a user error in the query,
not a deliberate attempt to perform some date arithmetic.

do_to_timestamp() also doesn't try to detect logical mistakes like
'-MM-DD DDD' where the DDD part conflicts with the MM-DD part.  I
have some thoughts about how I would improve on that, but I think it's
best left for a separate patch.

The patch has been added to the September commitfest.  Thanks for your time.

Cheers,
BJ

[1] http://archives.postgresql.org/pgsql-hackers/2007-02/msg00915.php
[2] http://archives.postgresql.org/message-id/[EMAIL PROTECTED]


to-date-validation-1.diff.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


[PATCHES] fixing bug in combocid.c

2008-08-29 Thread Karl Schnaitter

This patch is for a bug in backend/utils/time/combocid.c.

In HeapTupleHeaderAdjustCmax, the code makes the incorrect assumption 
that the raw command id corresponds to cmin, when the raw command id can 
in fact be a combo cid. It needs to be fixed to call 
HeapTupleHeaderGetCmin to get cmin. That's all there is to my patch, but 
read on for more details.


One thing that can happen is that multiple subtransactions may delete a 
tuple and each abort. Each subtransaction (except the first one) will 
store a combo cid that uses another combo cid as cmin. Then when cmin is 
accessed by a later operation, HeapTupleHeaderGetCmin will return the 
previous combo cid. In unlucky situations, the bogus cmin will be so 
high that it looks like the tuple was inserted in the future.


I have pasted an example below to show how things can go wrong. The 
second SELECT leaves out some tuples because their stored cmin is 
actually a large combo cid. The third SELECT gives correct results 
because the inserting transaction committed, so the command id is 
ignored. The script file with these commands in in the attachment, and I 
think it could make for a valuable regression test.


Side note: Another idea for a fix would be to have tqual.c change a 
combocid to a simple cmin when it does a time qual check and sees that 
the deleting subtransaction aborted. This is no good because it requires 
a write lock to change two header fields atomically, namely the 
HEAP_COMBOCID hint and t_cid. There are probably other issues, but this 
seems like a show-stopper.


The one-line patch is in the attachment. It gives correct behavior on my 
test case. This is my first patch; I hope I did it right!


Thanks!
Karl Schnaitter

test=# CREATE TABLE tab(a INT);
CREATE TABLE
test=# BEGIN;
BEGIN
test=# INSERT INTO tab VALUES(1);
INSERT 0 1
test=# INSERT INTO tab VALUES(2);
INSERT 0 1
test=# INSERT INTO tab VALUES(3);
INSERT 0 1
test=# SAVEPOINT s; DELETE FROM tab; ROLLBACK TO SAVEPOINT s;
SAVEPOINT
DELETE 3
ROLLBACK
test=# SELECT a FROM tab;
a
---
1
2
3
(3 rows)

test=# SAVEPOINT s; DELETE FROM tab; ROLLBACK TO SAVEPOINT s;
SAVEPOINT
DELETE 3
ROLLBACK
test=# SAVEPOINT s; DELETE FROM tab; ROLLBACK TO SAVEPOINT s;
SAVEPOINT
DELETE 3
ROLLBACK
test=# SAVEPOINT s; DELETE FROM tab; ROLLBACK TO SAVEPOINT s;
SAVEPOINT
DELETE 3
ROLLBACK
test=# SELECT a FROM tab;
a
---
1
(1 row)

test=# COMMIT;
COMMIT
test=# SELECT a FROM tab;
a
---
1
2
3
(3 rows)

test=# DROP TABLE tab;
DROP TABLE



combocid.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] updated hash functions for postgresql v1

2008-08-29 Thread Kenneth Marshall
On Sun, Oct 28, 2007 at 08:06:58PM +, Simon Riggs wrote:
> On Sun, 2007-10-28 at 13:05 -0500, Kenneth Marshall wrote:
> > On Sun, Oct 28, 2007 at 05:27:38PM +, Simon Riggs wrote:
> > > On Sat, 2007-10-27 at 15:15 -0500, Kenneth Marshall wrote:
> > > > Its features include a better and faster hash function.
> > > 
> > > Looks very promising. Do you have any performance test results to show
> > > it really is faster, when compiled into Postgres? Better probably needs
> > > some definition also; in what way are the hash functions better?
> > >  
> > > -- 
> > >   Simon Riggs
> > >   2ndQuadrant  http://www.2ndQuadrant.com
> > > 
> > The new hash function is roughly twice as fast as the old function in
> > terms of straight CPU time. It uses the same design as the current
> > hash but provides code paths for aligned and unaligned access as well
> > as separate mixing functions for different blocks in the hash run
> > instead of having one general purpose block. I think the speed will
> > not be an obvious win with smaller items, but will be very important
> > when hashing larger items (up to 32kb).
> > 
> > Better in this case means that the new hash mixes more thoroughly
> > which results in less collisions and more even bucket distribution.
> > There is also a 64-bit varient which is still faster since it can
> > take advantage of the 64-bit processor instruction set.
> 
> Ken, I was really looking for some tests that show both of the above
> were true. We've had some trouble proving the claims of other algorithms
> before, so I'm less inclined to take those things at face value.
> 
> I'd suggest tests with Integers, BigInts, UUID, CHAR(20) and CHAR(100).
> Others may have different concerns.
> 
> -- 
>   Simon Riggs
>   2ndQuadrant  http://www.2ndQuadrant.com
> 
Hi,

I have finally had a chance to do some investigation on
the performance of the old hash mix() function versus
the updated mix()/final() in the new hash function. Here
is a table of my current results for both the old and the
new hash function. In this case cracklib refers to the
cracklib-dict containing 1648379 unique words massaged
in various ways to generate input strings for the hash
functions. The result is the number of collisions in the
hash values generated.

hash inputoldnew
--------
cracklib  338316
cracklib x 2 (i.e. clibclib)  305319
cracklib x 3 (clibclibclib)   323329
cracklib x 10 302310
cracklib x 100350335
cracklib x 1000   314309
cracklib x 100 truncated to char(100) 311327

uint32 from 1-1648379 309319
(uint32 1-1948379)*256309314
(uint32 1-1948379)*16 310314
"a"uint32 (i.e. a1,a0002...)  320321

uint32uint32 (i.e. uint64)321287

In addition to these tests, the new mixing functions allow
the hash to pass the frog2.c test by Bob Jenkins. Here is
his comment from http://burtleburtle.net/bob/hash/doobs.html:

  lookup3.c does a much more thorough job of mixing than any
  of my previous hashes (lookup2.c, lookup.c, One-at-a-time).
  All my previous hashes did a more thorough job of mixing
  than Paul Hsieh's hash. Paul's hash does a good enough job
  of mixing for most practical purposes.

  The most evil set of keys I know of are sets of keys that are
  all the same length, with all bytes zero, except with a few
  bits set. This is tested by frog.c.. To be even more evil, I
  had my hashes return b and c instead of just c, yielding a
  64-bit hash value. Both lookup.c and lookup2.c start seeing
  collisions after 2^53 frog.c keypairs. Paul Hsieh's hash sees
  collisions after 2^17 keypairs, even if we take two hashes with
  different seeds. lookup3.c is the only one of the batch that
  passes this test. It gets its first collision somewhere beyond
  2^63 keypairs, which is exactly what you'd expect from a completely
  random mapping to 64-bit values.

If anyone has any other data for me to test with, please let me
know. I think this is a reasonable justification for including the
new mixing process (mix() and final()) as well as the word-at-a-time
processing in our hash function. I will be putting a small patch
together to add the new mixing process back in to the updated hash
function this weekend in time for the September commit-fest unless
there are objections. Both the old and the new hash functions meet
the strict avalanche conditions as well.
(http://home.comcast.net/~bretm/hash/3.html)

I have used an Inline::C perl driver for these tests and can post
it if others would like to use it as a testbed.

Regards,
Ken
avalance

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] psql command setting

2008-08-29 Thread Alvaro Herrera
Simon Riggs wrote:

> At the very least we should document the two toggles that are already
> settable, with the attached patch.

Applied.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches