RE: [sqlite] SQLite character comparisons

Fowler, Jeff Mon, 21 Jan 2008 09:32:50 -0800

Hi Dennis,

I may very well be mistaken - wouldn't be the first time!! I tried to
access the actual specification from ANSI (www.ansi.org), but you either
have to purchase them or access one of the "drafts" (which a couple
folks in this thread have done), but the draft may or may not match the
actual standard. While Dr. Hipp's new RTRIM collation makes this a moot
point, the source for my statement came from this article:
http://support.microsoft.com/kb/316626



As I'd mentioned, we have both SQL Server 2005 and Oracle 10g and they
default to ignore trailing spaces on character compares, although it may
be possible to turn this off for those who don't like it. I believe (but
am not sure) that MYSql and PostGreSQL ignore them by default also. I
understand many of the arguments folks have against this and there's not
much point to continuing the discussion; I won't change their mind and
they won't change mine! But I guess that's why we have Republicans and
Democrats :-)

But all is well - the RTRIM collation option is a neat and simple
solution. My thanks to Dr. Hipp for that, and also to everyone who has
shown an interest in this issue.

- Jeff


-----Original Message-----
From: Dennis Cote [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 21, 2008 11:46 AM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] SQLite character comparisons

Fowler, Jeff wrote:
> Hello All,
>  
> Not trying to be antagonistic, but I'm curious to know how many of you
agree with Darren's sentiments on this issue. To restate briefly, ANSI
SQL-92 specifies that when comparing two character fields, trailing
spaces should be ignored. Correct me if I'm wrong Darren, but you feel
this is a bad decision, and in fact SQLite's implementation of character
comparison (respecting trailing spaces) is superior to ANSI's specs.
Keep in mind this is not some obscure issue that can be subject to
different interpretations by different vendors; it's very clearly
stated: "The ANSI standard requires padding for the character strings
used in comparisons so that their lengths match before comparing them."
>   
Jeff,

I think you are mistaken about what the ANSI spec says.

There are two string types in ANSI SQL, character strings (which come is
several subtypes), and binary strings. The following excerpts are taken
from the SQL:1999 spec.

Section 4.2.1 Character Strings and Collations describes the operations
on character strings. It describes comparisons as
> Given a collating sequence, two character strings are identical if and

> only if they are equal in accordance with the comparison rules 
> specified in Subclause 8.2, ''<comparison predicate>''. The collating 
> sequence used for a particular comparison is determined as in 
> Subclause 4.2.3, ''Rules determining collating sequence usage''.
Binary strings are defined in Section 4.3 as;
> A binary string is a sequence of octets that does not have either a 
> character set or collation associated with it.
And their comparison is detailed in 4.3.1 as;
> All binary strings are mutually comparable. A binary string is 
> identical to another binary string if and only if it is equal to that 
> binary string in accordance with the comparison rules specified in 
> Subclause 8.2, ''<comparison predicate>''.

General Rules 3 and 4 of section 8.2 <comparison predicate> describe the
comparison of these strings. I have copied these sections below.

> 3) The comparison of two character strings is determined as follows:
>
> a) Let CS be the collating sequence indicated in Subclause 4.2.3, 
> ''Rules determining collating sequence usage'', based on the declared 
> types of the two character strings.
>
> b) If the length in characters of X is not equal to the length in 
> characters of Y, then the shorter string is effectively replaced, for 
> the purposes of comparison, with a copy of itself that has been 
> extended to the length of the longer string by concatenation on the 
> right of one or more pad characters, where the pad character is chosen

> based on CS. If CS has the NO PAD characteristic, then the pad 
> character is an implementation-dependent character different from any 
> character in the character set of X and Y that collates less than any 
> string under CS. Otherwise, the pad character is a <space>.
>
> c) The result of the comparison of X and Y is given by the collating 
> sequence CS.
>
> d) Depending on the collating sequence, two strings may compare as 
> equal even if they are of different lengths or contain different 
> sequences of characters.
> When any of the operations
> MAX, MIN, and DISTINCT reference a grouping column, and the UNION, 
> EXCEPT, and INTERSECT operators refer to character strings, the 
> specific value selected by these operations from a set of such equal 
> values is implementation-dependent.
>
> NOTE 129 - If the coercibility characteristic of the comparison is 
> Coercible, then the collating sequence used is the default defined for

> the character repertoire. See also other Syntax Rules in this 
> Subclause, Subclause 10.6, ''<character set specification>'', and 
> Subclause 11.30, ''<character set definition>''.
>
> 4) The comparison of two binary string values, X and Y, is determined 
> by comparison of their octets with the same ordinal position. If Xi 
> and Yi are the values of the i-th octets of X and Y, respectively, and

> if Lx is the length in octets of X AND Ly is the length in octets of 
> Y, then X is equal to Y if and only if Ly = Ly and if Xi = Yi for all 
> i.
I note that there is a typo in rule 4 for binary strings; Ly = Ly should
be Lx = Ly, since binary strings can only be compared for equality.

Rule 3.b details how strings of unequal length are to be compared. It
allows exactly the operation performed by SQLite, since it allows
collating sequences to have a NO PAD characteristic which results in the
shorter string comparing less than the longer string.

This distinction also appears in section 4.12 which discusses type
conversions and mixing of data types. It says;
> Values corresponding to the data types CHARACTER, CHARACTER VARYING, 
> and CHARACTER LARGE OBJECT are mutually assignable if and only if they

> are taken from the same character repertoire. If they are from 
> different character repertoires, then the value of the source of the 
> assignment must be translated to the character repertoire of the 
> target before an assignment is possible. Such translation may be 
> implementation-defined and implicitly performed, in which case the two

> character data types are also mutually assignable. If a store 
> assignment would result in the loss of non-<space> characters due to 
> truncation, then an exception condition is raised. If a retrieval 
> assignment would result in the loss of characters due to truncation, 
> then a warning condition is raised. The values are mutually comparable

> only if they are mutually assignable and can be coerced to have the 
> same collation. The comparison of two character strings depends on the

> collating sequence used for the comparison (see Table 3, ''Collating 
> sequence usage for comparisons''). When values of unequal length are 
> compared, if the collating sequence for the comparison has the NO PAD 
> characteristic and the shorter value is equal to a prefix of the 
> longer value, then the shorter value is considered less than the 
> longer value. If the collating sequence for the comparison has the PAD

> SPACE characteristic, for the purposes of the comparison, the shorter 
> value is effectively extended to the length of the longer by 
> concatenation of <space>s on the right.
>
> Values corresponding to the binary data type are mutually assignable. 
> If a store assignment would
> result in the loss of non-zero octets due to truncation, then an 
> exception condition is raised. If a retrieval assignment would result 
> in the loss of octets due to truncation, then a warning condition is 
> raised. When binary string values are compared, they must have exactly

> the same length (in octets) to be considered equal. Binary string 
> values can only be compared for equality.
>
Which again explains that a collating sequence can have a NO PAD
property which prevents padding the shorter string for comparison, and
that binary strings can only be compared for equality.

The only place in the standard that I can find any explicit mention of
removing spaces is the description of casting a string to a numeric
value. In this case the leading and trailing spaces are to be removed
from the string before it is converted.

So, while the standard does allow the operation you describe (actually
it does the opposite, it pads the shorter string with spaces, instead of
removing trailing spaces from the longer string), it also allows the
operation SQLite performs. It is simply the case that all of SQLite's
collations have the NO PAD characteristic.

HTH
Dennis Cote



------------------------------------------------------------------------
-----
To unsubscribe, send email to [EMAIL PROTECTED]
------------------------------------------------------------------------
-----


-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------

RE: [sqlite] SQLite character comparisons

Reply via email to