Fowler, Jeff wrote:
Hello All,
Not trying to be antagonistic, but I'm curious to know how many of you agree with Darren's sentiments on this issue. To restate briefly, ANSI SQL-92 specifies that when comparing two character fields, trailing spaces should be ignored. Correct me if I'm wrong Darren, but you feel this is a bad decision, and in fact SQLite's implementation of character comparison (respecting trailing spaces) is superior to ANSI's specs. Keep in mind this is not some obscure issue that can be subject to different interpretations by different vendors; it's very clearly stated: "The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them."
Jeff,
I think you are mistaken about what the ANSI spec says.
There are two string types in ANSI SQL, character strings (which come is
several subtypes), and binary strings. The following excerpts are taken
from the SQL:1999 spec.
Section 4.2.1 Character Strings and Collations describes the operations
on character strings. It describes comparisons as
Given a collating sequence, two character strings are identical if and
only if they are equal in
accordance with the comparison rules specified in Subclause 8.2,
‘‘<comparison predicate>’’. The
collating sequence used for a particular comparison is determined as
in Subclause 4.2.3, ‘‘Rules
determining collating sequence usage’’.
Binary strings are defined in Section 4.3 as;
A binary string is a sequence of octets that does not have either a
character set or collation associated
with it.
And their comparison is detailed in 4.3.1 as;
All binary strings are mutually comparable. A binary string is
identical to another binary string
if and only if it is equal to that binary string in accordance with
the comparison rules specified in
Subclause 8.2, ‘‘<comparison predicate>’’.
General Rules 3 and 4 of section 8.2 <comparison predicate> describe the
comparison of these strings. I have copied these sections below.
3) The comparison of two character strings is determined as follows:
a) Let CS be the collating sequence indicated in Subclause 4.2.3,
‘‘Rules determining collating
sequence usage’’, based on the declared types of the two character
strings.
b) If the length in characters of X is not equal to the length in
characters of Y, then the shorter
string is effectively replaced, for the purposes of comparison, with a
copy of itself that has
been extended to the length of the longer string by concatenation on
the right of one or more
pad characters, where the pad character is chosen based on CS. If CS
has the NO PAD
characteristic, then the pad character is an implementation-dependent
character different
from any character in the character set of X and Y that collates less
than any string under
CS. Otherwise, the pad character is a <space>.
c) The result of the comparison of X and Y is given by the collating
sequence CS.
d) Depending on the collating sequence, two strings may compare as
equal even if they are
of different lengths or contain different sequences of characters.
When any of the operations
MAX, MIN, and DISTINCT reference a grouping column, and the UNION, EXCEPT,
and INTERSECT operators refer to character strings, the specific value
selected by these
operations from a set of such equal values is implementation-dependent.
NOTE 129 – If the coercibility characteristic of the comparison is
Coercible, then the collating sequence
used is the default defined for the character repertoire. See also
other Syntax Rules in this Subclause,
Subclause 10.6, ‘‘<character set specification>’’, and Subclause
11.30, ‘‘<character set definition>’’.
4) The comparison of two binary string values, X and Y, is determined
by comparison of their
octets with the same ordinal position. If Xi and Yi are the values of
the i-th octets of X and Y,
respectively, and if Lx is the length in octets of X AND Ly is the
length in octets of Y, then X is
equal to Y if and only if Ly = Ly and if Xi = Yi for all i.
I note that there is a typo in rule 4 for binary strings; Ly = Ly should
be Lx = Ly, since binary strings can only be compared for equality.
Rule 3.b details how strings of unequal length are to be compared. It
allows exactly the operation performed by SQLite, since it allows
collating sequences to have a NO PAD characteristic which results in the
shorter string comparing less than the longer string.
This distinction also appears in section 4.12 which discusses type
conversions and mixing of data types. It says;
Values corresponding to the data types CHARACTER, CHARACTER VARYING,
and CHARACTER
LARGE OBJECT are mutually assignable if and only if they are taken
from the same character
repertoire. If they are from different character repertoires, then the
value of the source of the
assignment must be translated to the character repertoire of the
target before an assignment is
possible. Such translation may be implementation-defined and
implicitly performed, in which case
the two character data types are also mutually assignable. If a store
assignment would result
in the loss of non-<space> characters due to truncation, then an
exception condition is raised. If
a retrieval assignment would result in the loss of characters due to
truncation, then a warning
condition is raised. The values are mutually comparable only if they
are mutually assignable
and can be coerced to have the same collation. The comparison of two
character strings depends
on the collating sequence used for the comparison (see Table 3,
‘‘Collating sequence usage for
comparisons’’). When values of unequal length are compared, if the
collating sequence for the
comparison has the NO PAD characteristic and the shorter value is
equal to a prefix of the longer
value, then the shorter value is considered less than the longer
value. If the collating sequence for
the comparison has the PAD SPACE characteristic, for the purposes of
the comparison, the shorter
value is effectively extended to the length of the longer by
concatenation of <space>s on the right.
Values corresponding to the binary data type are mutually assignable.
If a store assignment would
result in the loss of non-zero octets due to truncation, then an
exception condition is raised. If a
retrieval assignment would result in the loss of octets due to
truncation, then a warning condition is
raised. When binary string values are compared, they must have exactly
the same length (in octets)
to be considered equal. Binary string values can only be compared for
equality.
Which again explains that a collating sequence can have a NO PAD
property which prevents padding the shorter string for comparison, and
that binary strings can only be compared for equality.
The only place in the standard that I can find any explicit mention of
removing spaces is the description of casting a string to a numeric
value. In this case the leading and trailing spaces are to be removed
from the string before it is converted.
So, while the standard does allow the operation you describe (actually
it does the opposite, it pads the shorter string with spaces, instead of
removing trailing spaces from the longer string), it also allows the
operation SQLite performs. It is simply the case that all of SQLite's
collations have the NO PAD characteristic.
HTH
Dennis Cote
-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------