Re: [sqlite] SQLite character comparisons

Dennis Cote Mon, 21 Jan 2008 08:46:59 -0800

Fowler, Jeff wrote:

Hello All,
Not trying to be antagonistic, but I'm curious to know how many of you agree with Darren's sentiments on this issue. To restate briefly, ANSI SQL-92 specifies that when comparing two character fields, trailing spaces should be ignored. Correct me if I'm wrong Darren, but you feel this is a bad decision, and in fact SQLite's implementation of character comparison (respecting trailing spaces) is superior to ANSI's specs. Keep in mind this is not some obscure issue that can be subject to different interpretations by different vendors; it's very clearly stated: "The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them."

Jeff,


I think you are mistaken about what the ANSI spec says.

There are two string types in ANSI SQL: character strings (which come in several subtypes) and binary strings. The following excerpts are taken from the SQL:1999 spec.

Section 4.2.1, Character Strings and Collations, describes the operations on character strings. It describes comparisons as:

Given a collating sequence, two character strings are identical if and only if they are equal in accordance with the comparison rules specified in Subclause 8.2, "<comparison predicate>". The collating sequence used for a particular comparison is determined as in Subclause 4.2.3, "Rules determining collating sequence usage".

Binary strings are defined in Section 4.3 as:

A binary string is a sequence of octets that does not have either a character set or collation associated with it.

And their comparison is detailed in 4.3.1 as:

All binary strings are mutually comparable. A binary string is identical to another binary string if and only if it is equal to that binary string in accordance with the comparison rules specified in Subclause 8.2, "<comparison predicate>".

General Rules 3 and 4 of section 8.2 <comparison predicate> describe the comparison of these strings. I have copied these sections below.

3) The comparison of two character strings is determined as follows:
   a) Let CS be the collating sequence indicated in Subclause 4.2.3, "Rules determining collating sequence usage", based on the declared types of the two character strings.
   b) If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD characteristic, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a <space>.
   c) The result of the comparison of X and Y is given by the collating sequence CS.
   d) Depending on the collating sequence, two strings may compare as equal even if they are of different lengths or contain different sequences of characters. When any of the operations MAX, MIN, and DISTINCT reference a grouping column, and the UNION, EXCEPT, and INTERSECT operators refer to character strings, the specific value selected by these operations from a set of such equal values is implementation-dependent.
   NOTE 129 – If the coercibility characteristic of the comparison is Coercible, then the collating sequence used is the default defined for the character repertoire. See also other Syntax Rules in this Subclause, Subclause 10.6, "<character set specification>", and Subclause 11.30, "<character set definition>".
4) The comparison of two binary string values, X and Y, is determined by comparison of their octets with the same ordinal position. If Xi and Yi are the values of the i-th octets of X and Y, respectively, and if Lx is the length in octets of X AND Ly is the length in octets of Y, then X is equal to Y if and only if Ly = Ly and if Xi = Yi for all i.

I note that there is a typo in rule 4 for binary strings; Ly = Ly should be Lx = Ly, since binary strings can only be compared for equality.
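
Read with Lx = Ly, rule 4 amounts to a length check followed by an octet-by-octet comparison, with no padding step at all. A minimal Python sketch of that reading (the function name binary_equal is made up for illustration and does not come from the standard):

```python
def binary_equal(x: bytes, y: bytes) -> bool:
    """Equality of two binary strings under General Rule 4, read as Lx = Ly."""
    # The lengths in octets must match; the shorter value is never padded.
    if len(x) != len(y):
        return False
    # Xi = Yi must hold for every ordinal position i.
    return all(xi == yi for xi, yi in zip(x, y))
```

Under this reading a trailing octet always matters: b"abc" and b"abc " are not equal.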

Rule 3.b details how strings of unequal length are to be compared. It allows exactly the operation performed by SQLite, since it allows collating sequences to have a NO PAD characteristic, which results in the shorter string comparing less than the longer string.
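
To make the two cases of rule 3.b concrete, here is a rough Python model. It rests on assumptions that are mine, not the standard's: plain code point order stands in for the collating sequence CS, and the value -1 stands in for the implementation-dependent NO PAD pad character, since it sorts below every real character. The function name compare_strings is hypothetical.

```python
def compare_strings(x: str, y: str, pad_space: bool) -> int:
    """Return -1, 0 or 1 for x versus y, following the padding step of rule 3.b.

    Assumption: code point order stands in for the collating sequence CS.
    """
    # PAD SPACE pads with <space>; NO PAD pads with something that collates
    # below every character, modelled here by the sentinel -1.
    pad = ord(" ") if pad_space else -1
    width = max(len(x), len(y))
    kx = [ord(c) for c in x] + [pad] * (width - len(x))
    ky = [ord(c) for c in y] + [pad] * (width - len(y))
    return (kx > ky) - (kx < ky)


print(compare_strings("abc", "abc ", pad_space=True))   # 0: equal once padded
print(compare_strings("abc", "abc ", pad_space=False))  # -1: shorter is less
```

The same pair of strings is equal under PAD SPACE and unequal under NO PAD, and both outcomes follow from the one rule.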

This distinction also appears in section 4.12, which discusses type conversions and mixing of data types. It says:

Values corresponding to the data types CHARACTER, CHARACTER VARYING, and CHARACTER LARGE OBJECT are mutually assignable if and only if they are taken from the same character repertoire. If they are from different character repertoires, then the value of the source of the assignment must be translated to the character repertoire of the target before an assignment is possible. Such translation may be implementation-defined and implicitly performed, in which case the two character data types are also mutually assignable. If a store assignment would result in the loss of non-<space> characters due to truncation, then an exception condition is raised. If a retrieval assignment would result in the loss of characters due to truncation, then a warning condition is raised. The values are mutually comparable only if they are mutually assignable and can be coerced to have the same collation. The comparison of two character strings depends on the collating sequence used for the comparison (see Table 3, "Collating sequence usage for comparisons"). When values of unequal length are compared, if the collating sequence for the comparison has the NO PAD characteristic and the shorter value is equal to a prefix of the longer value, then the shorter value is considered less than the longer value. If the collating sequence for the comparison has the PAD SPACE characteristic, for the purposes of the comparison, the shorter value is effectively extended to the length of the longer by concatenation of <space>s on the right.
Values corresponding to the binary data type are mutually assignable. If a store assignment would result in the loss of non-zero octets due to truncation, then an exception condition is raised. If a retrieval assignment would result in the loss of octets due to truncation, then a warning condition is raised. When binary string values are compared, they must have exactly the same length (in octets) to be considered equal. Binary string values can only be compared for equality.

This again explains that a collating sequence can have a NO PAD property, which prevents padding the shorter string with spaces for comparison, and that binary strings can only be compared for equality.

The only place in the standard that I can find any explicit mention of removing spaces is the description of casting a string to a numeric value. In this case the leading and trailing spaces are to be removed from the string before it is converted.

So, while the standard does allow the operation you describe (actually it does the opposite: it pads the shorter string with spaces instead of removing trailing spaces from the longer string), it also allows the operation SQLite performs. It is simply the case that all of SQLite's collations have the NO PAD characteristic.
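
For anyone who wants to see both behaviours side by side, the sketch below uses Python's sqlite3 module. The first two expressions use SQLite's default comparison; the third uses a user-defined collation that pads the shorter string with spaces. The collation name PADSPACE and the helper names are made up for this illustration and are not part of SQLite.

```python
import sqlite3


def pad_space(a: str, b: str) -> int:
    """Hypothetical PAD SPACE collation: pad the shorter string, then compare."""
    width = max(len(a), len(b))
    a, b = a.ljust(width), b.ljust(width)
    return (a > b) - (a < b)


def compare_in_sqlite() -> tuple:
    conn = sqlite3.connect(":memory:")
    # Register the illustrative collation under an assumed name.
    conn.create_collation("PADSPACE", pad_space)
    row = conn.execute(
        "SELECT 'abc' = 'abc ', "                  # default: trailing space counts
        "       'abc' < 'abc ', "                  # default: shorter prefix is less
        "       'abc' = 'abc ' COLLATE PADSPACE"   # padded: the two compare equal
    ).fetchone()
    conn.close()
    return row


print(compare_in_sqlite())  # (0, 1, 1)
```

With the default comparison the shorter string is unequal to and less than the longer one, which is the NO PAD outcome; the space-padding collation makes the same two strings equal.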


HTH
Dennis Cote


