>So, for example, if one wanted to find all rows where myNormalColumn
>ENDS WITH 'fi c d', one could search myFlippedColumn like this:
>
>select * from LEXICON where myFlippedColumn LIKE 'd c if%' -- allows index use
Make this
select * from LEXICON where myFlippedColumn LIKE flip('fi c d') || '%'
and you get rid of _this_ issue.
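
As a minimal sketch of what that could look like: SQLite has no built-in
flip(), so the version below is a hypothetical user-defined function
registered from Python's sqlite3 module. The table contents are made up
for illustration, and the reversal is deliberately naive (codepoint by
codepoint).

```python
import sqlite3

def flip(s):
    # Naive reversal, codepoint by codepoint. No attempt is made to keep
    # combining sequences together.
    return None if s is None else s[::-1]

conn = sqlite3.connect(":memory:")
# Register the hypothetical flip() so it can be called from SQL.
conn.create_function("flip", 1, flip)

conn.execute("CREATE TABLE LEXICON (myNormalColumn TEXT, myFlippedColumn TEXT)")
conn.execute("CREATE INDEX idx_flipped ON LEXICON (myFlippedColumn)")

# Illustrative rows only; the flipped column is filled in by flip().
words = ["abc fi c d", "fi c d", "fi c d x"]
conn.executemany(
    "INSERT INTO LEXICON VALUES (?, flip(?))", [(w, w) for w in words]
)

# "ENDS WITH 'fi c d'" expressed as a prefix search on the flipped column.
rows = conn.execute(
    "SELECT myNormalColumn FROM LEXICON "
    "WHERE myFlippedColumn LIKE flip('fi c d') || '%' "
    "ORDER BY myNormalColumn"
).fetchall()
matches = [r[0] for r in rows]
print(matches)
```

One caveat, as far as I recall from the SQLite optimizer documentation:
the LIKE optimization expects the right-hand side to be a string literal
or a bound parameter (and suitable case-sensitivity or collation
settings), so if index use is the goal it may be safer to flip the term
in the host language and bind the finished pattern.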
But if you happen to have the decomposed A grave 'À' that Igor gave as
an example stored as a single codepoint (or vice versa), or with any
spacing modifier (or half an infinity of them!), then you lose any
chance of a match. Also, as Igor just replied, collation wouldn't work
nicely.
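
To make the mismatch concrete, here is a small sketch using Python's
unicodedata module. The flip() helper is again a hypothetical naive
codepoint reversal, not anything SQLite provides.

```python
import unicodedata

precomposed = "\u00C0"   # 'À' as a single codepoint
decomposed = "A\u0300"   # 'A' followed by COMBINING GRAVE ACCENT

def flip(s):
    # Hypothetical naive reversal, codepoint by codepoint.
    return s[::-1]

# The two spellings render alike but are different codepoint sequences,
# so a plain comparison (or LIKE) sees them as different strings.
same_raw = precomposed == decomposed

# After flipping, the combining accent precedes its base letter.
flipped_decomposed = flip(decomposed)

# Normalizing both sides to the same form makes them compare equal again.
same_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
print(same_raw, [hex(ord(c)) for c in flipped_decomposed], same_nfc)
```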
>This doesn't really require combining-form intelligence on the part of
>the developer's code either. As long as the search-term on the RHS gets
>flipped codepoint-by-codepoint and no attempt is made to "be
>intelligent" about the combining form, everything will be hunky-dory.
That seems to me another good instance of the "know your data"
principle. The best bet for a given proprietary database would be to
work with strings conforming to some set of well-defined rules and
stick with them, at least for data subject to comparison. The rules
don't even have to be one of the "Normalized" forms and can be any
consistent invariant that fits the needs; the simpler the better, of
course. If collation is needed, then a much more complex flipping is
required in the general case.
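
A rough sketch of the "pick one invariant and stick with it" idea,
assuming NFC is the chosen rule. The canon() and flip() helpers and the
sample rows are hypothetical; any other consistent invariant could be
substituted for NFC.

```python
import sqlite3
import unicodedata

def canon(s):
    # Assumed invariant for this sketch: NFC. Applied to everything that
    # is stored and to everything that is searched for.
    return unicodedata.normalize("NFC", s)

def flip(s):
    # Hypothetical naive reversal, codepoint by codepoint.
    return s[::-1]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LEXICON (myNormalColumn TEXT, myFlippedColumn TEXT)")

# The same word arrives precomposed and decomposed, plus an unrelated word.
for raw in ["voil\u00E0", "voila\u0300", "hola"]:
    c = canon(raw)
    conn.execute("INSERT INTO LEXICON VALUES (?, ?)", (c, flip(c)))

# The search term arrives decomposed; canonicalize first, then flip.
term = "la\u0300"
pattern = flip(canon(term)) + "%"
count = conn.execute(
    "SELECT count(*) FROM LEXICON WHERE myFlippedColumn LIKE ?", (pattern,)
).fetchone()[0]
print(count)
```

Because both the stored values and the search term pass through the same
canon() before flipping, the two spellings of the word end up identical
in the table and both are found.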
Anyway, since the vast majority of DB applications appear to be in the
business area, is there a common need to work with anything other than
the most compact and easy-to-handle NFC (Normalization Form C) strings,
possibly with exotic spacing or modifier characters filtered out, at
the DB storage level? By that I mean the "typical" data one is likely
to index, search, or compare in most applications.
BTW, this raises a side question. Without hijacking the thread, I for
one would be interested to know how other major RDBMSs handle Unicode
data in their predefined fixed-size columns such as CHAR(25). My wild
guess is that the filtering layers apply a severe filter to every input
field, to avoid having 12 significant characters represented by a
453-codepoint sequence and then truncated to the first 25 codepoints,
several of which carry no information.
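
To illustrate the concern only (this does not describe how any
particular RDBMS behaves), here is a sketch of what a purely
codepoint-based cut at 25 would do to a string padded with combining
marks. The sample string is contrived.

```python
import unicodedata

# Three base letters, each followed by 12 combining marks:
# 3 * (1 + 12) = 39 codepoints for 3 user-perceived characters.
text = "".join(ch + "\u0301\u0323" * 6 for ch in "abc")
codepoints = len(text)

# Hypothetical truncation that simply counts codepoints.
truncated = text[:25]

# Base letters that survive the cut; combining() is nonzero for marks.
base_letters = [ch for ch in truncated if not unicodedata.combining(ch)]
print(codepoints, base_letters)
```

In this contrived case the third letter is lost entirely even though the
input held only three visible characters, which is why some filtering or
normalization ahead of a fixed-size column seems likely.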