On Wed, Mar 17, 2010 at 03:02:59PM -0500, Jay A. Kreibich wrote:
> On Wed, Mar 17, 2010 at 04:42:59PM -0300, Israel Lins Albuquerque scratched 
> on the wall:
> > You are right: the call to strpos("??????????", "??") is returning 5
> > and not 3.
> > 
> > I'm looking into this...
> 
>   You can't do pointer math with values returned by strlen().

Hmmm.  On Solaris strlen() returns the number of _bytes_, not
characters, in the given string.  The Linux manpage says that strlen()
returns the number of characters in the string without defining
"character", yet other glibc manpages have examples that use strlen()
in pointer arithmetic (see utmpname(3), for example).  I can't imagine
the glibc strlen() counting characters in the Unicode sense, only in
the old C sense (char), though I've not looked at its source code.

In any case, there's a lot of code out there that uses strlen() for
pointer arithmetic.  If strlen() really did count characters rather
than bytes, that arithmetic would not run past the end of the string
(the number of characters in a string is necessarily less than or equal
to the number of bytes), but it would cause other bugs, some of them
potentially security bugs.  So I believe it would be quite unsafe for
strlen() to do anything other than count the bytes in a string (not
including the NUL terminator).
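
To make the byte/codepoint distinction concrete, here is a minimal
sketch in C.  utf8_codepoint_count() is a hypothetical helper (it is
not part of any libc), and it assumes its input is valid UTF-8.  It
counts codepoints only, not characters or glyphs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the codepoints in a NUL-terminated string
 * that is assumed to be valid UTF-8.  Continuation bytes have the form
 * 10xxxxxx, so every byte that is not a continuation byte starts a new
 * codepoint. */
size_t utf8_codepoint_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "h\xC3\xA9llo" ("héllo" encoded as UTF-8), strlen()
gives 6 while this helper gives 5.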

As to strpos(), one should not naively implement it or anything like
strstr() or strrstr().  The relevant Unicode concepts are: codepoint,
character (composed of codepoints) and glyph (composed of characters and
codepoints).  Even if you support only codepoints you have to be mindful
of the multi-byte encodings, UTF-8 and UTF-16.  Multi-character glyphs
are harder to deal with than multi-codepoint characters, since for the
latter you can at least determine easily whether a codepoint is a
combining codepoint (well, you have to map the codepoint to various
codepoint ranges, so it is not a cheap operation).  Normalization also
affects strstr()-like functions: the same text can be encoded as
different codepoint sequences, so both strings have to be in the same
normalization form before they are compared.
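
As a rough illustration of the codepoint-only case, here is a sketch of
a strpos() that reports a codepoint position instead of a byte
position.  utf8_strpos() is a hypothetical name, not SQLite's or
anyone else's implementation.  It assumes both strings are valid UTF-8
and already in the same normalization form, and it makes no attempt to
handle combining codepoints or multi-character glyphs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the 1-based codepoint position of the
 * first occurrence of needle in haystack, or 0 if there is none.
 * Assumes valid UTF-8 in both arguments.  Because UTF-8 is
 * self-synchronizing, a byte-wise strstr() match of a valid needle
 * always begins on a codepoint boundary, so only the offset has to be
 * converted from bytes to codepoints. */
size_t utf8_strpos(const char *haystack, const char *needle)
{
    const char *hit, *p;
    size_t pos = 1;

    assert(haystack != NULL && needle != NULL);
    hit = strstr(haystack, needle);
    if (hit == NULL)
        return 0;
    /* Count the codepoints that start before the match. */
    for (p = haystack; p < hit; p++) {
        if (((unsigned char)*p & 0xC0) != 0x80)
            pos++;
    }
    return pos;
}
```

With the haystack "\xC3\xA9\xC3\xA9\xC3\xA0\xC3\xA9" (four two-byte
codepoints) and the needle "\xC3\xA0", the match starts at byte 5
counting from 1, but at codepoint 3.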

On the plus side you can optimize such functions whenever you see two or
more contiguous US-ASCII codepoints.
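
One way to read that optimization, as a sketch: a US-ASCII byte that is
followed by another US-ASCII byte (or by the terminator) cannot be part
of a multi-byte sequence and cannot have a combining codepoint attached
to it, so it can be treated as a whole character by plain byte
comparison.  ascii_fast_prefix() below is a hypothetical helper that
assumes valid UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return how many leading bytes of s can be
 * handled byte-by-byte.  A byte qualifies if it is US-ASCII and the
 * byte after it is US-ASCII too (the NUL terminator counts), so no
 * combining codepoint can follow it. */
size_t ascii_fast_prefix(const char *s)
{
    size_t i = 0;

    assert(s != NULL);
    /* s[i + 1] is safe to read because s[i] is not the terminator. */
    while (s[i] != '\0' &&
           (unsigned char)s[i] < 0x80 &&
           (unsigned char)s[i + 1] < 0x80)
        i++;
    return i;
}
```

For "abe\xCC\x81" ('e' followed by U+0301 COMBINING ACUTE ACCENT) the
fast prefix is 2 bytes: the 'e' is left for the slow path because a
combining codepoint follows it.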

Cheers,

Nico
-- 
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
