On Wed, Mar 17, 2010 at 03:02:59PM -0500, Jay A. Kreibich wrote:
> On Wed, Mar 17, 2010 at 04:42:59PM -0300, Israel Lins Albuquerque scratched on the wall:
> > are you right the call to strpos("??????????", "??") are returning 5 and
> > not 3
> >
> > I'm looking for this...
>
> You can't do pointer math with values returned by strlen().

Hmmm. On Solaris strlen() returns the number of _bytes_, not characters, in the given string. The Linux manpage claims that strlen() returns the number of characters in the string but doesn't define "character"; then again, other glibc manpages have examples that use strlen() in pointer arithmetic (see utmpname(3), for example). I can't imagine the glibc strlen() counting characters in the Unicode sense, only in the old C sense (char), though I've not looked at its source code.

In any case, there's a lot of code out there that uses strlen() for pointer arithmetic. If strlen() really did count characters rather than bytes, pointer arithmetic on the string itself would still stay in bounds (since the number of characters in a string is necessarily less than or equal to the number of bytes), but it would cause other bugs, some of them potentially security bugs. So I believe it would be quite unsafe for strlen() to do anything other than count the bytes in a string (not including the NUL terminator).

As to strpos(), one should not implement it naively, nor anything like strstr() or strrstr(). The relevant Unicode concepts are: codepoint, character (composed of codepoints) and glyph (composed of characters and codepoints). Even if you support only codepoints you have to be mindful of the encoding: a codepoint can take several bytes in UTF-8 and two 16-bit units in UTF-16.

Multi-character glyphs are harder to deal with than multi-codepoint characters, since for the latter you can at least determine whether a codepoint is a combining codepoint (well, you have to map the codepoint against various codepoint ranges, so this is not a cheap operation). Normalization also affects strstr()-like functions: the same character can be stored precomposed or as a base codepoint plus combining codepoints, so a byte-wise comparison can miss a match. On the plus side, you can optimize such functions whenever you see two or more contiguous US-ASCII codepoints.

Cheers,
Nico
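
P.S. To make the bytes-versus-codepoints point concrete, here is a rough sketch of a codepoint-aware position function. It is not SQLite's strpos() or anyone's real implementation: utf8_cp_count() and utf8_strpos() are names I made up for illustration, and the code assumes both strings are valid, NUL-terminated UTF-8. It deals only in codepoints and ignores combining codepoints and normalization entirely, so it is exactly the "codepoints only" case described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the codepoints in the first n bytes of s.
 * Assumes valid UTF-8: every byte that is not a continuation byte
 * (bit pattern 10xxxxxx) starts a new codepoint. */
static size_t utf8_cp_count(const char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: 1-based codepoint position of needle in haystack,
 * or 0 if needle does not occur. A byte-wise strstr() is enough to find
 * the match because, in valid UTF-8, lead bytes and continuation bytes
 * never overlap, so a match cannot start in the middle of a codepoint.
 * The byte offset is then converted to a codepoint position. */
static size_t utf8_strpos(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    const char *hit = strstr(haystack, needle);
    if (hit == NULL)
        return 0;
    return utf8_cp_count(haystack, (size_t)(hit - haystack)) + 1;
}
```

With a haystack of three two-byte codepoints, say "\xC3\xA1\xC3\xA9\xC3\xAD", strlen() reports 6 bytes while utf8_cp_count() over those 6 bytes reports 3. Searching for "\xC3\xAD" finds the match at byte offset 4, which is byte position 5 if you count from 1, whereas utf8_strpos() reports codepoint position 3. That is the same 5-versus-3 discrepancy as in the original report, assuming the string there was made of two-byte characters.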