If you have the same set of combining characters in a different order, is the result still considered the same character for string-matching purposes?
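
To make the ordering question concrete, here is a minimal sketch in C, assuming UTF-8 input. It uses "a" followed by U+0301 (combining acute accent) and U+0323 (combining dot below) in both orders. Those two marks have different canonical combining classes (230 and 220), so Unicode treats the two sequences as canonically equivalent, but as far as I know nothing in the standard C library normalizes them, so a plain byte comparison says they differ. The names same_bytes(), acute_then_dot, and dot_then_acute are made up for this example.

```c
#include <string.h>

/* Same base letter and the same two combining marks, in opposite order.
 * U+0301 COMBINING ACUTE ACCENT is CC 81 in UTF-8.
 * U+0323 COMBINING DOT BELOW    is CC A3 in UTF-8. */
static const char acute_then_dot[] = "a\xcc\x81\xcc\xa3";
static const char dot_then_acute[] = "a\xcc\xa3\xcc\x81";

/* Hypothetical helper: returns 1 when the two strings are byte-for-byte
 * identical, 0 otherwise. No Unicode normalization is attempted. */
int same_bytes(const char *a, const char *b)
{
  return !strcmp(a, b);
}
```

Calling same_bytes(acute_then_dot, dot_then_acute) returns 0 even though both strings should render as the same glyph, so any matcher that wants them to compare equal has to normalize both sides first (or leave that to a library that does).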
Does a regex . wildcard eat a Unicode character _and_ its trailing combining characters, or do you need a separate . for each code point whether or not it displays? (Do you wind up just ignoring the combining characters and matching only the characters with a width, or do you just match each Unicode code point, which must occur in sequence? I'm assuming none of the combining characters are changed via towupper()?)

Rob

P.S. For the moment, in my attempts to speed up grep, I'm just treating "has a byte > 127 in it" as "feed it to the regex engine and let REG_ICASE deal with it". That's not what the BSD one did, but I don't have any use cases where that comes up that I need to accelerate. My only use case being hex digits theoretically means I could have used a hash bucket size of 16, but I'm assuming that's not what real test data is doing...

_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
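
For reference, the "has a byte > 127 in it" test from the P.S. can be sketched in a few lines of C. This is only an illustration of the heuristic as described, not the actual toybox code; the name has_high_byte() is made up for the example.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if any byte in buf[0..len) is above 127,
 * meaning the data is not pure 7-bit ASCII, and 0 otherwise. */
int has_high_byte(const char *buf, size_t len)
{
  size_t i;

  /* Cast to unsigned char so the comparison works where char is signed. */
  for (i = 0; i < len; i++) if ((unsigned char)buf[i] > 127) return 1;

  return 0;
}
```

A caller would take the fast ASCII-only path when this returns 0, and otherwise hand the pattern to regcomp() with REG_ICASE and let the regex engine sort out case folding.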
