Re: [exim-dev] UTF-8 and Exim string operations
On 08/18/2018 08:38 AM, Heiko Schlittermann via Exim-dev wrote:
> And a new additional main option
>
>     string_encoding = ascii | utf8 (default: ascii)
>
> which can then switch ${strlen:…} to be equivalent to ${ustrlen:…}

I'm not particularly happy about global mode-switches. Too much scope
for "oops!".

Doing it as an option would look like ${strlen/utf8:...}, modulo choice
of the option separator. We don't currently have any expansion
operators that use options, which is a mark against going that way.

-- 
Cheers,
  Jeremy

-- 
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##
Re: [exim-dev] UTF-8 and Exim string operations
Heiko Schlittermann (Sat 18 Aug 2018 09:29:50 CEST):
> > This.
> >
> > Add new operators, or options on current ones; don't
> > change how they currently work (barring bugs).
>
> +1

After a little bit more thinking:

    ${astrlen:Ötzi} yields 5
    ${ustrlen:Ötzi} yields 4
    ${strlen:…} is equivalent to ${astrlen:…}

And a new additional main option

    string_encoding = ascii | utf8 (default: ascii)

which can then switch ${strlen:…} to be equivalent to ${ustrlen:…}

-- 
Heiko
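
The 5-versus-4 difference above is the difference between counting octets and counting code points. A minimal sketch in Python, assuming the input is UTF-8; `astrlen` and `ustrlen` here are plain helper functions named after the operators proposed in this thread, not existing Exim features:

```python
# Illustrative helpers only: they model the *proposed* operators from the
# thread and are not part of Exim.

def astrlen(s: str) -> int:
    """Length in octets of the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def ustrlen(s: str) -> int:
    """Length of s in Unicode code points."""
    return len(s)


name = "\u00d6tzi"  # "Ötzi" with a precomposed O-diaeresis (U+00D6)

print(astrlen(name))  # 5: U+00D6 takes two octets in UTF-8
print(ustrlen(name))  # 4: one code point per visible letter
```

The string is written with an explicit escape so the result does not depend on how an editor or mail client stores the "Ö".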
Re: [exim-dev] UTF-8 and Exim string operations
Jeremy Harris via Exim-dev (Fri 17 Aug 2018 13:03:33 CEST):
> On 08/17/2018 05:03 AM, Phil Pennock via Exim-dev wrote:
> > Anyone have strong feelings on how Exim should handle UTF-8 with
> > operators such as ${length_1:STR} ?
> >
> > Document that the current operators work on bytes
>
> This.
>
> Add new operators, or options on current ones; don't
> change how they currently work (barring bugs).

+1

-- 
Heiko
Re: [exim-dev] UTF-8 and Exim string operations
On 2018-08-17 at 10:36 -, Jasen Betts via Exim-dev wrote:
> > and add ulength_1 for being UTF-8 aware?
>
> Would also need utf8-aware substr and strlen.

Yes, I was using length as an exemplar, not as an exhaustive list. :)

I favored ulength too, but didn't want to just add a slew of new
expansion operators, items and conditions without at least mentioning
it somewhere first.

> is it going to count code-points or glyphs?

Code-points. Exim has no business knowing about how a layout engine
might or might not choose to render code-points to glyphs.

I could see a possibility for normalization handling as another
function, for correct SASLprep for authentication. I'd really rather
not, though. Exim is setuid root and the main system for handling such
things, ICU, does lots of tricky sensitive stuff with a history of
security problems.

> > Look at the top-bit being set and assume UTF-8, or
> > will that break too much with all the places which are still ISO-8859-1?
>
> Just looking at that bit won't tell you enough to count code-points or
> glyphs.

I know; this was a suggestion for determining whether the string should
be treated as UTF-8 when changing the current expansion operators,
items and conditions. It sucks, but it was the only viable alternative
I could think of, and I wanted to at least present an _idea_ of
something else, to incite feedback.

I know a fair bit about UTF-8 internals and how to work with the
various aspects in multiple programming languages. :)

> parts of ${utf8clean can probably be re-used.

Yes, I thought of that, when pondering a new `utf8valid` expansion
condition.

> "${lc" "${uc" and "${if eqi" need consideration too

Only if we go the ICU route and include normalization forms. Which ...
is more bloat than I'm happy with in Exim's current architecture.

-Phil
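
To make the code-point versus glyph distinction concrete, here is a small Python sketch using only the standard `unicodedata` module. It assumes nothing about Exim; `code_points` is a hypothetical helper introduced for illustration. The same visible text can have different code-point counts until it is normalized, which is why normalization comes up at all:

```python
import unicodedata


def code_points(s: str) -> int:
    """Count Unicode code points (hypothetical helper, not an Exim feature)."""
    return len(s)


composed = "\u00d6tzi"     # O-diaeresis as a single code point
decomposed = "O\u0308tzi"  # "O" followed by COMBINING DIAERESIS (U+0308)

# Both strings render as the same four glyphs, but the counts differ.
print(code_points(composed))    # 4
print(code_points(decomposed))  # 5

# NFC normalization folds the combining sequence back into U+00D6.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Counting code points needs no Unicode tables; normalizing or counting glyphs does, which is the extra weight being argued against here.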
Re: [exim-dev] UTF-8 and Exim string operations
On 08/17/2018 05:03 AM, Phil Pennock via Exim-dev wrote:
> Anyone have strong feelings on how Exim should handle UTF-8 with
> operators such as ${length_1:STR} ?
>
> Document that the current operators work on bytes

This.

Add new operators, or options on current ones; don't
change how they currently work (barring bugs).

-- 
Cheers,
  Jeremy
Re: [exim-dev] UTF-8 and Exim string operations
On 2018-08-17, Phil Pennock via Exim-dev wrote:
> Anyone have strong feelings on how Exim should handle UTF-8 with
> operators such as ${length_1:STR} ?
>
> Document that the current operators work on bytes

Yeah, stay with treating strings as NUL-terminated arrays of octets,
the same unit the RFCs use to define email and SMTP.

> and add ulength_1 for being UTF-8 aware?

Would also need UTF-8-aware substr and strlen.

Is it going to count code-points or glyphs?

> Look at the top-bit being set and assume UTF-8, or
> will that break too much with all the places which are still ISO-8859-1?

Just looking at that bit won't tell you enough to count code-points or
glyphs. You then need to group the octets together, and you need to do
something when you hit a non-valid octet.

Parts of ${utf8clean can probably be re-used.

"${lc", "${uc" and "${if eqi" need consideration too.

-- 
ت
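
The "group the octets together" step can be sketched as follows. This is an illustrative Python function, not Exim code; `count_code_points` is a hypothetical name, and raising an error is just one possible answer to "do something when you hit a non-valid octet". It checks only the lead/continuation structure of each sequence and deliberately does not reject every ill-formed case (for example overlong three-byte forms or encoded surrogates), so it is not a full validator:

```python
def count_code_points(data: bytes) -> int:
    """Count UTF-8 code points by grouping octets into sequences.

    Structural check only: lead octet range plus 10xxxxxx continuation
    octets. Raises ValueError on a bad lead octet or a broken sequence.
    """
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1          # ASCII
        elif 0xC2 <= lead <= 0xDF:
            width = 2
        elif 0xE0 <= lead <= 0xEF:
            width = 3
        elif 0xF0 <= lead <= 0xF4:
            width = 4
        else:
            # Stray continuation octet, or a lead octet never valid in UTF-8.
            raise ValueError(f"invalid lead octet 0x{lead:02x} at offset {i}")
        tail = data[i + 1 : i + width]
        if len(tail) != width - 1 or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        count += 1
        i += width
    return count


print(count_code_points("\u00d6tzi".encode("utf-8")))  # 4
```

Feeding it the ISO-8859-1 bytes `b"\xd6tzi"` raises `ValueError`, because 0xD6 announces a two-octet sequence and the following `t` is not a continuation octet.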
[exim-dev] UTF-8 and Exim string operations
Anyone have strong feelings on how Exim should handle UTF-8 with
operators such as ${length_1:STR} ?

Document that the current operators work on bytes and add ulength_1 for
being UTF-8 aware?

Look at the top-bit being set and assume UTF-8, or will that break too
much with all the places which are still ISO-8859-1?

-Phil
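
Why the top-bit test is weak can be shown with a short Python sketch. The helper names `has_top_bit` and `is_valid_utf8` are made up for illustration and are not Exim features. The same word encoded as ISO-8859-1 and as UTF-8 has the top bit set in both cases, so the bit alone cannot tell them apart; a strict decode can reject the ISO-8859-1 form:

```python
def has_top_bit(data: bytes) -> bool:
    """True if any octet has its high bit set (the heuristic in question)."""
    return any(b & 0x80 for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


latin1 = "\u00d6tzi".encode("iso-8859-1")  # b'\xd6tzi'
utf8 = "\u00d6tzi".encode("utf-8")         # b'\xc3\x96tzi'

print(has_top_bit(latin1), has_top_bit(utf8))      # True True
print(is_valid_utf8(latin1), is_valid_utf8(utf8))  # False True
```

Even full validation is only a necessary condition: some ISO-8859-1 octet sequences also happen to be well-formed UTF-8, so a string that validates is not guaranteed to have been intended as UTF-8.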