Re: [exim-dev] UTF-8 and Exim string operations

2018-08-18 Thread Jeremy Harris via Exim-dev
On 08/18/2018 08:38 AM, Heiko Schlittermann via Exim-dev wrote:
> And a new additional main option
> 
> string_encoding = ascii | utf8  (default: ascii)
> 
> which can then switch ${strlen:…} to be equivalent to ${ustrlen:…}

I'm not particularly happy about global mode-switches.  Too much
scope for "oops!".


Doing it as an option would look like ${strlen/utf8:...}
modulo choice of the option separator.  We don't currently
have any expansion operators that use options, which is
a mark against going that way.
-- 
Cheers,
  Jeremy

-- 
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim 
details at http://www.exim.org/ ##


Re: [exim-dev] UTF-8 and Exim string operations

2018-08-18 Thread Heiko Schlittermann via Exim-dev
Heiko Schlittermann  (Sa 18 Aug 2018 09:29:50 CEST):
> > This.
> >
> > Add new operators, or options on current ones; don't
> > change how they currently work (barring bugs).
>
> +1

After a little bit more thinking

${astrlen:Ötzi} yields 5
${ustrlen:Ötzi} yields 4


${strlen:…} is equivalent to ${astrlen:…}
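
A minimal sketch of the two counts above, assuming the input is UTF-8 encoded and that "Ö" is the single precomposed code point U+00D6. Python is used purely for illustration; the function names astrlen and ustrlen only mirror the proposed operators and are not existing Exim features.

```python
def astrlen(data: bytes) -> int:
    """Count octets, as a byte-oriented length operator would."""
    return len(data)


def ustrlen(data: bytes) -> int:
    """Count Unicode code points, assuming the input is valid UTF-8."""
    return len(data.decode("utf-8"))


oetzi = "\u00d6tzi".encode("utf-8")  # b'\xc3\x96tzi'
print(astrlen(oetzi))  # 5
print(ustrlen(oetzi))  # 4
```

The difference of one comes from "Ö" occupying two octets in UTF-8.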

And a new additional main option

string_encoding = ascii | utf8  (default: ascii)

which can then switch ${strlen:…} to be equivalent to ${ustrlen:…}

--
Heiko




Re: [exim-dev] UTF-8 and Exim string operations

2018-08-18 Thread Heiko Schlittermann via Exim-dev
Jeremy Harris via Exim-dev  (Fr 17 Aug 2018 13:03:33 CEST):
> On 08/17/2018 05:03 AM, Phil Pennock via Exim-dev wrote:
> > Anyone have strong feelings on how Exim should handle UTF-8 with
> > operators such as ${length_1:STR} ?
> >
> > Document that the current operators work on bytes
>
> This.
>
> Add new operators, or options on current ones; don't
> change how they currently work (barring bugs).

+1

--
Heiko




Re: [exim-dev] UTF-8 and Exim string operations

2018-08-17 Thread Phil Pennock via Exim-dev
On 2018-08-17 at 10:36 -, Jasen Betts via Exim-dev wrote:
> > and add ulength_1 for being UTF-8 aware?
>
> Would also need utf8-aware substr and strlen.

Yes, I was using length as an exemplar, not as an exhaustive list.  :)

I favored ulength too, but didn't want to just add a slew of new
expansion operators, items and conditions without at least mentioning it
somewhere first.

> is it going to count code-points or glyphs?

Code-points.  Exim has no business knowing about how a layout engine
might or might not choose to render code-points to glyphs.  I could see
a possibility for normalization handling as another function, for
correct SASLprep for authentication.

I'd really rather not, though.  Exim is setuid root and the main system
for handling such things, ICU, does lots of tricky sensitive stuff with
a history of security problems.
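
To make the code-point versus glyph distinction concrete, here is a small sketch using Python's standard unicodedata module. It only demonstrates NFC normalization and is not an implementation of SASLprep; the sample string is an assumption chosen for illustration.

```python
import unicodedata

# 'O' followed by COMBINING DIAERESIS (U+0308): typically rendered as one
# glyph "Ö", but stored as two code points.
decomposed = "O\u0308tzi"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))           # 5 code points
print(len(composed))             # 4 code points
print(composed == "\u00d6tzi")   # True
```

A code-point count therefore depends on the normalization form of the input, even though a renderer would typically show both strings identically.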

> > Look at the top-bit being set and assume UTF-8, or
> > will that break too much with all the places which are still ISO-8859-1?
> 
> Just looking at that bit won't tell you enough to count code-points or
> glyphs.

I know, this was a suggestion for determining if the string should be
treated as UTF-8 for changing the current expansion o/i/c features; it
sucks but it was the only viable alternative I could think of and I
wanted to at least present an _idea_ of something else, for inciting
feedback.
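
A short sketch of why the top-bit heuristic is weak: an ISO-8859-1 string also has the top bit set but is usually not valid UTF-8. The helper names below are hypothetical and Python is used only for illustration.

```python
def has_top_bit(data: bytes) -> bool:
    """True if any octet has the high bit set."""
    return any(b & 0x80 for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


latin1 = "\u00d6tzi".encode("iso-8859-1")  # b'\xd6tzi'
utf8 = "\u00d6tzi".encode("utf-8")         # b'\xc3\x96tzi'

print(has_top_bit(latin1), is_valid_utf8(latin1))  # True False
print(has_top_bit(utf8), is_valid_utf8(utf8))      # True True
```

Both inputs trip the top-bit test, so a full validity check would be needed before treating a string as UTF-8.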

I know a fair bit about UTF-8 internals and how to work with the various
aspects in multiple programming languages. :)

> parts of ${utf8clean can probably be re-used.

Yes, I thought of that, when pondering a new `utf8valid` expansion
condition.

> "${lc" "${uc" and "${if eqi" need consideration too

Only if we go the ICU route and include normalization forms.  Which ...
is more bloat than I'm happy with in Exim's current architecture.
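
For illustration only, the sketch below contrasts an ASCII-only case mapping on octets with a Unicode-aware one. It assumes a byte-oriented operator that touches only A-Z; the function names are hypothetical and say nothing about how Exim's ${lc is actually implemented.

```python
def ascii_lc(data: bytes) -> bytes:
    """Lowercase only ASCII letters; other octets pass through untouched."""
    return data.lower()  # bytes.lower() maps ASCII A-Z only


def unicode_lc(data: bytes) -> bytes:
    """Lowercase by code point, assuming valid UTF-8 input."""
    return data.decode("utf-8").lower().encode("utf-8")


shout = "\u00d6TZI".encode("utf-8")  # "ÖTZI"
print(ascii_lc(shout))    # b'\xc3\x96tzi'  ("Ötzi": the Ö is unchanged)
print(unicode_lc(shout))  # b'\xc3\xb6tzi'  ("ötzi")
```

A case-insensitive comparison built on the ASCII-only mapping would treat "Ö" and "ö" as different.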

-Phil



Re: [exim-dev] UTF-8 and Exim string operations

2018-08-17 Thread Jeremy Harris via Exim-dev
On 08/17/2018 05:03 AM, Phil Pennock via Exim-dev wrote:
> Anyone have strong feelings on how Exim should handle UTF-8 with
> operators such as ${length_1:STR} ?
> 
> Document that the current operators work on bytes

This.

Add new operators, or options on current ones; don't
change how they currently work (barring bugs).
-- 
Cheers,
  Jeremy





Re: [exim-dev] UTF-8 and Exim string operations

2018-08-17 Thread Jasen Betts via Exim-dev
On 2018-08-17, Phil Pennock via Exim-dev  wrote:
> Anyone have strong feelings on how Exim should handle UTF-8 with
> operators such as ${length_1:STR} ?
>
> Document that the current operators work on bytes

Yeah, stay with treating strings as NUL-terminated arrays of octets.
The same unit the RFCs use to define email and SMTP.

> and add ulength_1 for being UTF-8 aware?

Would also need utf8-aware substr and strlen.
Is it going to count code-points or glyphs?

> Look at the top-bit being set and assume UTF-8, or
> will that break too much with all the places which are still ISO-8859-1?

Just looking at that bit won't tell you enough to count code-points or
glyphs. You then need to group the octets together, and you need to do
something when you hit an invalid octet.
Parts of ${utf8clean can probably be re-used.
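
A rough sketch of grouping octets into code points while doing something explicit with invalid octets. It is deliberately simplified: it checks only the lead/continuation structure and does not reject overlong forms or surrogates. The function name is hypothetical and this is not how ${utf8clean works internally.

```python
def count_code_points(data: bytes) -> tuple:
    """Return (code_points, invalid_octets) for a byte string."""
    points = invalid = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length = 1          # plain ASCII
        elif 0xC0 <= lead < 0xE0:
            length = 2
        elif 0xE0 <= lead < 0xF0:
            length = 3
        elif 0xF0 <= lead < 0xF8:
            length = 4
        else:
            length = 0          # stray continuation octet or invalid lead
        tail = data[i + 1:i + length]
        complete = len(tail) == length - 1
        if length and complete and all((b & 0xC0) == 0x80 for b in tail):
            points += 1
            i += length
        else:
            invalid += 1        # count the bad octet and resynchronise
            i += 1
    return points, invalid


print(count_code_points("\u00d6tzi".encode("utf-8")))  # (4, 0)
print(count_code_points(b"\xd6tzi"))                   # (3, 1)
```

Here each invalid octet is counted separately; replacing it or rejecting the whole string would be equally defensible policies.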

"${lc" "${uc" and "${if eqi" need consideration too

-- 
 ت



[exim-dev] UTF-8 and Exim string operations

2018-08-16 Thread Phil Pennock via Exim-dev
Anyone have strong feelings on how Exim should handle UTF-8 with
operators such as ${length_1:STR} ?

Document that the current operators work on bytes and add ulength_1 for
being UTF-8 aware?  Look at the top-bit being set and assume UTF-8, or
will that break too much with all the places which are still ISO-8859-1?
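
As a hedged illustration of the question, the sketch below shows what taking the first unit of a UTF-8 string looks like when the unit is an octet versus a code point. The function names are hypothetical stand-ins for a byte-oriented length_1 and a UTF-8 aware ulength_1, and the input is assumed to be valid UTF-8.

```python
def length_bytes(n: int, data: bytes) -> bytes:
    """Keep the first n octets."""
    return data[:n]


def length_code_points(n: int, data: bytes) -> bytes:
    """Keep the first n code points, assuming valid UTF-8 input."""
    return data.decode("utf-8")[:n].encode("utf-8")


oetzi = "\u00d6tzi".encode("utf-8")   # b'\xc3\x96tzi'
print(length_bytes(1, oetzi))         # b'\xc3' (an incomplete sequence)
print(length_code_points(1, oetzi))   # b'\xc3\x96' ("Ö")
```

The byte-oriented result is no longer valid UTF-8 on its own, which is the practical risk of applying octet-based operators to multi-byte text.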

-Phil
