Re: Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-28 Thread felix . winkelmann
> Yes, this is precisely my point - 'one or more'. The string-length with
> invalid embedded sequences is not guaranteed to be consistent, which seems
> like a problem. Doing a decode to ensure all points are valid - even if in
> the undefined sequences - seems to be a good idea to prevent

Re: Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-27 Thread elf
Yes, this is precisely my point - 'one or more'. The string-length with invalid embedded sequences is not guaranteed to be consistent, which seems like a problem. Doing a decode to ensure all points are valid - even if in the undefined sequences - seems to be a good idea to prevent secondary

Re: Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-27 Thread felix . winkelmann
> Question: if there is no translation at all, won't the invalid chars cause
> issues with things like string-length and string-copy procs? That is, since
> the number of octets can't be correctly translated to a number of glyphs,
> there will be some unpleasant side effects.

Converting a

Re: Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-27 Thread elf
Question: if there is no translation at all, won't the invalid chars cause issues with things like string-length and string-copy procs? That is, since the number of octets can't be correctly translated to a number of glyphs, there will be some unpleasant side effects.

-elf

On 27 November 2023

Re: Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-27 Thread felix . winkelmann
> From the unicode-transition page:
>
> > The strategy that I favor in the moment is to handle all string data
> > injected into the system transparently, the actual bytes are unchanged and
> > unexpected UTF-8 bytes are decoded and marked as a U+DC80 - U+DCFF (low,
> > trailing) UTF-16 surrogate
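
For readers who want to see the quoted strategy in action: Python's `surrogateescape` error handler (PEP 383) maps undecodable bytes into the same U+DC80 - U+DCFF range, so it can serve as a rough illustration. This is only an analogy under that assumption, not the implementation discussed in this thread; the helper names `decode_lossless` and `encode_lossless` are made up for the sketch.

```python
def decode_lossless(raw: bytes) -> str:
    """Decode UTF-8, turning each stray byte 0xNN (0x80..0xFF) into U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: U+DC80..U+DCFF become the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# "café " is valid UTF-8, 0xFF is a stray byte, "!" is valid again.
raw = b"caf\xc3\xa9 \xff!"
text = decode_lossless(raw)

# The stray byte is represented by exactly one low surrogate code point.
stray = text[5]
print(hex(ord(stray)))               # code point in the U+DC80..U+DCFF range

# Octet count and code-point count differ, but both are well defined.
print(len(raw), len(text))

# Encoding again restores the original bytes unchanged.
print(encode_lossless(text) == raw)
```

In this sketch every stray byte counts as one code point, so a length measured in code points stays well defined even when the input contains invalid sequences, and the original octets survive a decode/encode round trip.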

Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-23 Thread John Cowan
(If this is too late in the process, I understand. I think the required code changes will be small and localized.)

From the unicode-transition page:

> The strategy that I favor in the moment is to handle all string data
> injected into the system transparently, the actual bytes are unchanged and