Re: Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-27 Thread elf
Yes, this is precisely my point - 'one or more'. The string-length with invalid embedded sequences is not guaranteed to be consistent, which seems like a problem. Doing a decode to ensure all points are valid - even if in the undefined sequences - seems to be a good idea to prevent secondary

Re: Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-27 Thread felix . winkelmann
> Question: if there is no translation at all, won't the invalid chars cause issues with things like string-length and string-copy procs? That is, since the number of octets can't be correctly translated to a number of glyphs, there will be some unpleasant side effects. Converting a

Re: Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-27 Thread elf
Question: if there is no translation at all, won't the invalid chars cause issues with things like string-length and string-copy procs? That is, since the number of octets can't be correctly translated to a number of glyphs, there will be some unpleasant side effects. -elf On 27 November 2023
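
To make the length concern concrete, here is a rough Python sketch (not CHICKEN code; the input bytes and the helper name `char_length` are invented for illustration). It shows that the "number of characters" for a byte string containing a stray byte depends entirely on which error policy the decoder applies, and is undefined under strict decoding.

```python
# Hypothetical input: 0xFF can never appear in well-formed UTF-8.
raw = b"a\xffb"

def char_length(data, policy):
    """Return the code-point count under an error policy, or None if decoding fails."""
    try:
        return len(data.decode("utf-8", errors=policy))
    except UnicodeDecodeError:
        return None

# Compare Python's standard error handlers on the same bytes.
lengths = {
    policy: char_length(raw, policy)
    for policy in ("strict", "ignore", "replace", "surrogateescape")
}
print(lengths)
```

With the stray byte dropped the string is shorter than with it replaced or escaped, so any string-length procedure has to commit to one policy to give a stable answer.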

Re: Proposed alternative encoding for stray UTF-8 bytes in strings

2023-11-27 Thread felix . winkelmann
> From the unicode-transition page:
> > The strategy that I favor in the moment is to handle all string data injected into the system transparently, the actual bytes are unchanged and unexpected UTF-8 bytes are decoded and marked as a U+DC80 - U+DCFF (low, trailing) UTF-16 surrogate
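
The quoted strategy looks similar to what Python exposes as the `surrogateescape` error handler (PEP 383). The sketch below uses that handler purely as a stand-in to show the mapping; it is not the CHICKEN implementation, and the sample bytes are made up. Each undecodable byte 0xNN becomes the lone low surrogate U+DC00 + 0xNN, and encoding with the same handler gives back the original bytes.

```python
# Illustration only: Python's built-in "surrogateescape" handler,
# used as an analogy for the strategy quoted above.
raw = b"ab\xffcd\x80"  # hypothetical data: 0xFF and a lone 0x80 are invalid UTF-8

text = raw.decode("utf-8", errors="surrogateescape")

# Each stray byte is represented by exactly one code point in U+DC80..U+DCFF.
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes unchanged.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(len(raw), len(text), escaped, round_trip == raw)
```

Under this scheme every stray byte counts as one element of the string, which gives length and copy operations a consistent unit to work with while leaving the underlying bytes recoverable.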