Yeah, so I would go with the full coverage route. If you want to provide a
PR, it will be welcome. Thank you and sorry for the delay!

On Tue, Oct 31, 2023 at 11:48 PM Cameron Duley <cameron.dule...@gmail.com>
wrote:

> I'd originally looked for tests in the spec, browsers, and other languages
> to compare against.
>
> W3C's current test suite doesn't seem comprehensive:
>
> https://github.com/web-platform-tests/wpt/blob/master/encoding/replacement-encodings.any.js
>
> Go has a few token tests that are relatively intelligible:
>
> https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/bytes/bytes_test.go;l=1157
>
> On Tue, Oct 31, 2023 at 2:09 PM José Valim <jose.va...@dashbit.co> wrote:
>
>> Does the specification provide tests for us to include? Otherwise we can
>> include enough tests for full line coverage and a “brute force”/property
>> test commented out.
>>
>> I would say the name “replace_invalid” is excellent.
>>
>> On Tue, Oct 31, 2023 at 18:52 Cameron Duley <cameron.dule...@gmail.com>
>> wrote:
>>
>>> This was the final version I'd landed on for UTF-8:
>>>
>>> https://github.com/elixir-unicode/unicode/blob/main/lib/unicode/validation/utf8.ex
>>>
>>> Along with the following modules for testing:
>>>
>>> https://github.com/elixir-unicode/unicode/blob/main/test/support/unicode_validation_helpers.ex
>>>
>>> https://github.com/elixir-unicode/unicode/blob/main/test/unicode_validation_test.exs
>>>
>>> I think it's ideal functionality to have in the String module, and the
>>> implementation's "reasonable enough" until a native solution's available in
>>> OTP.
>>>
>>> Testing is my only uncertainty: how much is prudent, and in what manner?
>>>
>>> On Tuesday, October 31, 2023 at 12:35:48 PM UTC-4 José Valim wrote:
>>>
>>>> Folks, I am following up on this, where did we land?
>>>>
>>>> The new implementation is roughly 70 LOC for UTF-8, so at first I don't
>>>> see an issue with adding it to Elixir. However, the Elixir version would be
>>>> UTF-8 only (part of the String module).
>>>>
>>>> Thoughts?
>>>>
>>>> On Saturday, October 7, 2023 at 10:51:34 PM UTC+2 Kip wrote:
>>>>
>>>>> Cameron, I think this would be a useful addition to the Unicode
>>>>> library <https://github.com/elixir-unicode/unicode> I maintain. If
>>>>> that works for you, please open an issue there and we can collaborate. I
>>>>> think it being part of the Erlang `:unicode` module makes good sense too,
>>>>> as José says, but that's a longer "sales" and implementation cycle.
>>>>>
>>>>> On Saturday, October 7, 2023 at 7:40:50 PM UTC+11 José Valim wrote:
>>>>>
>>>>>> Hi Cameron,
>>>>>>
>>>>>> If the goal is to include this handling for UTF-16 and UTF-32, I
>>>>>> suggest proposing this to Erlang/OTP as new functions in the "unicode"
>>>>>> module. Otherwise, Elixir only has facilities to deal with UTF-8. You
>>>>>> could propose such a feature in their issues tracker.
>>>>>>
>>>>>> Also note that "rolling your own" or "depending on packages" is
>>>>>> usually not reason enough for adding features to Elixir. Otherwise, one
>>>>>> could easily argue Decimal and Jason would be more important additions to
>>>>>> the language. :) We do describe which features we would consider part of
>>>>>> the language here: https://elixir-lang.org/development.html
>>>>>>
>>>>>> Other than that, awesome job on the library and benchmarks. :)
>>>>>>
>>>>>> On Sat, Oct 7, 2023 at 1:03 AM Kip <kipc...@gmail.com> wrote:
>>>>>>
>>>>>>> Your implementation is definitely fast and memory efficient so I
>>>>>>> retract my implementation comments. Now that I've run the benchmarking
>>>>>>> script and tested out a few different approaches leveraging the std lib, I
>>>>>>> understand better why you've taken the approach you have. Nice work.
>>>>>>>
>>>>>>> On Saturday, October 7, 2023 at 9:26:37 AM UTC+11 Kip wrote:
>>>>>>>
>>>>>>>> Cameron, I think this is a useful proposal. Elixir has means to
>>>>>>>> check validity (String.valid?/1) and a mechanism to split valid and
>>>>>>>> invalid code points (String.chunk/2 with the :valid trait). But there
>>>>>>>> isn't, to my knowledge, a means to coerce validity. A couple of thoughts:
>>>>>>>>
>>>>>>>> 1. Since Elixir strings are, by definition, UTF8, I don't know that
>>>>>>>> special handling of UTF16 and UTF32 code points makes much sense,
>>>>>>>> although I accept this may be more Unicode compliant.
>>>>>>>> 2. What would the function be called? Since we have String.valid?/1
>>>>>>>> maybe String.validate/2 with an option `replace_invalid: utf8_string`.
>>>>>>>> The default `:replace_invalid` could be U+FFFD or it could be `nil`.
>>>>>>>> If the default is `nil` then there could also be a `String.validate!/2`
>>>>>>>> that raises if there is no `:replace_invalid` option.
>>>>>>>> 3. I think the implementation could leverage the code of
>>>>>>>> `String.chunk/2` which uses `String.next_codepoint/1`. That would
>>>>>>>> simplify implementation and be more consistent in code style.
>>>>>>>>
>>>>>>>> On Friday, October 6, 2023 at 12:24:28 PM UTC+11
>>>>>>>> cameron...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> As far as I can tell, neither Elixir nor Erlang has a built-in
>>>>>>>>> function for replacing invalid sequences in Unicode. There's a
>>>>>>>>> suggested method on this page
>>>>>>>>> <https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=153>
>>>>>>>>> of the Unicode standard for handling this. Several other languages (
>>>>>>>>> Go <https://pkg.go.dev/bytes#ToValidUTF8>, Python
>>>>>>>>> <https://docs.python.org/3/library/stdtypes.html#bytes.decode>, C#
>>>>>>>>> <https://github.com/dotnet/docs/issues/13547>, etc) now follow
>>>>>>>>> this spec.
>>>>>>>>>
>>>>>>>>> Invalid Unicode's encountered frequently enough that I think it's
>>>>>>>>> worth incorporating a solution into Elixir itself.
>>>>>>>>>
>>>>>>>>> Present alternatives to handling invalid Unicode (and JSON by
>>>>>>>>> extension <https://github.com/michalmuskala/jason/issues/174>)
>>>>>>>>> are:
>>>>>>>>>
>>>>>>>>>    - Crashing (not ideal in many cases)
>>>>>>>>>    - Roll your own (lot of overhead for accidental complexity)
>>>>>>>>>    - Depend on a package (+1 package towards dependency hell)
>>>>>>>>>
>>>>>>>>> This is my college try
>>>>>>>>> <https://github.com/Moosieus/UniRecover/tree/main>, but I'm
>>>>>>>>> certain there's a performant and far cleaner solution to be had in
>>>>>>>>> pure Elixir. If not, perhaps this is a request for OTP.
>>>>>>>>>
>>>>>>>>
>>>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "elixir-lang-core" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to elixir-lang-co...@googlegroups.com.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elixir-lang-core/197620a2-6a96-41c6-a6e7-5da03e351080n%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/elixir-lang-core/197620a2-6a96-41c6-a6e7-5da03e351080n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>
>>
>
>
> --
> Thanks,
>
> Cameron Duley
>
>
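
For readers following the thread, here is a rough sketch of the two ideas discussed above: replacing each ill-formed UTF-8 subsequence with U+FFFD (the "maximal subpart" practice the linked Unicode standard page describes, as I understand it), and a "brute force" check against an existing implementation. It is written in Python only because Python's bytes.decode is one of the reference behaviours linked in the original post. The names replace_invalid (borrowed from the name suggested in the thread), tail_ranges and check_against_builtin are illustrative assumptions; this is not the Elixir implementation linked above.

```python
import itertools
import random

REPLACEMENT = "\ufffd"
CONT = (0x80, 0xBF)  # ordinary continuation byte range


def tail_ranges(lead):
    """Allowed ranges for the bytes after `lead`, or None if `lead` cannot start a sequence."""
    if lead <= 0x7F:
        return []
    if 0xC2 <= lead <= 0xDF:
        return [CONT]
    if lead == 0xE0:
        return [(0xA0, 0xBF), CONT]  # excludes overlong 3-byte forms
    if lead == 0xED:
        return [(0x80, 0x9F), CONT]  # excludes UTF-16 surrogates
    if 0xE1 <= lead <= 0xEF:
        return [CONT, CONT]
    if lead == 0xF0:
        return [(0x90, 0xBF), CONT, CONT]  # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return [CONT, CONT, CONT]
    if lead == 0xF4:
        return [(0x80, 0x8F), CONT, CONT]  # stays at or below U+10FFFF
    return None  # 0x80..0xC1 and 0xF5..0xFF never start a sequence


def replace_invalid(data, replacement=REPLACEMENT):
    """Decode `data` as UTF-8, emitting one `replacement` per maximal ill-formed subpart."""
    out = []
    i, n = 0, len(data)
    while i < n:
        ranges = tail_ranges(data[i])
        if ranges is None:
            out.append(replacement)
            i += 1
            continue
        j = i + 1
        for low, high in ranges:
            if j < n and low <= data[j] <= high:
                j += 1
            else:
                break
        if j - i == len(ranges) + 1:
            out.append(data[i:j].decode("utf-8"))  # complete, well-formed sequence
        else:
            out.append(replacement)  # lead byte plus whatever valid prefix followed it
        i = j
    return "".join(out)


def check_against_builtin(max_len=2, random_cases=2000, seed=0):
    """Brute-force comparison with bytes.decode(errors="replace")."""
    rng = random.Random(seed)
    exhaustive = (
        bytes(t)
        for length in range(max_len + 1)
        for t in itertools.product(range(256), repeat=length)
    )
    sampled = (
        bytes(rng.randrange(256) for _ in range(rng.randrange(3, 9)))
        for _ in range(random_cases)
    )
    for data in itertools.chain(exhaustive, sampled):
        assert replace_invalid(data) == data.decode("utf-8", errors="replace"), data
    return True


if __name__ == "__main__":
    print(repr(replace_invalid(b"a\xffb")))
    print(check_against_builtin())
```

The exhaustive part covers every input of up to two bytes and the seeded random part samples longer inputs. That is one way to read the "full line coverage plus a brute force/property test" suggestion, with the expensive check kept separate from the fast unit tests.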
