Folks, I am following up on this: where did we land? The new implementation is roughly 70 LOC for UTF-8, so at first glance I don't see an issue with adding it to Elixir. However, the Elixir version would be UTF-8 only (part of the String module).
Thoughts?

On Saturday, October 7, 2023 at 10:51:34 PM UTC+2 Kip wrote:

> Cameron, I think this would be a useful addition to the Unicode library
> <https://github.com/elixir-unicode/unicode> I maintain. If that works for
> you, please open an issue there and we can collaborate. I think it being
> part of the Erlang `:unicode` module makes good sense too as José says but
> that's a longer "sales" and implementation cycle.
>
> On Saturday, October 7, 2023 at 7:40:50 PM UTC+11 José Valim wrote:
>
>> Hi Cameron,
>>
>> If the goal is to include this handling for UTF-16 and UTF-32, I suggest
>> proposing this to Erlang/OTP as new functions in the "unicode" module.
>> Otherwise, Elixir only has facilities to deal with UTF-8. You could propose
>> such a feature in their issues tracker.
>>
>> Also note that "rolling your own" or "depending on packages" is usually
>> not enough reasons for adding features to Elixir. Otherwise, one could
>> easily argue Decimal and Jason would be more important additions to the
>> language. :) We do describe which features we would consider part of the
>> language here: https://elixir-lang.org/development.html
>>
>> Other than that, awesome job on the library and benchmarks. :)
>>
>> On Sat, Oct 7, 2023 at 1:03 AM Kip <kipc...@gmail.com> wrote:
>>
>>> Your implementation is definitely fast and memory efficient so I retract
>>> my implementation comments. Now that I've run the benchmarking script and
>>> tested out a few different approaches leveraging the std lib I understand
>>> better why you've taken the approach you have. Nice work.
>>>
>>> On Saturday, October 7, 2023 at 9:26:37 AM UTC+11 Kip wrote:
>>>
>>>> Cameron, I think this is a useful proposal. Elixir has means to check
>>>> validity (String.valid?/1) and a mechanism to split valid and invalid code
>>>> points (String.chunk/2 with the :valid trait). But there isn't, to my
>>>> knowledge, a means to coerce validity. A couple of thoughts:
>>>>
>>>> 1. Since Elixir strings are, by definition, UTF8, I don't know that
>>>> special handling of UTF16 and UTF32 code points makes much sense -
>>>> although I accept this may be more Unicode compliant.
>>>> 2. What would the function be called? Since we have String.valid?/1
>>>> maybe String.validate/2 with an option `replace_invalid: utf8_string`. The
>>>> default `:replace_invalid` could be U+FFFD or it could be `nil`. If
>>>> the default is `nil` then there could also be a `String.validate!/2` that
>>>> raises if there is no `:replace_invalid` option.
>>>> 3. I think the implementation could leverage the code of
>>>> `String.chunk/2` which uses `String.next_codepoint/1`. That would simplify
>>>> implementation and be more consistent in code style.
>>>>
>>>> On Friday, October 6, 2023 at 12:24:28 PM UTC+11 cameron...@gmail.com
>>>> wrote:
>>>>
>>>>> As far as I can tell, neither Elixir nor Erlang have a built in
>>>>> function for replacing invalid sequences in Unicode. There's a suggested
>>>>> method on this page
>>>>> <https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=153>
>>>>> of the Unicode standard for handling this. Several other languages (Go
>>>>> <https://pkg.go.dev/bytes#ToValidUTF8>, Python
>>>>> <https://docs.python.org/3/library/stdtypes.html#bytes.decode>, C#
>>>>> <https://github.com/dotnet/docs/issues/13547>, etc) now follow this
>>>>> spec.
>>>>>
>>>>> Invalid Unicode's encountered frequently enough that I think it's
>>>>> worth incorporating a solution into Elixir itself.
>>>>>
>>>>> Present alternatives to handling invalid unicode (and json by
>>>>> extension <https://github.com/michalmuskala/jason/issues/174>) are:
>>>>>
>>>>> - Crashing (not ideal in many cases)
>>>>> - Roll your own (lot of overhead for accidental complexity)
>>>>> - Depend on a package (+1 package towards dependency hell)
>>>>>
>>>>> This is my college try
>>>>> <https://github.com/Moosieus/UniRecover/tree/main>, but I'm certain
>>>>> there's a performant and far cleaner solution to be had in pure Elixir.
>>>>> If not, perhaps this is a request for OTP.
>>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elixir-lang-core" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elixir-lang-co...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elixir-lang-core/197620a2-6a96-41c6-a6e7-5da03e351080n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/elixir-lang-core/197620a2-6a96-41c6-a6e7-5da03e351080n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
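
For readers skimming the thread, the UTF-8-only behaviour under discussion (coercing a binary into a valid string by substituting U+FFFD) can be sketched roughly as below. This is an illustration only, not the UniRecover code and not the ~70 LOC implementation mentioned above: the module name `ReplaceInvalidSketch` and the function `replace_invalid/2` are hypothetical. It also assumes a simplification: every invalid byte gets its own replacement, whereas the practice described in the Unicode standard would replace a truncated multi-byte sequence with a single U+FFFD.

```elixir
defmodule ReplaceInvalidSketch do
  # Hypothetical module and function names, used for illustration only.
  @replacement "\uFFFD"

  # Returns a valid UTF-8 string, substituting `replacement` for every
  # byte that cannot start a well-formed UTF-8 sequence at its position.
  def replace_invalid(binary, replacement \\ @replacement)
      when is_binary(binary) and is_binary(replacement) do
    do_replace(binary, replacement, <<>>)
  end

  # A well-formed code point: copy it to the accumulator unchanged.
  defp do_replace(<<cp::utf8, rest::binary>>, replacement, acc),
    do: do_replace(rest, replacement, <<acc::binary, cp::utf8>>)

  # Anything else: drop one byte and emit the replacement in its place.
  # Simplification: one replacement per invalid byte.
  defp do_replace(<<_invalid, rest::binary>>, replacement, acc),
    do: do_replace(rest, replacement, <<acc::binary, replacement::binary>>)

  defp do_replace(<<>>, _replacement, acc), do: acc
end
```

With this sketch, `ReplaceInvalidSketch.replace_invalid("abc" <> <<0xFF>> <> "def")` returns `"abc\uFFFDdef"`, and a truncated sequence such as `<<0xE2, 0x82>>` becomes two replacement characters rather than one, which is where it departs from the standard's recommendation.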