[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128462#comment-17128462 ]

Wes McKinney commented on ARROW-8961:
-------------------------------------

unilib's license (MPL 2.0) isn't ideal, see https://www.apache.org/legal/resolved.html#weak-copyleft-licenses. I'd prefer to only depend on MPL 2.0 libraries as a last resort.

> [C++] Vendor utf8proc library
> -----------------------------
>
>                 Key: ARROW-8961
>                 URL: https://issues.apache.org/jira/browse/ARROW-8961
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
>
> This is a minimal MIT-licensed library for UTF-8 data processing originally
> developed for use in Julia
> https://github.com/JuliaStrings/utf8proc

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128420#comment-17128420 ]

Antoine Pitrou commented on ARROW-8961:
---------------------------------------

I've compiled both libraries:
* {{utf8proc}} weighs around 300 kB (mostly static data)
* the weight of {{unilib}} depends on which functionality is being used, as it's header only; for example, a test executable that uses property lookup and conversion, but not codepoint combining, weighs around 120 kB
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128386#comment-17128386 ]

Antoine Pitrou commented on ARROW-8961:
---------------------------------------

Also, {{unilib}} uses a similar lookup scheme, so it's unlikely to be significantly faster (it's actually a bit more complicated, because it seems to compress the data tables more, at the expense of a slightly more complicated lookup).

A concern about {{unilib}}, though, would be that it has had a single contributor over its 6 years of existence.
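To illustrate what a table-based lookup scheme of this kind can look like, here is a minimal C++ sketch of a two-stage table for lowercase mapping. It is not the actual table layout of {{utf8proc}} or {{unilib}}: the names (ToLowerTwoStage, MakeBlocks, MakeStage1), the 256-codepoint block size, and the toy data (only ASCII and Latin-1 are populated, everything else maps to itself) are assumptions made for illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical two-stage lookup table, for illustration only.
// Code points are split into blocks of 256; stage 1 maps a block number to
// a stage-2 block, and identical stage-2 blocks are stored only once.
constexpr uint32_t kBlockBits = 8;
constexpr uint32_t kBlockSize = 1u << kBlockBits;
constexpr uint32_t kNumBlocks = 0x110000u >> kBlockBits;

using Block = std::array<int16_t, kBlockSize>;

// Stage 2: signed offsets from a code point to its lowercase form.
// Block 0 is all zeros (identity); block 1 covers U+0000..U+00FF.
inline std::array<Block, 2> MakeBlocks() {
  std::array<Block, 2> blocks{};  // zero-initialized
  Block& latin1 = blocks[1];
  for (uint32_t c = 'A'; c <= 'Z'; ++c) latin1[c] = 32;
  for (uint32_t c = 0xC0; c <= 0xDE; ++c) {
    if (c != 0xD7) latin1[c] = 32;  // U+00D7 is the multiplication sign
  }
  return blocks;
}

// Stage 1: index of the stage-2 block for each 256-codepoint range.
// In this toy table every range except the first shares the identity block.
inline std::array<uint8_t, kNumBlocks> MakeStage1() {
  std::array<uint8_t, kNumBlocks> stage1{};
  stage1[0] = 1;
  return stage1;
}

// Two array reads per code point: one in stage 1, one in stage 2.
inline uint32_t ToLowerTwoStage(uint32_t cp) {
  static const auto blocks = MakeBlocks();
  static const auto stage1 = MakeStage1();
  if (cp >= 0x110000u) return cp;  // outside the Unicode range
  const Block& block = blocks[stage1[cp >> kBlockBits]];
  return static_cast<uint32_t>(static_cast<int32_t>(cp) +
                               block[cp & (kBlockSize - 1)]);
}
```

With this toy data, ToLowerTwoStage('A') yields 'a' and ToLowerTwoStage(0xC9) yields 0xE9, while code points outside Latin-1 come back unchanged. Compressing the tables further, as {{unilib}} appears to do, would shrink the data at the cost of extra steps in the lookup.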
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128379#comment-17128379 ]

Antoine Pitrou commented on ARROW-8961:
---------------------------------------

What algorithms would we use in {{utf8proc}}? If it's just tolower() and friends, the implementation seems simple and fast to me (and I doubt other libraries would be significantly faster).
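For context, a tolower()-style string operation boils down to a decode, map, re-encode loop over the UTF-8 bytes. The C++ sketch below shows that loop with a pluggable per-codepoint function. MapUtf8, AppendUtf8, and AsciiToLower are hypothetical names, and AsciiToLower is only an ASCII stand-in; a real build would presumably plug in a library function such as utf8proc_tolower() instead. The decoder is deliberately minimal and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Per-codepoint mapping function (hypothetical interface).
using CodepointMap = uint32_t (*)(uint32_t);

// Stand-in mapping that only handles ASCII letters.
inline uint32_t AsciiToLower(uint32_t cp) {
  return (cp >= 'A' && cp <= 'Z') ? cp + 32 : cp;
}

// Append the UTF-8 encoding of cp (assumed <= U+10FFFF) to out.
inline void AppendUtf8(uint32_t cp, std::string* out) {
  if (cp < 0x80) {
    out->push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Decode each code point of `in`, apply `map`, and re-encode into `out`.
// Returns false on a truncated sequence or an invalid lead/continuation byte.
inline bool MapUtf8(const std::string& in, CodepointMap map, std::string* out) {
  out->clear();
  std::size_t i = 0;
  while (i < in.size()) {
    const uint8_t lead = static_cast<uint8_t>(in[i]);
    uint32_t cp = 0;
    std::size_t len = 0;
    if (lead < 0x80) { cp = lead; len = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
    else return false;  // stray continuation byte or invalid lead byte
    if (i + len > in.size()) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const uint8_t b = static_cast<uint8_t>(in[i + k]);
      if ((b & 0xC0) != 0x80) return false;
      cp = (cp << 6) | (b & 0x3F);
    }
    AppendUtf8(map(cp), out);
    i += len;
  }
  return true;
}
```

The per-codepoint mapping is the only part that depends on Unicode data, which is why the choice of library mostly comes down to the size and speed of its lookup tables.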
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127531#comment-17127531 ]

Uwe Korn commented on ARROW-8961:
---------------------------------

We should definitely run benchmarks, as the utf8proc issue tracker mentions that {{icu}} seems to be significantly faster than {{utf8proc}}. Still, {{icu}} is much fatter than {{utf8proc}}, and we probably need exactly the functionality that is part of {{utf8proc}}, not more.
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118713#comment-17118713 ]

Maarten Breddels commented on ARROW-8961:
-----------------------------------------

FWIW, in Vaex I've relied on [https://github.com/ufal/unilib], which is a very minimal/barebones library. I have no strong opinions about this, though (unless benchmarks tell me otherwise).
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118061#comment-17118061 ]

Wes McKinney commented on ARROW-8961:
-------------------------------------

Ah great. I see that utf8proc includes a 1.5 MB data file, so we shouldn't be too cavalier about vendoring it. If utf8proc is only required when {{-DARROW_COMPUTE=ON}}, then perhaps we can just add it as a normal thirdparty toolchain library.
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117942#comment-17117942 ]

Uwe Korn commented on ARROW-8961:
---------------------------------

It's already there, named {{libutf8proc}}.
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117932#comment-17117932 ]

Wes McKinney commented on ARROW-8961:
-------------------------------------

[~uwe] I would say it would be worth going ahead and adding utf8proc to conda-forge if it is not there already.
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117907#comment-17117907 ]

Antoine Pitrou commented on ARROW-8961:
---------------------------------------

I'll take a look sometime if you don't beat me to it.
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117446#comment-17117446 ]

Uwe Korn commented on ARROW-8961:
---------------------------------

For conda-forge and other distributions that can handle binary dependencies, we want to use the system one. So we definitely need an ARROW_USE_SYSTEM_UTF8PROC option if we vendor.