[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-06-08 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128462#comment-17128462
 ] 

Wes McKinney commented on ARROW-8961:
-

unilib's license (MPL 2.0) isn't ideal, see 
https://www.apache.org/legal/resolved.html#weak-copyleft-licenses. I'd prefer 
to depend on MPL 2.0 libraries only as a last resort.

> [C++] Vendor utf8proc library
> -
>
> Key: ARROW-8961
> URL: https://issues.apache.org/jira/browse/ARROW-8961
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This is a minimal MIT-licensed library for UTF-8 data processing originally 
> developed for use in Julia
> https://github.com/JuliaStrings/utf8proc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-06-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128420#comment-17128420
 ] 

Antoine Pitrou commented on ARROW-8961:
---

I've compiled both libraries:
 * {{utf8proc}} weighs around 300 kB (mostly static data)
 * the weight of {{unilib}} depends on which functionality is being used, as 
it's header only; for example a test executable that uses property lookup and 
conversion, but not codepoint combining weighs around 120 kB



[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-06-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128386#comment-17128386
 ] 

Antoine Pitrou commented on ARROW-8961:
---

Also, {{unilib}} uses a similar lookup scheme, so it's unlikely to be 
significantly faster (it's actually a bit more complicated, because it seems to 
compress the data tables more, at the expense of a slightly more complicated 
lookup).

A concern about {{unilib}}, though, would be that it has had a single 
contributor over its 6 years of existence.
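
The thread does not spell out the lookup scheme being compared. One common way to store per-codepoint data compactly is a two-stage (block-indexed) table, sketched below as an assumption rather than as what either library actually does. The names {{ToyCaseTables}}, {{MakeToyCaseTables}} and {{ToyToLower}} are hypothetical, and the table deliberately covers only ASCII and Latin-1 capitals; real tables are generated from the Unicode Character Database.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 256;  // codepoints per block
constexpr std::size_t kNumBlocks = 4;    // toy range: U+0000..U+03FF

// Stage 1 maps a block number to an index into stage 2.
// Stage 2 holds deduplicated blocks of "delta to lowercase" values, so all
// blocks without any mapping share a single all-zero block (index 0).
struct ToyCaseTables {
  std::array<std::uint8_t, kNumBlocks> stage1;
  std::array<std::array<std::int8_t, kBlockSize>, 2> stage2;
};

inline ToyCaseTables MakeToyCaseTables() {
  ToyCaseTables t{};  // zero-initialised: every codepoint maps to itself
  // Populate block 1 with ASCII 'A'..'Z' and the Latin-1 capitals
  // U+00C0..U+00DE (U+00D7 is the multiplication sign and has no case).
  for (std::uint32_t c = 'A'; c <= 'Z'; ++c) t.stage2[1][c] = 32;
  for (std::uint32_t c = 0xC0; c <= 0xDE; ++c) {
    if (c != 0xD7) t.stage2[1][c] = 32;
  }
  t.stage1[0] = 1;  // only U+0000..U+00FF uses the populated block
  return t;
}

// Two array reads per codepoint: block index first, then the delta.
inline std::uint32_t ToyToLower(const ToyCaseTables& t, std::uint32_t cp) {
  if (cp >= kNumBlocks * kBlockSize) return cp;  // outside the toy range
  const std::uint8_t block = t.stage1[cp / kBlockSize];
  assert(block < t.stage2.size());
  const std::int32_t delta = t.stage2[block][cp % kBlockSize];
  return static_cast<std::uint32_t>(static_cast<std::int32_t>(cp) + delta);
}
```

Compressing the tables further (smaller blocks, more stages, or packed deltas) trades a smaller binary for extra work on each lookup, which is the trade-off described above.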



[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-06-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128379#comment-17128379
 ] 

Antoine Pitrou commented on ARROW-8961:
---

What algorithms would we use in {{utf8proc}}? If it's just tolower() and 
friends, the implementation seems simple and fast to me (and I doubt other 
libraries would be significantly faster).
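
To make "tolower() and friends" concrete, here is a minimal, self-contained sketch of lowercasing a UTF-8 string: decode each codepoint, map it, and re-encode it. It is not utf8proc's code. {{ToyLowerCodepoint}} is a hypothetical stand-in for a per-codepoint function such as {{utf8proc_tolower()}} and only handles ASCII and Latin-1 capitals; the input is assumed to be valid UTF-8, and one-to-many special case mappings are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a real per-codepoint mapping.
inline std::uint32_t ToyLowerCodepoint(std::uint32_t cp) {
  if (cp >= 'A' && cp <= 'Z') return cp + 32;
  if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 32;
  return cp;
}

// Decode one codepoint starting at s[i] and advance i past it.
// No validation: the input is assumed to be well-formed UTF-8.
inline std::uint32_t DecodeOne(const std::string& s, std::size_t& i) {
  const auto b0 = static_cast<std::uint8_t>(s[i++]);
  if (b0 < 0x80) return b0;
  int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
  std::uint32_t cp = b0 & (0x3F >> extra);  // payload bits of the lead byte
  while (extra-- > 0) {
    assert(i < s.size());
    cp = (cp << 6) | (static_cast<std::uint8_t>(s[i++]) & 0x3F);
  }
  return cp;
}

// Append the UTF-8 encoding of cp to out.
inline void EncodeOne(std::uint32_t cp, std::string& out) {
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Lowercase a UTF-8 string codepoint by codepoint. Re-encoding each result
// keeps this correct even if a mapping changes the encoded byte length.
inline std::string Utf8ToLower(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size();) {
    EncodeOne(ToyLowerCodepoint(DecodeOne(in, i)), out);
  }
  return out;
}
```

With a real library, only the per-codepoint mapping would change; the cost per character is then dominated by that lookup.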



[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-06-07 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127531#comment-17127531
 ] 

Uwe Korn commented on ARROW-8961:
-

We should definitely run benchmarks, as the utf8proc issue tracker mentions 
that {{icu}} seems to be significantly faster than {{utf8proc}}. Still, 
{{icu}} is much fatter than {{utf8proc}}, and we probably need exactly the 
functionality that is part of {{utf8proc}}, not more.



[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-28 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118713#comment-17118713
 ] 

Maarten Breddels commented on ARROW-8961:
-

FWIW, in Vaex I've relied on [https://github.com/ufal/unilib], which is a very 
minimal/barebones library. I have no strong opinions about this, though (unless 
benchmarks tell me otherwise).



[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118061#comment-17118061
 ] 

Wes McKinney commented on ARROW-8961:
-

Ah great. I see that utf8proc includes a 1.5 MB data file, so we shouldn't be 
too cavalier about vendoring it. If utf8proc is only required when 
{{-DARROW_COMPUTE=ON}}, then perhaps we can just add it as a normal third-party 
toolchain library.



[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-27 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117942#comment-17117942
 ] 

Uwe Korn commented on ARROW-8961:
-

It's already there, named {{libutf8proc}}.



[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117932#comment-17117932
 ] 

Wes McKinney commented on ARROW-8961:
-

[~uwe] I would say it would be worth going ahead and adding utf8proc to 
conda-forge if it is not there already. 



[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117907#comment-17117907
 ] 

Antoine Pitrou commented on ARROW-8961:
---

I'll take a look sometime if you don't beat me to it.



[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-27 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117446#comment-17117446
 ] 

Uwe Korn commented on ARROW-8961:
-

For conda-forge and other distributions that can handle binary dependencies, we 
want to use the system one. So we definitely need an 
ARROW_USE_SYSTEM_UTF8PROC option if we vendor.
