[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-30 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651717472


   Phew. It worked. RTools 4.0 is still broken, but there doesn't seem to be 
anything we can do, except perhaps disable that job. I'm gonna merge and leave 
the R cleanup to someone else.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651366993


   > The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9
   
   What difference does it make? This is plain C.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651353338


   > This means there also needs to be a PKGBUILD 
   
   Why? `libutf8proc` is installed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651338264


   @xhochy Could you help on the utf8proc issue on RTools 3.5?
   See here: 
https://github.com/apache/arrow/pull/7449/checks?check_run_id=819772618#step:10:169
   
   It seems that `UTF8PROC_STATIC` would need to be defined when building 
Arrow. But it's not set by `Findutf8proc.cmake`.
   Also, `libutf8proc.pc.in` added in 
https://github.com/r-windows/rtools-packages/pull/124 doesn't set it either.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651316656


   I pushed a commit that raises an error on invalid UTF8. It does not seem to 
make the benchmarks slower.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651282959


   Main point remaining is whether we raise an error on invalid UTF8 input. I 
see no reason not too (an Arrow string array has to be valid UTF8 as per the 
spec, just like a Python unicode string cannot contain characters outside of 
the unicode code range).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651282415


   > Having a benchmark run on non-ascii codepoints (I think we want to do this 
separate from this PR, but important point).
   
   Yes, I think we can defer that to a separate PR.
   
   > The existing decoder based on 
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ was new to me. Very interesting 
work, but unfortunately led to a performance regression (~50->30 M/s), which 
I'm surprised about actually. Maybe worth comparing again when we have a 
benchmark with non-ascii codepoints.
   
   Yes, too. The main point of this state-machine-based decoder is that it's 
branchless, and so it will perform roughly as well on non-Ascii data with 
unpredictable branching. On pure Ascii data, a branch-based decoder may be 
faster since the branches will always be predicted right.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-27 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-650606456


   @wesm I can also take this since you already have quite a bit on your plate.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-645389395


   I think it would be more acceptable to inline the relevant utf8proc 
functions.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-645382201


   Let's step back a bit: why do we care about micro-optimizing this?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-645278822


   I don't know how important it is to get good performance on non-ASCII data. 
Note that the ASCII fast path could well be applied to subsets of the array.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org