[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-30 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651796902 You're welcome. Thanks all for your help. Impressed by the project, setup (CI/CMake), and people, and happy with the results:

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651322087 I just concluded the same :) This is an automated message from the Apache Git Service. To respond to the

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651289874 @pitrou your size commit made the benchmark go from `52->60 M/s`  > Yes, too. The main point of this state-machine-based decoder is that it's branchless, and so

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651257793 We still have 2 failures, one might need a restart (travis / no output), the other is still a linker error: ```

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651165626 @pitrou many thanks for the review. I've implemented all you suggestions except: * Raising an error on invalid utf8 data (see comment) * Having a benchmark run on

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-647711867 @xhochy I see at least an `undefined reference to `_imp__utf8proc_toupper'` in the MinGW builds, hope that helps.

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-647572424 Validating the utf8 string made the results slightly slower, but still much better then the initial results. Invalid utf8 characters are now replaced by a '?', as

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-647429536 > I'd like to treat in-kernel type promotions as an anti-pattern in general. There are upsides and downsides to it. The downside is that users of the Arrow library

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-19 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-646605458 Note that the unitests should fail, the PR isn't done, but the tests seem to run  This is an automated

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645546979 I've added my own utf encode/decode for now. With lookup tables I now get: ``` Utf8Lower_median18414820 ns 18408392 ns3

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645388555 I want to move Vaex from using its own string functions to using Arrow. If I make all functions at least faster, I'm more than happy to scrap my own code. I don't like to

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645381472 Would a lookup table in the order of 256kb (generated at runtime, not in the binary) per case mapping be acceptable for Arrow?

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645276441 I used valgrind/callgrind to see how where time spent: ![image](https://user-images.githubusercontent.com/1765949/84880814-57509e80-b08d-11ea-9563-f711986f3964.png)

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-16 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-644949978 It's not *that* slow, it was 40% of Vaex' performance (single threaded), so I think there is a bit more to be gained still. But I have added an optimization that tries