[GitHub] [arrow] houqp commented on pull request #7466: ARROW-9158: [Rust][Datafusion] projection physical plan compilation should preserve nullability
houqp commented on pull request #7466: URL: https://github.com/apache/arrow/pull/7466#issuecomment-645187978 Since this PR added a test that creates mutable execution context, it needs to be rebased and updated after https://github.com/apache/arrow/pull/7464 get merged. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7466: ARROW-9158: [Rust][Datafusion] projection physical plan compilation should preserve nullability
github-actions[bot] commented on pull request #7466: URL: https://github.com/apache/arrow/pull/7466#issuecomment-645185474 https://issues.apache.org/jira/browse/ARROW-9158 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] houqp opened a new pull request #7466: ARROW-9158: [Rust][Datafusion] projection physical plan compilation should preserve nullability
houqp opened a new pull request #7466: URL: https://github.com/apache/arrow/pull/7466 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7465: PARQUET-1877: [C++] Reconcile thrift limits
github-actions[bot] commented on pull request #7465: URL: https://github.com/apache/arrow/pull/7465#issuecomment-645152229 https://issues.apache.org/jira/browse/PARQUET-1877 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield edited a comment on pull request #7465: PARQUET-1877: [C++] Reconcile thrift limits
emkornfield edited a comment on pull request #7465: URL: https://github.com/apache/arrow/pull/7465#issuecomment-645150494 CC @wesm @pitrou I would assume 1MM elements is still sufficient for any reasonable parquet file, but let me know if you think differently. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #7465: PARQUET-1877: [C++] Reconcile thrift limits
emkornfield commented on pull request #7465: URL: https://github.com/apache/arrow/pull/7465#issuecomment-645150494 CC @wesm @pitrou This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield opened a new pull request #7465: PARQUET-1877: [C++] Reconcile thrift limits
emkornfield opened a new pull request #7465: URL: https://github.com/apache/arrow/pull/7465 Sets container size limit to have an upper bound memory footprint at the same order of magnitude as string size limits. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7464: ARROW-9157: [Rust][Datafusion] create_physical_plan should take self as immutable reference
github-actions[bot] commented on pull request #7464: URL: https://github.com/apache/arrow/pull/7464#issuecomment-645143002 https://issues.apache.org/jira/browse/ARROW-9157 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] dhirschfeld commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
dhirschfeld commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645141127 > Not to beat a dead horse about ARROW-9155 The bot is fine - I guess it links to whatever JIRA is listed in the title. That doesn't help if someone mentions a JIRA in a comment. If the autolink was configured for this repo that reference would have *automatically* been converted to [`ARROW-9155`](https://issues.apache.org/jira/browse/ARROW-9155). It's just a *very* minor thing, but it does help the DX. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] houqp opened a new pull request #7464: ARROW-9157: [Rust][Datafusion] create_physical_plan should take self as immutable reference
houqp opened a new pull request #7464: URL: https://github.com/apache/arrow/pull/7464 Since it's not mutating self, mutable reference is not necessary. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
ursabot commented on pull request #7314: URL: https://github.com/apache/arrow/pull/7314#issuecomment-645135876 [AMD64 Ubuntu 18.04 C++ Benchmark (#112762)](https://ci.ursalabs.org/#builders/73/builds/79) builder has been succeeded. Revision: 525caea882fe49c0248932fff77df6bcd3f2f477 ```diff === = === benchmarkbaseline contenderchange === = === SumKernelFloat/32768/1 5.958 GiB/sec 10.208 GiB/sec 71.337% SumKernelInt64/32768/1 7.938 GiB/sec 10.783 GiB/sec 35.851% SumKernelInt32/32768/0 4.958 GiB/sec 10.832 GiB/sec 118.486% SumKernelInt64/32768/0 9.704 GiB/sec 14.764 GiB/sec 52.146% SumKernelFloat/32768/2 2.389 GiB/sec 5.767 GiB/sec141.401% - SumKernelInt16/32768/2 2.953 GiB/sec 1.696 GiB/sec-42.565% SumKernelDouble/32768/0 9.462 GiB/sec 13.540 GiB/sec 43.102% SumKernelInt32/32768/2 5.074 GiB/sec 6.403 GiB/sec26.201% - SumKernelInt16/32768/1 4.520 GiB/sec 1.693 GiB/sec-62.552% SumKernelFloat/32768/100 4.644 GiB/sec 5.807 GiB/sec25.051% SumKernelInt64/32768/1 9.778 GiB/sec 14.784 GiB/sec 51.203% SumKernelDouble/32768/1 7.595 GiB/sec 10.131 GiB/sec 33.400% SumKernelFloat/32768/1 3.920 GiB/sec 5.938 GiB/sec51.490% SumKernelInt64/32768/10 8.461 GiB/sec 10.548 GiB/sec 24.655% - SumKernelInt8/32768/10 1.680 GiB/sec 674.552 MiB/sec -60.783% SumKernelDouble/32768/10 7.748 GiB/sec 9.813 GiB/sec26.645% - SumKernelInt8/32768/11.643 GiB/sec 678.806 MiB/sec -59.643% SumKernelInt8/32768/01.745 GiB/sec 4.709 GiB/sec169.817% SumKernelInt32/32768/1 4.959 GiB/sec 10.884 GiB/sec 119.497% SumKernelInt64/32768/100 9.294 GiB/sec 10.545 GiB/sec 13.466% - SumKernelInt8/32768/100 2.242 GiB/sec 674.145 MiB/sec -70.638% SumKernelInt64/32768/2 7.674 GiB/sec 10.553 GiB/sec 37.529% SumKernelInt16/32768/0 2.850 GiB/sec 7.599 GiB/sec166.670% SumKernelInt32/32768/10 5.754 GiB/sec 6.423 GiB/sec11.627% SumKernelInt32/32768/1 5.213 GiB/sec 6.577 GiB/sec26.172% SumKernelDouble/32768/1008.731 GiB/sec 9.829 GiB/sec12.578% SumKernelFloat/32768/0 5.960 GiB/sec 10.156 GiB/sec 70.394% - SumKernelInt16/32768/100 4.099 GiB/sec 1.694 GiB/sec-58.674% - SumKernelInt8/32768/12.580 GiB/sec 674.639 MiB/sec -74.465% SumKernelDouble/32768/2 6.864 GiB/sec 9.912 GiB/sec44.416% SumKernelInt32/32768/100 6.731 GiB/sec 6.439 GiB/sec-4.334% - SumKernelInt8/32768/21.604 GiB/sec 674.265 MiB/sec -58.959% - SumKernelInt16/32768/10 3.491 GiB/sec 1.694 GiB/sec-51.473% SumKernelFloat/32768/10 3.924 GiB/sec 5.794 GiB/sec47.673% SumKernelDouble/32768/1 9.531 GiB/sec 13.404 GiB/sec 40.636% - SumKernelInt16/32768/1 3.021 GiB/sec 1.710 GiB/sec-43.392% === = === ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
jianxind commented on pull request #7314: URL: https://github.com/apache/arrow/pull/7314#issuecomment-645130206 @ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645129920 Well my theory about greater/less didn't hold. The other relevant change was moving things into the anonymous namespace. It's possible that anonymous namespaces impact inlining somehow This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
ursabot commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645120596 [AMD64 Ubuntu 18.04 C++ Benchmark (#112729)](https://ci.ursalabs.org/#builders/73/builds/78) builder has been succeeded. Revision: 74caaae25e3bd95d57f3f6d9b835c2610639ab41 ```diff == == benchmark baselinecontender change == == GreaterArrayScalarString/32768/2 130.267m items/sec 224.418m items/sec 72.276% GreaterArrayScalarString/32768/10 261.021m items/sec 498.491m items/sec 90.977% - GreaterArrayArrayInt64/32768/2361.493m items/sec 335.383m items/sec -7.223% GreaterArrayScalarInt64/32768/10 565.966m items/sec 690.931m items/sec 22.080% GreaterArrayScalarInt64/32768/100 555.331m items/sec 668.238m items/sec 20.331% - GreaterArrayArrayInt64/32768/1364.755m items/sec 338.493m items/sec -7.200% - GreaterArrayArrayInt64/32768/1362.806m items/sec 336.071m items/sec -7.369% GreaterArrayScalarInt64/32768/1 573.242m items/sec 704.449m items/sec 22.889% GreaterArrayArrayString/32768/1 54.921m items/sec 73.848m items/sec 34.463% GreaterArrayArrayString/32768/2 52.950m items/sec 92.643m items/sec 74.964% GreaterArrayScalarString/32768/1 361.437m items/sec 653.553m items/sec 80.821% GreaterArrayScalarInt64/32768/0 555.044m items/sec 701.556m items/sec 26.396% GreaterArrayScalarInt64/32768/1 575.762m items/sec 682.164m items/sec 18.480% - GreaterArrayArrayInt64/32768/100 361.216m items/sec 338.015m items/sec -6.423% GreaterArrayScalarString/32768/0 339.144m items/sec 707.998m items/sec 108.760% GreaterArrayArrayString/32768/1 127.597m items/sec 193.087m items/sec 51.326% GreaterArrayArrayString/32768/10 50.585m items/sec 69.612m items/sec 37.613% GreaterArrayArrayString/32768/100 54.563m items/sec 73.054m items/sec 33.889% GreaterArrayScalarInt64/32768/2 558.941m items/sec 697.482m items/sec 24.786% GreaterArrayScalarString/32768/100331.214m items/sec 669.210m items/sec 102.048% - GreaterArrayArrayInt64/32768/10 361.791m items/sec 336.834m items/sec -6.898% GreaterArrayArrayString/32768/0 54.614m items/sec 73.456m items/sec 34.499% GreaterArrayScalarString/32768/1 342.276m items/sec 705.190m items/sec 106.030% - GreaterArrayArrayInt64/32768/0363.455m items/sec 336.738m items/sec -7.351% == == ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7463: ARROW-9145: [C++] Implement BooleanArray::true_count and false_count, add Python bindings
github-actions[bot] commented on pull request #7463: URL: https://github.com/apache/arrow/pull/7463#issuecomment-645116913 https://issues.apache.org/jira/browse/ARROW-9145 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645116862 Well we have these bot comments, is it not sufficient? https://github.com/apache/arrow/pull/7461#issuecomment-645086851 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7462: ARROW-7068: [C++] Add ListArray::offsets and LargeListArray::offsets returning boxed version of offsets as Int32Array/Int64Array
wesm commented on pull request #7462: URL: https://github.com/apache/arrow/pull/7462#issuecomment-645116463 I'm a bit stumped on the MinGW failure ``` [ 64%] Linking CXX executable ../../release/arrow-array-test.exe C:/msys64/mingw32/bin/../lib/gcc/i686-w64-mingw32/10.1.0/../../../../i686-w64-mingw32/bin/ld.exe: CMakeFiles/arrow-array-test.dir/objects.a(unity_0_cxx.cxx.obj):unity_0_cxx.cxx:(.text$_ZN5arrow13TestListArrayINS_8ListTypeEE22ValidateBasicListArrayEPKNS_9ListArrayERKSt6vectorIsSaIsEERKS6_IhSaIhEE[__ZN5arrow13TestListArrayINS_8ListTypeEE22ValidateBasicListArrayEPKNS_9ListArrayERKSt6vectorIsSaIsEERKS6_IhSaIhEE]+0x803): undefined reference to `_imp___ZN5arrow6detail9CTypeImplINS_9Int32TypeENS_11IntegerTypeELNS_4Type4typeE7EiE7type_idE' C:/msys64/mingw32/bin/../lib/gcc/i686-w64-mingw32/10.1.0/../../../../i686-w64-mingw32/bin/ld.exe: CMakeFiles/arrow-array-test.dir/objects.a(unity_0_cxx.cxx.obj):unity_0_cxx.cxx:(.text$_ZN5arrow13TestListArrayINS_13LargeListTypeEE22ValidateBasicListArrayEPKNS_14LargeListArrayERKSt6vectorIsSaIsEERKS6_IhSaIhEE[__ZN5arrow13TestListArrayINS_13LargeListTypeEE22ValidateBasicListArrayEPKNS_14LargeListArrayERKSt6vectorIsSaIsEERKS6_IhSaIhEE]+0x83c): undefined reference to `_imp___ZN5arrow6detail9CTypeImplINS_9Int64TypeENS_11IntegerTypeELNS_4Type4typeE9ExE7type_idE' collect2.exe: error: ld returned 1 exit status make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:118: release/arrow-array-test.exe] Error 1 make[1]: *** [CMakeFiles/Makefile2:1680: src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2 make[1]: *** Waiting for unfinished jobs ``` I'm reimplementing the methods in a way that hopefully won't trigger this mess. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] dhirschfeld commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
dhirschfeld commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645115775 To help follow along it would be handy if references to JIRA could be autolinked: https://help.github.com/en/github/administering-a-repository/configuring-autolinks-to-reference-external-resources That would save the lurker from having to search for it or the poster having to bother about pasting an actual URL. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645113493 Not to beat a dead horse about ARROW-9155, but the turnaround time for simple benchmarks isn't great This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645113285 @ursabot benchmark --benchmark-filter=Greater 18e559b This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645111941 Ah! It's because Greater is not implemented using Less. Let me switch things around This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm edited a comment on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm edited a comment on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645111941 Ah! It's because Greater is now implemented using Less. Let me switch things around so things are based on Greater/GreaterEqual instead This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
ursabot commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645111207 [AMD64 Ubuntu 18.04 C++ Benchmark (#112703)](https://ci.ursalabs.org/#builders/73/builds/77) builder has been succeeded. Revision: 301ffa539e634f2c464ca072cd5c543f1407f1f7 ```diff == == benchmark baselinecontender change == == GreaterArrayArrayString/32768/2 53.694m items/sec 93.892m items/sec 74.864% GreaterArrayScalarInt64/32768/2 616.025m items/sec 681.971m items/sec 10.705% GreaterArrayScalarString/32768/2 132.267m items/sec 216.365m items/sec 63.583% - GreaterArrayArrayInt64/32768/1365.572m items/sec 339.713m items/sec -7.074% GreaterArrayArrayString/32768/100 53.635m items/sec 74.222m items/sec 38.384% - GreaterArrayArrayInt64/32768/100 365.263m items/sec 339.865m items/sec -6.953% GreaterArrayScalarString/32768/1 341.705m items/sec 547.337m items/sec 60.178% - GreaterArrayArrayInt64/32768/1365.608m items/sec 340.598m items/sec -6.841% GreaterArrayArrayString/32768/0 55.046m items/sec 74.164m items/sec 34.733% GreaterArrayScalarInt64/32768/1 551.861m items/sec 679.554m items/sec 23.139% GreaterArrayScalarString/32768/1 360.301m items/sec 724.313m items/sec 101.030% GreaterArrayScalarInt64/32768/10 618.148m items/sec 669.015m items/sec 8.229% - GreaterArrayArrayInt64/32768/2365.552m items/sec 340.107m items/sec -6.961% GreaterArrayScalarString/32768/100332.135m items/sec 529.815m items/sec 59.518% - GreaterArrayArrayInt64/32768/0367.549m items/sec 341.912m items/sec -6.975% - GreaterArrayArrayInt64/32768/10 365.734m items/sec 339.651m items/sec -7.131% GreaterArrayScalarInt64/32768/100 604.015m items/sec 667.332m items/sec 10.483% GreaterArrayScalarString/32768/10 258.597m items/sec 411.304m items/sec 59.052% GreaterArrayScalarInt64/32768/0 622.436m items/sec 671.837m items/sec 7.937% GreaterArrayArrayString/32768/1 130.420m items/sec 196.548m items/sec 50.704% GreaterArrayScalarInt64/32768/1 622.602m items/sec 699.200m items/sec 12.303% GreaterArrayArrayString/32768/10 51.206m items/sec 70.119m items/sec 36.935% GreaterArrayScalarString/32768/0 337.556m items/sec 549.035m items/sec 62.650% GreaterArrayArrayString/32768/1 54.218m items/sec 74.885m items/sec 38.117% == == ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7463: ARROW-9145: [C++] Implement BooleanArray::true_count and false_count, add Python bindings
wesm commented on pull request #7463: URL: https://github.com/apache/arrow/pull/7463#issuecomment-645110679 FWIW `BooleanArray::true_count()` should be what's used for the `sum(boolean)` kernel This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm opened a new pull request #7463: ARROW-9145: [C++] Implement BooleanArray::true_count and false_count, add Python bindings
wesm opened a new pull request #7463: URL: https://github.com/apache/arrow/pull/7463 This seemed like a reasonable place to put this, and it seems like it may come in handy. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on a change in pull request #7461: URL: https://github.com/apache/arrow/pull/7461#discussion_r441240680 ## File path: cpp/src/arrow/compute/kernels/codegen_internal.h ## @@ -121,18 +123,34 @@ struct ArrayIterator> { template struct ArrayIterator> { - int64_t position = 0; - typename TypeTraits::ArrayType arr; - explicit ArrayIterator(const ArrayData& data) : arr(data.Copy()) {} - util::string_view operator()() { return arr.GetView(position++); } + using offset_type = typename Type::offset_type; + const ArrayData& arr; + const offset_type* offsets; + offset_type cur_offset; + const char* data; + int64_t position; + explicit ArrayIterator(const ArrayData& arr) + : arr(arr), +offsets(reinterpret_cast(arr.buffers[1]->data()) + +arr.offset), +cur_offset(offsets[0]), +data(reinterpret_cast(arr.buffers[2]->data())), +position(0) {} + + util::string_view operator()() { +offset_type next_offset = offsets[position++ + 1]; +auto result = util::string_view(data + cur_offset, next_offset - cur_offset); +cur_offset = next_offset; +return result; + } Review comment: @pitrou anecdotally it appears this is meaningfully faster than the prior version which used `GetView` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645104527 @ursabot benchmark --benchmark-filter=Greater 18e559b This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on pull request #7287: ARROW-8771: [C++] Add boost/process library to build support
liyafan82 commented on pull request #7287: URL: https://github.com/apache/arrow/pull/7287#issuecomment-645103962 > @liyafan82 I rebuilt the boost bundle and uploaded to bintray. Can you re-run whichever tests you have that failed because of this before and see if they work now? If they're good now, then we can merge this. @nealrichardson Thanks for your effort, I will re-run the build today. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645102793 There weren't any C++ unit tests for comparisons of primitive types so I addressed that, and also added comparisons for Time and Duration types (which on account of this patch are basically "free") This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7462: ARROW-7068: [C++] Add ListArray::offsets and LargeListArray::offsets returning boxed version of offsets as Int32Array/Int64Array
github-actions[bot] commented on pull request #7462: URL: https://github.com/apache/arrow/pull/7462#issuecomment-645098523 https://issues.apache.org/jira/browse/ARROW-7068 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm opened a new pull request #7462: ARROW-7068: [C++] Add ListArray::offsets and LargeListArray::offsets returning boxed version of offsets as Int32Array/Int64Array
wesm opened a new pull request #7462: URL: https://github.com/apache/arrow/pull/7462 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
ursabot commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645087761 [AMD64 Ubuntu 18.04 C++ Benchmark (#112653)](https://ci.ursalabs.org/#builders/73/builds/76) builder has been succeeded. Revision: 53671af32c338fcca1edc40732c4c5fd1ad7585e ```diff == == benchmark baselinecontender change == == GreaterArrayArrayString/32768/2 53.657m items/sec 94.257m items/sec 75.665% - GreaterArrayArrayInt64/32768/2363.591m items/sec 337.022m items/sec -7.307% GreaterArrayScalarInt64/32768/100 598.659m items/sec 701.841m items/sec 17.236% GreaterArrayScalarString/32768/0 341.707m items/sec 463.243m items/sec 35.567% GreaterArrayArrayString/32768/100 54.363m items/sec 75.156m items/sec 38.249% - GreaterArrayArrayInt64/32768/1364.615m items/sec 337.861m items/sec -7.338% - GreaterArrayArrayInt64/32768/0365.779m items/sec 338.697m items/sec -7.404% GreaterArrayArrayString/32768/1 54.734m items/sec 75.788m items/sec 38.465% GreaterArrayScalarString/32768/1 364.271m items/sec 569.761m items/sec 56.411% GreaterArrayScalarInt64/32768/2 596.674m items/sec 681.189m items/sec 14.164% - GreaterArrayArrayInt64/32768/1364.433m items/sec 338.039m items/sec -7.242% GreaterArrayArrayString/32768/1 127.275m items/sec 195.144m items/sec 53.325% GreaterArrayScalarInt64/32768/10 614.042m items/sec 697.835m items/sec 13.646% GreaterArrayScalarInt64/32768/1 613.585m items/sec 692.598m items/sec 12.877% - GreaterArrayArrayInt64/32768/100 360.385m items/sec 337.343m items/sec -6.394% GreaterArrayScalarString/32768/100331.392m items/sec 447.519m items/sec 35.042% GreaterArrayScalarInt64/32768/0 624.860m items/sec 705.244m items/sec 12.864% - GreaterArrayArrayInt64/32768/10 363.073m items/sec 337.156m items/sec -7.138% GreaterArrayScalarInt64/32768/1 608.171m items/sec 678.010m items/sec 11.483% GreaterArrayScalarString/32768/10 256.765m items/sec 352.403m items/sec 37.247% GreaterArrayScalarString/32768/1 335.759m items/sec 461.788m items/sec 37.535% GreaterArrayArrayString/32768/0 54.130m items/sec 75.873m items/sec 40.168% GreaterArrayScalarString/32768/2 130.823m items/sec 194.535m items/sec 48.701% GreaterArrayArrayString/32768/10 51.190m items/sec 71.220m items/sec 39.127% == == ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
github-actions[bot] commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645086851 https://issues.apache.org/jira/browse/ARROW-8969 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou closed pull request #7430: ARROW-9126: [C++] Fix building trimmed Boost bundle on Windows
kou closed pull request #7430: URL: https://github.com/apache/arrow/pull/7430 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou commented on pull request #7430: ARROW-9126: [C++] Fix building trimmed Boost bundle on Windows
kou commented on pull request #7430: URL: https://github.com/apache/arrow/pull/7430#issuecomment-645084134 > Done. Thanks! Now, I'm a member of https://bintray.com/ursalabs . (Please ignore my meaningless join request...) > We could I guess, though I don't think I have sufficient permissions on the org to do that. You can join https://bintray.com/apache/arrow/ by requesting it to INFRA like https://issues.apache.org/jira/browse/INFRA-18698 . This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645083255 @ursabot benchmark --benchmark-filter=Greater 18e559b This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645083142 @ursabot benchmark --benchmark_filter=Greater 18e559b This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
ursabot commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645083146 ``` no such option: --benchmark_filter ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm opened a new pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm opened a new pull request #7461: URL: https://github.com/apache/arrow/pull/7461 With clang-8 on Linux this unit is now 654KB down from 1257KB. A few strategies: * Use same binary code for Less/Greater and LessEqual/GreaterEqual with arguments flipped * Reuse kernels for primitive types represented as integers * Reuse kernels for binary/string and largebinary/largestring On the latter matter, I had to rewrite some of the codegen_internal.h routines where they assumed they would receive the specific logical type as the compile-time parameter (possibly causing a conflict when trying to put a BinaryArray box around a StringArray's ArrayData). This makes kernel execution against string data faster in the places where `ArrayIterator` is used. Will run benchmarks As an aside, in general these kernel code generators will likely have lower and lower level code over time. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7460: ARROW-9154: [Developer] Use GitHub issue templates better
github-actions[bot] commented on pull request #7460: URL: https://github.com/apache/arrow/pull/7460#issuecomment-645058491 https://issues.apache.org/jira/browse/ARROW-9154 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson opened a new pull request #7460: ARROW-9154: [Developer] Use GitHub issue templates better
nealrichardson opened a new pull request #7460: URL: https://github.com/apache/arrow/pull/7460 To check it out on my fork, go to https://github.com/nealrichardson/arrow/issues and click New Issue This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7459: ARROW-6800: [C++] Support building libraries targeting C++14 or higher, disable GNU CXX extensions
github-actions[bot] commented on pull request #7459: URL: https://github.com/apache/arrow/pull/7459#issuecomment-645049497 https://issues.apache.org/jira/browse/ARROW-6800 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7459: ARROW-6800: [C++] Support building libraries targeting C++14 or higher, disable GNU CXX extensions
wesm commented on pull request #7459: URL: https://github.com/apache/arrow/pull/7459#issuecomment-645049112 OK I guess I'm a bit out of my depth on the CMAKE_CXX_EXTENSIONS issue. If anyone has ideas about what to do let me know This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7459: ARROW-6800: [C++] Support building libraries targeting C++14 or higher, disable GNU CXX extensions
wesm commented on pull request #7459: URL: https://github.com/apache/arrow/pull/7459#issuecomment-645047309 Well this is weird ``` CMake Error in src/plasma/CMakeLists.txt: Target "plasma-external-store-tests" requires the language dialect "CXXOFF" , but CMake does not know the compile flags to use to enable it. CMake Error in src/plasma/CMakeLists.txt: Target "plasma-client-tests" requires the language dialect "CXXOFF" , but CMake does not know the compile flags to use to enable it. ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7459: ARROW-6800: [C++] Support building libraries targeting C++14 or higher, disable GNU CXX extensions
wesm commented on pull request #7459: URL: https://github.com/apache/arrow/pull/7459#issuecomment-645047223 @kou @brills @kszucs could use your advice on possible concerns regarding the -std=gnu++11 to -std=c++11 change when using gcc This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm opened a new pull request #7459: ARROW-6800: [C++] Support building libraries targeting C++14 or higher, disable GNU CXX extensions
wesm opened a new pull request #7459: URL: https://github.com/apache/arrow/pull/7459 This seemed pretty simple. The C++ standard targeted by default is still C++11 but some users might want to target a higher standard (e.g. to get C++17 `std::string_view`). I also disabled use of `-std=gnu++11`. Not sure what are the implications of this but the test suite seemed fine locally. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #7455: ARROW-9031: [R] Implement conversion from Type::UINT64 to R vector
nealrichardson closed pull request #7455: URL: https://github.com/apache/arrow/pull/7455 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7455: ARROW-9031: [R] Implement conversion from Type::UINT64 to R vector
nealrichardson commented on pull request #7455: URL: https://github.com/apache/arrow/pull/7455#issuecomment-645045336 Deferring to https://jira.apache.org/jira/browse/ARROW-9083 questions of whether there are better ways to convert. This at least no longer errors and gives something usable. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson edited a comment on pull request #7287: ARROW-8771: [C++] Add boost/process library to build support
nealrichardson edited a comment on pull request #7287: URL: https://github.com/apache/arrow/pull/7287#issuecomment-645042129 @liyafan82 I rebuilt the boost bundle and uploaded to bintray. Can you re-run whichever tests you have that failed because of this before and see if they work now? If they're good now, then we can merge this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #7287: ARROW-8771: [C++] Add boost/process library to build support
nealrichardson commented on a change in pull request #7287: URL: https://github.com/apache/arrow/pull/7287#discussion_r441176530 ## File path: cpp/build-support/trim-boost.sh ## @@ -32,12 +32,12 @@ set -eu : ${BOOST_URL:=https://dl.bintray.com/boostorg/release/${BOOST_VERSION}/source/${BOOST_FILE}.tar.gz} # Arrow tests require these -BOOST_LIBS="system.hpp filesystem.hpp" +BOOST_LIBS="process.hpp filesystem.hpp" Review comment: I wasn't sure if it was safe to remove system.hpp because it is named explicitly in cmake and don't have time to track this down right now, so leaving a note to future self. ```suggestion # TODO: is system still required? We declare `--with-libraries=filesystem,regex,system` # in cmake, but there is no #include in our source anymore BOOST_LIBS="system.hpp process.hpp filesystem.hpp" ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7287: ARROW-8771: [C++] Add boost/process library to build support
nealrichardson commented on pull request #7287: URL: https://github.com/apache/arrow/pull/7287#issuecomment-645042129 @liyafan82 I rebuilt the boost bundle and uploaded to bintray. Can you re-run whichever tests you have that failed because of this before and see if they work now? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou commented on a change in pull request #7436: ARROW-9094: [Python] Bump versions of compiled dependencies in manylinux wheels
kou commented on a change in pull request #7436: URL: https://github.com/apache/arrow/pull/7436#discussion_r441174856 ## File path: python/manylinux1/scripts/build_boost.sh ## @@ -16,12 +16,12 @@ # specific language governing permissions and limitations # under the License. -BOOST_VERSION=1.68.0 +BOOST_VERSION=1.73.0 BOOST_VERSION_UNDERSCORE=${BOOST_VERSION//\./_} NCORES=$(($(grep -c ^processor /proc/cpuinfo) + 1)) -curl -sL https://sourceforge.net/projects/boost/files/boost/${BOOST_VERSION}/boost_${BOOST_VERSION_UNDERSCORE}.tar.gz -o /boost_${BOOST_VERSION_UNDERSCORE}.tar.gz -tar xf boost_${BOOST_VERSION_UNDERSCORE}.tar.gz +curl -sL https://dl.bintray.com/boostorg/release/${BOOST_VERSION}/source/boost_${BOOST_VERSION_UNDERSCORE}.tar.bz2 -o /boost_${BOOST_VERSION_UNDERSCORE}.tar.bz2 Review comment: The Bintray limitation is bandwidth: https://github.com/boostorg/boost/issues/299#issuecomment-605462600 > The problem is not the reliability of the servers. The problem is that we have a large, but limited amount of (donated) bandwidth (about 30 TB/month). Someone could set up a set of mirrors, but who is going to provide (pay for) the service? It may be better that we mirror Boost at our Bintray and use our mirror. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7430: ARROW-9126: [C++] Fix building trimmed Boost bundle on Windows
nealrichardson commented on pull request #7430: URL: https://github.com/apache/arrow/pull/7430#issuecomment-645041382 Uploaded and re-triggered the failing build for confirmation that it's working now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7430: ARROW-9126: [C++] Fix building trimmed Boost bundle on Windows
nealrichardson commented on pull request #7430: URL: https://github.com/apache/arrow/pull/7430#issuecomment-645041249 @github-actions crossbow submit test-r-rstudio-r-base-3.6-bionic This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7458: ARROW-9122: [C++] Properly handle sliced arrays in ascii_lower, ascii_upper kernels
github-actions[bot] commented on pull request #7458: URL: https://github.com/apache/arrow/pull/7458#issuecomment-645039862 https://issues.apache.org/jira/browse/ARROW-9122 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm closed pull request #7445: ARROW-8583: [C++][Doc] Undocumented parameter in Dataset namespace
wesm closed pull request #7445: URL: https://github.com/apache/arrow/pull/7445 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm opened a new pull request #7458: ARROW-9122: [C++] Properly handle sliced arrays in ascii_lower, ascii_upper kernels
wesm opened a new pull request #7458: URL: https://github.com/apache/arrow/pull/7458 We must both take into account the referenced range of the data buffer and shift the value offsets if the array slice offset is nonzero. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm edited a comment on pull request #7315: ARROW-7605: [C++] Bundle jemalloc into static libarrow
wesm edited a comment on pull request #7315: URL: https://github.com/apache/arrow/pull/7315#issuecomment-645038858 Will this approach work on Windows? This doesn't look like it's going to work for all our dependencies that might be built in bundled mode. I still think the static lib splicing with `ar` (and equivalents on macOS and Windows) is the most viable solution. Do others have opinions? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7315: ARROW-7605: [C++] Bundle jemalloc into static libarrow
wesm commented on pull request #7315: URL: https://github.com/apache/arrow/pull/7315#issuecomment-645038858 Will this approach work on Windows? This doesn't look like it's going to work for all our dependencies that might be built in bundled mode. I still think the static lib splicing with AR is the most viable solution. Do others have opinions? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm closed pull request #7435: ARROW-8779: [R] Implement conversion to List
wesm closed pull request #7435: URL: https://github.com/apache/arrow/pull/7435 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7435: ARROW-8779: [R] Implement conversion to List
wesm commented on pull request #7435: URL: https://github.com/apache/arrow/pull/7435#issuecomment-645037889 thanks @romainfrancois! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on a change in pull request #7435: ARROW-8779: [R] Implement conversion to List
wesm commented on a change in pull request #7435: URL: https://github.com/apache/arrow/pull/7435#discussion_r441170772 ## File path: r/src/array_from_vector.cpp ## @@ -201,6 +202,67 @@ struct VectorToArrayConverter { return Status::OK(); } + template + arrow::enable_if_t::value, Status> Visit(const T& type) { +using BuilderType = typename TypeTraits::BuilderType; +ARROW_RETURN_IF(!Rf_inherits(x, "data.frame"), +Status::RError("Expecting a data frame")); + +auto* struct_builder = checked_cast(builder); + +int64_t n = vctrs::short_vec_size(x); Review comment: It seems in R it's possible to have data-frame-like tibble columns that are malformed viewed as Arrow-like (as evidenced by the unit test) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm closed pull request #7410: ARROW-971: [C++][Compute] IsValid, IsNull kernels
wesm closed pull request #7410: URL: https://github.com/apache/arrow/pull/7410 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7410: ARROW-971: [C++][Compute] IsValid, IsNull kernels
wesm commented on pull request #7410: URL: https://github.com/apache/arrow/pull/7410#issuecomment-645034827 Merging This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kszucs commented on pull request #6512: ARROW-8430: [CI] Configure self-hosted runners for Github Actions
kszucs commented on pull request #6512: URL: https://github.com/apache/arrow/pull/6512#issuecomment-645027704 Tried but it didn’t work. I can further investigate it, but passing “-j” fixed the resource issues for now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou commented on a change in pull request #6512: ARROW-8430: [CI] Configure self-hosted runners for Github Actions
kou commented on a change in pull request #6512: URL: https://github.com/apache/arrow/pull/6512#discussion_r441156336 ## File path: ci/docker/ubuntu-14.04-cpp.dockerfile ## @@ -68,7 +68,7 @@ ENV ARROW_BUILD_TESTS=ON \ ARROW_WITH_BROTLI=ON \ ARROW_WITH_BZ2=ON \ ARROW_WITH_LZ4=ON \ -ARROW_WITH_OPENSSL=OFF \ +PARQUET_REQUIRE_ENCRYPTION=OFF \ Review comment: Could you sort in alphabetical order? ## File path: ci/scripts/cpp_build.sh ## @@ -127,7 +127,11 @@ cmake -G "${CMAKE_GENERATOR:-Ninja}" \ ${CMAKE_ARGS} \ ${source_dir} -time cmake --build . --target install +if [ ! -z "${CPP_MAKE_PARALLELISM}" ]; then + time cmake --build . --target install -- -j${CPP_MAKE_PARALLELISM} Review comment: Does `CMAKE_BUILD_PARALLEL_LEVEL` not work with Ubuntu 20.04? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
wesm commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645023981 I went ahead and asked https://github.com/ufal/unilib/issues/2 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7430: ARROW-9126: [C++] Fix building trimmed Boost bundle on Windows
nealrichardson commented on pull request #7430: URL: https://github.com/apache/arrow/pull/7430#issuecomment-645021672 > OK. Could you update the bundle and then re-run the build? Yes, will do > And could you add me ( https://bintray.com/kou ) to https://bintray.com/ursalabs 's members? > I may want to upload something to there in the future. Done. > Or should we create https://bintray.com/apache/arrow/boost and migrate https://bintray.com/ursalabs/arrow-boost/arrow-boost to https://bintray.com/apache/arrow/boost ? We could I guess, though I don't think I have sufficient permissions on the org to do that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
wesm commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r441146931 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -30,6 +31,21 @@ namespace internal { namespace { +// Code units in the range [a-z] can only be an encoding of an ascii +// character/codepoint, not the 2nd, 3rd or 4th code unit (byte) of an different +// codepoint. This guaranteed by non-overlap design of the unicode standard. (see +// section 2.5 of Unicode Standard Core Specification v13.0) + +uint8_t ascii_tolower(uint8_t utf8_code_unit) { Review comment: I think you will want this to be `static inline` (not sure all compilers will inline it otherwise) ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -159,11 +282,23 @@ void MakeUnaryStringBatchKernel(std::string name, ArrayKernelExec exec, DCHECK_OK(registry->AddFunction(std::move(func))); } +template typename Transformer> Review comment: I think some compilers demand that "class" be used for the last "typename" This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou commented on pull request #6729: ARROW-8229: [Java] Move ArrowBuf into the Arrow package
kou commented on pull request #6729: URL: https://github.com/apache/arrow/pull/6729#issuecomment-645016478 Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou commented on pull request #7430: ARROW-9126: [C++] Fix building trimmed Boost bundle on Windows
kou commented on pull request #7430: URL: https://github.com/apache/arrow/pull/7430#issuecomment-645015432 OK. Could you update the bundle and then re-run the build? And could you add me ( https://bintray.com/kou ) to https://bintray.com/ursalabs 's members? I may want to upload something to there in the future. Or should we create https://bintray.com/apache/arrow/boost and migrate https://bintray.com/ursalabs/arrow-boost/arrow-boost to https://bintray.com/apache/arrow/boost ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7410: ARROW-971: [C++][Compute] IsValid, IsNull kernels
wesm commented on pull request #7410: URL: https://github.com/apache/arrow/pull/7410#issuecomment-645011421 @bkietz if you want to disable the ascii_* tests that are failing please go ahead This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on a change in pull request #7410: ARROW-971: [C++][Compute] IsValid, IsNull kernels
wesm commented on a change in pull request #7410: URL: https://github.com/apache/arrow/pull/7410#discussion_r441139760 ## File path: cpp/src/arrow/compute/kernels/scalar_validity.cc ## @@ -0,0 +1,107 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/common.h" + +#include "arrow/util/bit_util.h" +#include "arrow/util/bitmap_ops.h" + +namespace arrow { + +using internal::CopyBitmap; +using internal::InvertBitmap; + +namespace compute { +namespace { + +struct IsValidOperator { + static void Call(KernelContext* ctx, const Scalar& in, Scalar* out) { +checked_cast(out)->value = in.is_valid; + } + + static void Call(KernelContext* ctx, const ArrayData& arr, ArrayData* out) { +DCHECK_EQ(out->offset, 0); +DCHECK_LE(out->length, arr.length); +if (arr.buffers[0] != nullptr) { + out->buffers[1] = arr.offset == 0 +? arr.buffers[0] +: SliceBuffer(arr.buffers[0], arr.offset / 8, arr.length / 8); + out->offset = arr.offset % 8; Review comment: I don't know off the top of my head what will be the implications of modifying the offset of the output value, but I think it should be okay, and we can always fix it later if it becomes an issue This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm edited a comment on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm edited a comment on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645004792 I'll have to deal with the string optimization in a follow up PR, so I'm going to leave this for review as is. It would be good to get this merged sooner rather than later. EDIT: opened https://issues.apache.org/jira/browse/ARROW-9152 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645004792 I'll have to deal with the string optimization in a follow up PR, so I'm going to leave this for review as is. It would be good to get this merged sooner rather than later This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #7441: ARROW-3446: [R] Document mapping of Arrow <-> R types
nealrichardson closed pull request #7441: URL: https://github.com/apache/arrow/pull/7441 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #7457: ARROW-9151: [R][CI] Fix Rtools 4.0 build: pacman sync
nealrichardson closed pull request #7457: URL: https://github.com/apache/arrow/pull/7457 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles
bkietz commented on pull request #7156: URL: https://github.com/apache/arrow/pull/7156#issuecomment-644973548 Okay, I'll start trimming This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #7410: ARROW-971: [C++][Compute] IsValid, IsNull kernels
bkietz commented on a change in pull request #7410: URL: https://github.com/apache/arrow/pull/7410#discussion_r441096806 ## File path: cpp/src/arrow/compute/kernels/scalar_validity.cc ## @@ -0,0 +1,110 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/common.h" + +#include "arrow/util/bit_util.h" +#include "arrow/util/bitmap_ops.h" + +namespace arrow { + +using internal::CopyBitmap; +using internal::InvertBitmap; + +namespace compute { +namespace { + +struct IsValidOperator { + static void Call(KernelContext* ctx, const Scalar& in, Scalar* out) { +checked_cast(out)->value = in.is_valid; + } + + static void Call(KernelContext* ctx, const ArrayData& arr, ArrayData* out) { +if (arr.buffers[0] != nullptr && out->offset == arr.offset && +out->length == arr.length) { Review comment: PTAL This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7457: ARROW-9151: [R][CI] Fix Rtools 4.0 build: pacman sync
github-actions[bot] commented on pull request #7457: URL: https://github.com/apache/arrow/pull/7457#issuecomment-644969431 https://issues.apache.org/jira/browse/ARROW-9151 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] fsaintjacques commented on pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles
fsaintjacques commented on pull request #7156: URL: https://github.com/apache/arrow/pull/7156#issuecomment-644969405 Yes, we can already create FileFragment from any [FileSource](https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_base.h#L153-L157). You make a valid point that, usually, this is meant for a single buffer. If this is only scoped to FileSource and the python bindings, not touching any Factory, then this is fine as is. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #7453: ARROW-9141: [R] Update cross-package documentation links
nealrichardson closed pull request #7453: URL: https://github.com/apache/arrow/pull/7453 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorisvandenbossche commented on pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles
jorisvandenbossche commented on pull request #7156: URL: https://github.com/apache/arrow/pull/7156#issuecomment-644966741 Note that it is not *only* for testing. We for sure use it for testing in pyarrow, but in pandas 1.0.4, we accidentally broke reading parquet files from file-like objects, and we directly got a some bug reports about it. So actual users also do that, to a certain extent. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles
jorisvandenbossche edited a comment on pull request #7156: URL: https://github.com/apache/arrow/pull/7156#issuecomment-644966084 Taking a step back: wouldn't it be possible to eg "just" allow to create a Fragment from a buffer instead from a file? In practice, I think we only need to support dealing with buffers when there is a *single* buffer (so not like paths, where you can have multiple paths or a directory etc). And then do we need discovery at all? If we can construct a Fragment backed by a buffer instead of a file path, then you can create a Dataset from that, either with the physical schema of the fragment (no unification is needed if there is only one) or either with a user-specified schema. And in such a case, the factory can focus on file paths only. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #7451: ARROW-8769: [C++][R] Add convenience accessor for StructScalar fields
nealrichardson closed pull request #7451: URL: https://github.com/apache/arrow/pull/7451 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7451: ARROW-8769: [C++][R] Add convenience accessor for StructScalar fields
nealrichardson commented on pull request #7451: URL: https://github.com/apache/arrow/pull/7451#issuecomment-644966570 Rtools build failure is ARROW-9151. I'll merge. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson opened a new pull request #7457: ARROW-9151: [R][CI] Fix Rtools 4.0 build: pacman sync
nealrichardson opened a new pull request #7457: URL: https://github.com/apache/arrow/pull/7457 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorisvandenbossche commented on pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles
jorisvandenbossche commented on pull request #7156: URL: https://github.com/apache/arrow/pull/7156#issuecomment-644966084 Taking a step back: wouldn't it be possible to eg "just" allow to create a Fragment from a buffer instead from a file? In practice, I think we only need to support dealing with buffers when there is a *single* buffer (so not like paths, where you can have multiple paths or a directory etc). And then do we need discovery at all? If we can construct a Fragment backed by a buffer instead of a file path, then you can create a Dataset from that, either with the physical schema of the fragment (no unification is needed if there is only one) or either with a user-specified schema. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7451: ARROW-8769: [C++][R] Add convenience accessor for StructScalar fields
fsaintjacques commented on a change in pull request #7451: URL: https://github.com/apache/arrow/pull/7451#discussion_r441071413 ## File path: cpp/src/arrow/scalar_test.cc ## @@ -433,4 +433,24 @@ TYPED_TEST(TestNumericScalar, Cast) { } } +TEST(TestStructScalar, FieldAccess) { + StructScalar abc({MakeScalar(true), MakeNullScalar(int32()), MakeScalar("hello")}, + struct_({ + field("a", boolean()), + field("b", int32()), + field("b", utf8()), + })); + + ASSERT_OK_AND_ASSIGN(auto a, abc.field("a")); + AssertScalarsEqual(*a, *abc.value[0]); + + ASSERT_RAISES(Invalid, abc.field("b").status()); + + ASSERT_OK_AND_ASSIGN(auto b, abc.field(1)); + AssertScalarsEqual(*b, *abc.value[1]); + + ASSERT_RAISES(Invalid, abc.field(5).status()); + ASSERT_RAISES(Invalid, abc.field("c").status()); Review comment: Add a nested failure (not implemented). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #7441: ARROW-3446: [R] Document mapping of Arrow <-> R types
nealrichardson commented on a change in pull request #7441: URL: https://github.com/apache/arrow/pull/7441#discussion_r441079053 ## File path: r/src/array_to_vector.cpp ## @@ -418,6 +418,7 @@ class Converter_Struct : public Converter { std::vector> converters; }; +// Shouldn't this cast before dividing? Otherwise we're doing integer division Review comment: I made the change and tests pass, so let's go with it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7441: ARROW-3446: [R] Document mapping of Arrow <-> R types
fsaintjacques commented on a change in pull request #7441: URL: https://github.com/apache/arrow/pull/7441#discussion_r440846641 ## File path: r/vignettes/arrow.Rmd ## @@ -86,7 +88,73 @@ to other applications and services that use Arrow. One example is Spark: the move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). -# Class structure and package conventions +# Internals + +## Mapping of R <--> Arrow types + +Arrow has a rich data type system that includes direct parallels with R's data types and much more. + +In the tables, entries with a `-` are not currently implemented. + +### R to Arrow + +| R type | Arrow type | +|--|| +| logical | boolean| +| integer | int32 | +| double ("numeric") | float64| +| character| utf8 | +| factor | dictionary | +| raw | uint8 | +| Date | date32 | +| POSIXct | timestamp | +| POSIXlt | - | +| data.frame | struct | +| list^+^ | list | +| bit64::integer64 | int64 | +| difftime | time32 | +| vctrs::vctrs_unspecified | null | + +^+^: Only lists where all elements are the same type are able to be translated to Arrow list type (which is a "list of" some type). + +### Arrow to R + +| Arrow type| R type | +|---|--| +| boolean | logical | +| int8 | integer | +| int16 | integer | +| int32 | integer | +| int64 | bit64::integer64 | +| uint8 | integer | +| uint16| integer | +| uint32| double | Review comment: I'm curious about the double, is it a copy-paste error, or does it really cast to double. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #7441: ARROW-3446: [R] Document mapping of Arrow <-> R types
nealrichardson commented on a change in pull request #7441: URL: https://github.com/apache/arrow/pull/7441#discussion_r441074458 ## File path: r/vignettes/arrow.Rmd ## @@ -86,7 +88,73 @@ to other applications and services that use Arrow. One example is Spark: the move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). -# Class structure and package conventions +# Internals + +## Mapping of R <--> Arrow types + +Arrow has a rich data type system that includes direct parallels with R's data types and much more. + +In the tables, entries with a `-` are not currently implemented. + +### R to Arrow + +| R type | Arrow type | +|--|| +| logical | boolean| +| integer | int32 | +| double ("numeric") | float64| +| character| utf8 | +| factor | dictionary | +| raw | uint8 | +| Date | date32 | +| POSIXct | timestamp | +| POSIXlt | - | +| data.frame | struct | +| list^+^ | list | +| bit64::integer64 | int64 | +| difftime | time32 | +| vctrs::vctrs_unspecified | null | + +^+^: Only lists where all elements are the same type are able to be translated to Arrow list type (which is a "list of" some type). + +### Arrow to R + +| Arrow type| R type | +|---|--| +| boolean | logical | +| int8 | integer | +| int16 | integer | +| int32 | integer | +| int64 | bit64::integer64 | +| uint8 | integer | +| uint16| integer | +| uint32| double | Review comment: Expanded https://jira.apache.org/jira/browse/ARROW-9083 to cover uint{32,64}. I'm going to proceed with this PR as is and will happily revise the docs if we change the implementation. Thanks for the discussion--shining a light on things like this is a big reason I picked up the documentation task. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] maartenbreddels commented on pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain
maartenbreddels commented on pull request #7452: URL: https://github.com/apache/arrow/pull/7452#issuecomment-644951837 @xhochy do you want me to rebase https://github.com/apache/arrow/pull/7449 on this? So we can see if it's all working? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-644949978 It's not *that* slow, it was 40% of Vaex' performance (single threaded), so I think there is a bit more to be gained still. But I have added an optimization that tries ASCII conversion first. This gives it a 7x (compared to Vaex) to 10x speedup (in the benchmarks). Before: ``` Utf8Lower 193873803 ns193823124 ns3 bytes_per_second=102.387M/s items_per_second=5.40996M/s Utf8Upper 197154929 ns197093083 ns4 bytes_per_second=100.688M/s items_per_second=5.32021M/s ``` After: ``` Utf8Lower19508443 ns 19493652 ns 36 bytes_per_second=1018.02M/s items_per_second=53.7906M/s Utf8Upper19846885 ns 19832066 ns 35 bytes_per_second=1000.65M/s items_per_second=52.8728M/s ``` There is one loose end, the growth of the string can cause a utf8 array to be promoted to a large_utf8. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7453: ARROW-9141: [R] Update cross-package documentation links
github-actions[bot] commented on pull request #7453: URL: https://github.com/apache/arrow/pull/7453#issuecomment-644946345 Revision: bdaec94d2b1c88216a4b9e0395c98438b5cf5c5e Submitted crossbow builds: [ursa-labs/crossbow @ actions-325](https://github.com/ursa-labs/crossbow/branches/all?query=actions-325) |Task|Status| ||--| |test-r-linux-as-cran|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-325-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-325-github-test-r-linux-as-cran)| This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7453: ARROW-9141: [R] Update cross-package documentation links
nealrichardson commented on pull request #7453: URL: https://github.com/apache/arrow/pull/7453#issuecomment-644945052 @github-actions crossbow submit *as-cran* This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #7410: ARROW-971: [C++][Compute] IsValid, IsNull kernels
bkietz commented on a change in pull request #7410: URL: https://github.com/apache/arrow/pull/7410#discussion_r441019388 ## File path: cpp/src/arrow/compute/kernels/scalar_validity.cc ## @@ -0,0 +1,110 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/common.h" + +#include "arrow/util/bit_util.h" +#include "arrow/util/bitmap_ops.h" + +namespace arrow { + +using internal::CopyBitmap; +using internal::InvertBitmap; + +namespace compute { +namespace { + +struct IsValidOperator { + static void Call(KernelContext* ctx, const Scalar& in, Scalar* out) { +checked_cast(out)->value = in.is_valid; + } + + static void Call(KernelContext* ctx, const ArrayData& arr, ArrayData* out) { +if (arr.buffers[0] != nullptr && out->offset == arr.offset && +out->length == arr.length) { Review comment: As it happens this fails for sliced input. I'll restore the checks. For your interest, one approach i tried for `is_valid` was a `MetaFunction` which invoked a no-op ScalarFunction with INTERSECTION null handling then yanked the null bitmap from that. Unfortunately INTERSECTION doesn't currently support the zero copy case and I wasn't sure the approach would be acceptable but it did avoid repetition of null bitmap handling logic. What do you think? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] fsaintjacques edited a comment on pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles
fsaintjacques edited a comment on pull request #7156: URL: https://github.com/apache/arrow/pull/7156#issuecomment-644939437 I feel 0.5 on this PR in general, the functionality it adds is initially for testing and it introduces debt. I'm not keen on the change on FileSystemFactory since this class is used for 3 purposes, and only one of them is used by Native handles: - [x] Unify schema - [ ] Crawl a file systems to explore (requires a filesystem instance) - [ ] Discover partition information (requires a path) This patch retrofits FileSystemFactory for no other purpose than making buffer sources accessible in python while ignoring the actual discovery. In reality, this class deals heavily and almost exclusively with paths. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7455: ARROW-9031: [R] Implement conversion from Type::UINT64 to R vector
nealrichardson commented on pull request #7455: URL: https://github.com/apache/arrow/pull/7455#issuecomment-644940220 > Why does `int64` use `Converter_Int64` but the other types are handled differently? Historical reasons I suppose. See also this discussion: https://github.com/apache/arrow/pull/7441#discussion_r440846641 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] fsaintjacques commented on pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles
fsaintjacques commented on pull request #7156: URL: https://github.com/apache/arrow/pull/7156#issuecomment-644939437 I feel 0.5 on this PR in general, the functionality it adds is initially for testing and it introduces debt. I'm not keen on the change on FileSystemFactory since this class is used for 3 purposes, and only of of them is used by Native handles: - [x] Unify schema - [ ] Crawl a file systems to explore (requires a filesystem instance) - [ ] Discover partition information (requires a path) This patch retrofits FileSystemFactory for no other purpose than making buffer sources accessible in python while ignoring the actual discovery. In reality, this class deals heavily and almost exclusively with paths. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org