[jira] [Updated] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8166: -- Labels: pull-request-available (was: ) > [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04 > > > Key: ARROW-8166 > URL: https://issues.apache.org/jira/browse/ARROW-8166 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Frank Du >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > > cc [~frank.du] > I have an i9-9960X AVX512-capable CPU but I see > {code} > /usr/bin/ccache /usr/bin/clang++-8 -DARROW_EXTRA_ERROR_CONTEXT -DARROW_HDFS > -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_NO_DEPRECATED_API > -DARROW_USE_SIMD -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 > -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_ZLIB > -DARROW_WITH_ZSTD -DURI_STATIC_BUILD -Isrc -I../src -I../src/generated > -isystem ../thirdparty/flatbuffers/include -isystem > /home/wesm/cpp-toolchain/include -isystem jemalloc_ep-prefix/src -isystem > ../thirdparty/hadoop/include -Qunused-arguments -fcolor-diagnostics > -fuse-ld=gold -ggdb -O0 -Wall -Wextra -Wdocumentation -Wno-missing-braces > -Wno-unused-parameter -Wno-unknown-warning-option > -Wno-constant-logical-operand -Werror -Wno-unknown-warning-option > -march=skylake-avx512 -maltivec -fno-omit-frame-pointer -g -fPIE -pthread > -std=gnu++11 -MD -MT > src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -MF > src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o.d -o > src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -c > ../src/arrow/util/rle_encoding_test.cc > In file included from ../src/arrow/util/rle_encoding_test.cc:33: > In file included from ../src/arrow/util/bit_stream_utils.h:28: > ../src/arrow/util/bpacking.h:49:5: error: use of undeclared identifier > '__m512i_u' > *(__m512i_u*)out = 
_mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); > ^ > ../src/arrow/util/bpacking.h:49:15: error: expected expression > *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); > ^ > ../src/arrow/util/bpacking.h:55:5: error: use of undeclared identifier > '__m512i_u' > *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); > ^ > ../src/arrow/util/bpacking.h:55:15: error: expected expression > *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); > ^ > 4 errors generated. > {code} > I tried compiling with gcc 8.3 instead of clang-8 and it worked. So it seems > that because the base gcc toolchain on Ubuntu 18.04 is gcc 7.x that the > clang-* versions are using the gcc-7 toolchain headers. Evidently we will > need to make this more robust -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063069#comment-17063069 ] Wes McKinney commented on ARROW-8166: - OK, you have reproduced it
[jira] [Assigned] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Du reassigned ARROW-8166: --- Assignee: Frank Du
[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063049#comment-17063049 ] Frank Du commented on ARROW-8166: - Reproduced it by passing clang as the compiler:

-DCMAKE_C_COMPILER=clang-8 \
-DCMAKE_CXX_COMPILER=clang++-8 \

/mnt/arrow/cpp/src/arrow/util/bpacking.h:49:5: error: use of undeclared identifier '__m512i_u'
*(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063046#comment-17063046 ] Frank Du commented on ARROW-8166: - Seems I'm still using gcc for the build; how can I change to clang? Sorry, I'm not very familiar with this part.

cd /mnt/arrow/cpp/build/src/arrow/util && /usr/bin/c++ -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_USE_SIMD -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DGTEST_LINKED_AS_SHARED_LIBRARY=1 -DURI_STATIC_BUILD -isystem /mnt/arrow/cpp/thirdparty/flatbuffers/include -isystem /mnt/arrow/cpp/build/boost_ep-prefix/src/boost_ep -isystem /mnt/arrow/cpp/build/snappy_ep/src/snappy_ep-install/include -isystem /mnt/arrow/cpp/build/gflags_ep-prefix/src/gflags_ep/include -isystem /mnt/arrow/cpp/build/thrift_ep-install/include -isystem /mnt/arrow/cpp/build/protobuf_ep-install/include -isystem /mnt/arrow/cpp/build/jemalloc_ep-prefix/src -isystem /mnt/arrow/cpp/build/googletest_ep-prefix/src/googletest_ep/include -isystem /mnt/arrow/cpp/build/gbenchmark_ep/src/gbenchmark_ep-install/include -isystem /mnt/arrow/cpp/build/rapidjson_ep/src/rapidjson_ep-install/include -isystem /mnt/arrow/cpp/build/re2_ep-install/include -isystem /mnt/arrow/cpp/thirdparty/hadoop/include -I/mnt/arrow/cpp/build/src -I/mnt/arrow/cpp/src -I/mnt/arrow/cpp/src/generated -Wno-noexcept-type -fdiagnostics-color=always -O3 -DNDEBUG -Wall -march=skylake-avx512 -O3 -DNDEBUG -fPIE -pthread -std=gnu++11 -o CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -c /mnt/arrow/cpp/src/arrow/util/rle_encoding_test.cc
[ 49%] Linking CXX executable ../../../release/arrow-utility-test
[jira] [Updated] (ARROW-8169) [Java] Improve the performance of JDBC adapter by allocating memory proactively
[ https://issues.apache.org/jira/browse/ARROW-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8169: -- Labels: pull-request-available (was: ) > [Java] Improve the performance of JDBC adapter by allocating memory > proactively > --- > > Key: ARROW-8169 > URL: https://issues.apache.org/jira/browse/ARROW-8169 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > The current implementation uses {{setSafe}} methods to dynamically allocate > memory if necessary. For fixed-width vectors (which are frequently used in > JDBC), however, we can allocate memory proactively, since the vector size is > known as a configuration parameter. So for fixed-width vectors, we can use > {{set}} methods instead. > This change leads to two benefits: > 1. When processing each value, we no longer have to check vector capacity and > reallocate memory if needed. This leads to better performance. > 2. If we allow the memory to expand automatically (each time by 2x), the > amount of memory usually ends up being more than necessary. By allocating > memory according to the configuration parameter, we allocate no more and no less. > Benchmark results show notable performance improvements: > Before: > Benchmark Mode Cnt Score Error Units > JdbcAdapterBenchmarks.consumeBenchmark avgt 5 521.700 ± 4.837 us/op > After: > Benchmark Mode Cnt Score Error Units > JdbcAdapterBenchmarks.consumeBenchmark avgt 5 430.523 ± 9.932 us/op
[jira] [Created] (ARROW-8169) [Java] Improve the performance of JDBC adapter by allocating memory proactively
Liya Fan created ARROW-8169: --- Summary: [Java] Improve the performance of JDBC adapter by allocating memory proactively Key: ARROW-8169 URL: https://issues.apache.org/jira/browse/ARROW-8169 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan
[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063044#comment-17063044 ] Wes McKinney commented on ARROW-8166: - I'll investigate some more and see if I can boil down what is different on my system
[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063041#comment-17063041 ] Frank Du commented on ARROW-8166: -

root@9735cf0f4203:/usr# grep __m512i_u * -R
lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h:typedef long long __m512i_u __attribute__ ((__vector_size__ (64), __may_alias__, __aligned__ (1)));
lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h: return *(__m512i_u *)__P;
lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h: *(__m512i_u *)__P = __A;
lib/gcc/x86_64-linux-gnu/7.5.0/include/avx512fintrin.h:typedef long long __m512i_u __attribute__ ((__vector_size__ (64), __may_alias__, __aligned__ (1)));
lib/gcc/x86_64-linux-gnu/7.5.0/include/avx512fintrin.h: return *(__m512i_u *)__P;
lib/gcc/x86_64-linux-gnu/7.5.0/include/avx512fintrin.h: *(__m512i_u *)__P = __A;
[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063040#comment-17063040 ] Frank Du commented on ARROW-8166: - I tried a quick build in a Docker context with the ubuntu:18.04 image, and the build succeeded. Below are the commands:

sudo docker run -it -v /home/pnp/arrow/:/mnt ubuntu:18.04
apt-get update
apt-get install llvm-8 cmake build-essential clang-8 autoconf libboost-dev libboost-filesystem-dev libboost-system-dev libboost-regex-dev libjemalloc-dev
cmake -DARROW_WITH_SNAPPY=ON \
  -DARROW_GANDIVA=ON \
  -DARROW_PARQUET=ON \
  -DARROW_BUILD_TESTS=ON \
  -DARROW_BUILD_BENCHMARKS=ON \
  -DARROW_SIMD_LEVEL=AVX512 \
  ..
make -j16

And below is the info of GCC and LLVM:

gcc -v
Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

clang-8 -v
clang version 8.0.0-3~ubuntu18.04.2 (tags/RELEASE_800/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/8
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/8
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
[jira] [Updated] (ARROW-8138) [C++] parquet::arrow::FileReader cannot read multiple RowGroup
[ https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8138: Summary: [C++] parquet::arrow::FileReader cannot read multiple RowGroup (was: parquet::arrow::FileReader cannot read multiple RowGroup) > [C++] parquet::arrow::FileReader cannot read multiple RowGroup > -- > > Key: ARROW-8138 > URL: https://issues.apache.org/jira/browse/ARROW-8138 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.16.0 > Environment: Centos 7 >Reporter: Feng Tian >Priority: Major > Attachments: bug.cpp, bug.parquet > > > When using parquet::arrow::FileReader to read a Parquet file consisting of multiple > row groups, > {code:c++} > reader->RowGroup(i)->Column(c)->Read > {code} > it repeatedly reads data from the first row group.
[jira] [Updated] (ARROW-8138) [C++] parquet::arrow::FileReader cannot read multiple RowGroup
[ https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8138: Fix Version/s: 0.17.0
[jira] [Created] (ARROW-8168) Improve Java Plasma client off-heap memory usage
KunshangJi created ARROW-8168: - Summary: Improve Java Plasma client off-heap memory usage Key: ARROW-8168 URL: https://issues.apache.org/jira/browse/ARROW-8168 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: KunshangJi Fix For: 0.17.0 Currently, the Plasma Java client API uses byte[], which requires copying memory from the Java heap to off-heap memory (an mmap'd file). We can improve the create() and get() methods to return a ByteBuffer or DirectByteBuffer and avoid the unnecessary memory copy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8152) [C++] IO: split large coalesced reads into smaller ones
[ https://issues.apache.org/jira/browse/ARROW-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062941#comment-17062941 ] David Li commented on ARROW-8152: - Yes, having an options struct for those parameters (and potentially others, e.g. if we want an AsyncContext) makes sense to me. > [C++] IO: split large coalesced reads into smaller ones > --- > > Key: ARROW-8152 > URL: https://issues.apache.org/jira/browse/ARROW-8152 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > > We have a facility to coalesce small reads, but remote filesystems may also > benefit from splitting large reads to take advantage of concurrency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup
[ https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062935#comment-17062935 ] Feng Tian commented on ARROW-8138: -- I attached a quick repro – bug.parquet is a data file with multiple row groups, each row is an int, float pair. bug.cpp should repro. As a side note – I generally follow the cpp examples, but it seems none of the parquet examples cover the case of multiple row groups. > parquet::arrow::FileReader cannot read multiple RowGroup > > > Key: ARROW-8138 > URL: https://issues.apache.org/jira/browse/ARROW-8138 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.16.0 > Environment: Centos 7 >Reporter: Feng Tian >Priority: Major > Attachments: bug.cpp, bug.parquet > > > When using parquet::arrow::FileReader to read a parquet file consisting of multiple > row groups, > {code:c++} > reader->RowGroup(i)->Column(c)->Read > {code} > it will repeatedly read the data of the first row group. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup
[ https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Tian updated ARROW-8138: - Attachment: bug.cpp > parquet::arrow::FileReader cannot read multiple RowGroup > > > Key: ARROW-8138 > URL: https://issues.apache.org/jira/browse/ARROW-8138 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.16.0 > Environment: Centos 7 >Reporter: Feng Tian >Priority: Major > Attachments: bug.cpp, bug.parquet > > > When using parquet::arrow::FileReader to read a parquet file consisting of multiple > row groups, > {code:c++} > reader->RowGroup(i)->Column(c)->Read > {code} > it will repeatedly read the data of the first row group. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup
[ https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Tian updated ARROW-8138: - Attachment: bug.parquet > parquet::arrow::FileReader cannot read multiple RowGroup > > > Key: ARROW-8138 > URL: https://issues.apache.org/jira/browse/ARROW-8138 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.16.0 > Environment: Centos 7 >Reporter: Feng Tian >Priority: Major > Attachments: bug.cpp, bug.parquet > > > When using parquet::arrow::FileReader to read a parquet file consisting of multiple > row groups, > {code:c++} > reader->RowGroup(i)->Column(c)->Read > {code} > it will repeatedly read the data of the first row group. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8167) [CI] Add support for skipping builds with skip pattern in pull request title
[ https://issues.apache.org/jira/browse/ARROW-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-8167: --- Summary: [CI] Add support for skipping builds with skip pattern in pull request title (was: [CI] Add support for skipping builds via commit messages) > [CI] Add support for skipping builds with skip pattern in pull request title > > > Key: ARROW-8167 > URL: https://issues.apache.org/jira/browse/ARROW-8167 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > GitHub Actions doesn't support skipping builds marked as [skip ci] by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8118) [R] dim method for FileSystemDataset
[ https://issues.apache.org/jira/browse/ARROW-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-8118: --- Assignee: Sam Albers > [R] dim method for FileSystemDataset > > > Key: ARROW-8118 > URL: https://issues.apache.org/jira/browse/ARROW-8118 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Sam Albers >Assignee: Sam Albers >Priority: Minor > Labels: features, pull-request-available > Time Spent: 6h 40m > Remaining Estimate: 0h > > I've been using this function enough that I wonder a) whether it would be useful in the > package and b) whether this is something you think is worth working on. The > basic problem is that if you have a hierarchical file structure that > accommodates using open_dataset, it is definitely useful to know the amount > of data you are dealing with. My idea is that 'FileSystemDataset' would have > dim, nrow and ncol methods. Here is how I've been using it: > {code:java} > library(arrow) > x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month")) > dim_arrow <- function(x) { > rows <- sum(purrr::map_dbl(x$files, > ~ParquetFileReader$create(.x)$ReadTable()$num_rows)) > cols <- x$schema$num_fields > > c(rows, cols) > } > dim_arrow(x) > #> [1] 426929 7 > {code} > > Ideally this would work on arrow_dplyr_query objects as well but I haven't > quite figured out how that filters based on the partitioning variables. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8118) [R] dim method for FileSystemDataset
[ https://issues.apache.org/jira/browse/ARROW-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-8118. - Fix Version/s: 0.17.0 Resolution: Fixed Issue resolved by pull request 6635 [https://github.com/apache/arrow/pull/6635] > [R] dim method for FileSystemDataset > > > Key: ARROW-8118 > URL: https://issues.apache.org/jira/browse/ARROW-8118 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Sam Albers >Assignee: Sam Albers >Priority: Minor > Labels: features, pull-request-available > Fix For: 0.17.0 > > Time Spent: 6h 40m > Remaining Estimate: 0h > > I've been using this function enough that I wonder a) whether it would be useful in the > package and b) whether this is something you think is worth working on. The > basic problem is that if you have a hierarchical file structure that > accommodates using open_dataset, it is definitely useful to know the amount > of data you are dealing with. My idea is that 'FileSystemDataset' would have > dim, nrow and ncol methods. Here is how I've been using it: > {code:java} > library(arrow) > x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month")) > dim_arrow <- function(x) { > rows <- sum(purrr::map_dbl(x$files, > ~ParquetFileReader$create(.x)$ReadTable()$num_rows)) > cols <- x$schema$num_fields > > c(rows, cols) > } > dim_arrow(x) > #> [1] 426929 7 > {code} > > Ideally this would work on arrow_dplyr_query objects as well but I haven't > quite figured out how that filters based on the partitioning variables. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8167) [CI] Add support for skipping builds via commit messages
[ https://issues.apache.org/jira/browse/ARROW-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8167: -- Labels: pull-request-available (was: ) > [CI] Add support for skipping builds via commit messages > > > Key: ARROW-8167 > URL: https://issues.apache.org/jira/browse/ARROW-8167 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > > GitHub Actions doesn't support skipping builds marked as [skip ci] by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8167) [CI] Add support for skipping builds via commit messages
Krisztian Szucs created ARROW-8167: -- Summary: [CI] Add support for skipping builds via commit messages Key: ARROW-8167 URL: https://issues.apache.org/jira/browse/ARROW-8167 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Krisztian Szucs Assignee: Krisztian Szucs GitHub Actions doesn't support skipping builds marked as [skip ci] by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format
[ https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062844#comment-17062844 ] Wes McKinney commented on ARROW-7854: - Well, it seems like this detail should perhaps not be so visible to users. If an interface prefers memory mapping if it's available, then it can do so without leaking this configuration detail into some other part of the system > [C++][Dataset] Option to memory map when reading IPC format > --- > > Key: ARROW-7854 > URL: https://issues.apache.org/jira/browse/ARROW-7854 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Francois Saint-Jacques >Priority: Major > > For the IPC format it would be interesting to be able to memory map the IPC > files? > cc [~fsaintjacques] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
Wes McKinney created ARROW-8166: --- Summary: [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04 Key: ARROW-8166 URL: https://issues.apache.org/jira/browse/ARROW-8166 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.17.0 cc [~frank.du] I have an i9-9960X AVX512-capable process but I see {code} /usr/bin/ccache /usr/bin/clang++-8 -DARROW_EXTRA_ERROR_CONTEXT -DARROW_HDFS -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_NO_DEPRECATED_API -DARROW_USE_SIMD -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DURI_STATIC_BUILD -Isrc -I../src -I../src/generated -isystem ../thirdparty/flatbuffers/include -isystem /home/wesm/cpp-toolchain/include -isystem jemalloc_ep-prefix/src -isystem ../thirdparty/hadoop/include -Qunused-arguments -fcolor-diagnostics -fuse-ld=gold -ggdb -O0 -Wall -Wextra -Wdocumentation -Wno-missing-braces -Wno-unused-parameter -Wno-unknown-warning-option -Wno-constant-logical-operand -Werror -Wno-unknown-warning-option -march=skylake-avx512 -maltivec -fno-omit-frame-pointer -g -fPIE -pthread -std=gnu++11 -MD -MT src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -MF src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o.d -o src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -c ../src/arrow/util/rle_encoding_test.cc In file included from ../src/arrow/util/rle_encoding_test.cc:33: In file included from ../src/arrow/util/bit_stream_utils.h:28: ../src/arrow/util/bpacking.h:49:5: error: use of undeclared identifier '__m512i_u' *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ ../src/arrow/util/bpacking.h:49:15: error: expected expression *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ ../src/arrow/util/bpacking.h:55:5: error: use of undeclared identifier '__m512i_u' *(__m512i_u*)out = 
_mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ ../src/arrow/util/bpacking.h:55:15: error: expected expression *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ 4 errors generated. {code} I tried compiling with gcc 8.3 instead of clang-8 and it worked. So it seems that because the base gcc toolchain on Ubuntu 18.04 is gcc 7.x that the clang-* versions are using the gcc-7 toolchain headers. Evidently we will need to make this more robust -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8166: Description: cc [~frank.du] I have an i9-9960X AVX512-capable CPU but I see {code} /usr/bin/ccache /usr/bin/clang++-8 -DARROW_EXTRA_ERROR_CONTEXT -DARROW_HDFS -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_NO_DEPRECATED_API -DARROW_USE_SIMD -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DURI_STATIC_BUILD -Isrc -I../src -I../src/generated -isystem ../thirdparty/flatbuffers/include -isystem /home/wesm/cpp-toolchain/include -isystem jemalloc_ep-prefix/src -isystem ../thirdparty/hadoop/include -Qunused-arguments -fcolor-diagnostics -fuse-ld=gold -ggdb -O0 -Wall -Wextra -Wdocumentation -Wno-missing-braces -Wno-unused-parameter -Wno-unknown-warning-option -Wno-constant-logical-operand -Werror -Wno-unknown-warning-option -march=skylake-avx512 -maltivec -fno-omit-frame-pointer -g -fPIE -pthread -std=gnu++11 -MD -MT src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -MF src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o.d -o src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -c ../src/arrow/util/rle_encoding_test.cc In file included from ../src/arrow/util/rle_encoding_test.cc:33: In file included from ../src/arrow/util/bit_stream_utils.h:28: ../src/arrow/util/bpacking.h:49:5: error: use of undeclared identifier '__m512i_u' *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ ../src/arrow/util/bpacking.h:49:15: error: expected expression *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ ../src/arrow/util/bpacking.h:55:5: error: use of undeclared identifier '__m512i_u' *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ ../src/arrow/util/bpacking.h:55:15: error: expected expression 
*(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ 4 errors generated. {code} I tried compiling with gcc 8.3 instead of clang-8 and it worked. So it seems that because the base gcc toolchain on Ubuntu 18.04 is gcc 7.x that the clang-* versions are using the gcc-7 toolchain headers. Evidently we will need to make this more robust was: cc [~frank.du] I have an i9-9960X AVX512-capable process but I see {code} /usr/bin/ccache /usr/bin/clang++-8 -DARROW_EXTRA_ERROR_CONTEXT -DARROW_HDFS -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_NO_DEPRECATED_API -DARROW_USE_SIMD -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DURI_STATIC_BUILD -Isrc -I../src -I../src/generated -isystem ../thirdparty/flatbuffers/include -isystem /home/wesm/cpp-toolchain/include -isystem jemalloc_ep-prefix/src -isystem ../thirdparty/hadoop/include -Qunused-arguments -fcolor-diagnostics -fuse-ld=gold -ggdb -O0 -Wall -Wextra -Wdocumentation -Wno-missing-braces -Wno-unused-parameter -Wno-unknown-warning-option -Wno-constant-logical-operand -Werror -Wno-unknown-warning-option -march=skylake-avx512 -maltivec -fno-omit-frame-pointer -g -fPIE -pthread -std=gnu++11 -MD -MT src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -MF src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o.d -o src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -c ../src/arrow/util/rle_encoding_test.cc In file included from ../src/arrow/util/rle_encoding_test.cc:33: In file included from ../src/arrow/util/bit_stream_utils.h:28: ../src/arrow/util/bpacking.h:49:5: error: use of undeclared identifier '__m512i_u' *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ ../src/arrow/util/bpacking.h:49:15: error: expected expression *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ 
../src/arrow/util/bpacking.h:55:5: error: use of undeclared identifier '__m512i_u' *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ ../src/arrow/util/bpacking.h:55:15: error: expected expression *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks); ^ 4 errors generated. {code} I tried compiling with gcc 8.3 instead of clang-8 and it worked. So it seems that because the base gcc toolchain on Ubuntu 18.04 is gcc 7.x that the clang-* versions are using the gcc-7 toolchain headers. Evidently we will need to make this more robust > [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04 > > > Key: ARROW-8166 > URL: https://issues.apache.org/jira/browse/ARROW-8166 >
[jira] [Updated] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)
[ https://issues.apache.org/jira/browse/ARROW-8061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8061: -- Labels: pull-request-available (was: ) > [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support > row groups) > - > > Key: ARROW-8061 > URL: https://issues.apache.org/jira/browse/ARROW-8061 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > > Specifically for parquet (not sure if it will be relevant for other file > formats as well, for IPC/feather potentially the record batch), it would be > useful to target row groups instead of files as fragments. > Quoting the original design documents: _"In datasets consisting of many > fragments, the dataset API must expose the granularity of fragments in a > public way to enable parallel processing, if desired. "._ > And a comment from Wes on that: _"a single Parquet file can "export" one or > more fragments based on settings. The default might be to split fragments > based on row group"_ > Currently, the level on which fragments are defined (at least in the typical > partitioned parquet dataset) is "1 file == 1 fragment". > Would it be possible or desirable to make this more fine-grained, where you > could also opt to have a fragment per row group? > We could have a ParquetFragment that has this option, and a ParquetFileFormat > specific option to say what the granularity of a fragment is (file vs row > group)? > cc [~fsaintjacques] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8165) [Packaging] Make nightly wheels available on a PyPI server
[ https://issues.apache.org/jira/browse/ARROW-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-8165: --- Summary: [Packaging] Make nightly wheels available on a PyPI server (was: [Packaging] Make nightly wheels available) > [Packaging] Make nightly wheels available on a PyPI server > -- > > Key: ARROW-8165 > URL: https://issues.apache.org/jira/browse/ARROW-8165 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8165) [Packaging] Make nightly wheels available on a PyPI server
[ https://issues.apache.org/jira/browse/ARROW-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8165: -- Labels: pull-request-available (was: ) > [Packaging] Make nightly wheels available on a PyPI server > -- > > Key: ARROW-8165 > URL: https://issues.apache.org/jira/browse/ARROW-8165 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup
[ https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062835#comment-17062835 ] Francois Saint-Jacques commented on ARROW-8138: --- Can you provide more information on the calling context? If this is true, we have a serious problem and this should be a blocker for 0.17.0. > parquet::arrow::FileReader cannot read multiple RowGroup > > > Key: ARROW-8138 > URL: https://issues.apache.org/jira/browse/ARROW-8138 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.16.0 > Environment: Centos 7 >Reporter: Feng Tian >Priority: Major > > When using parquet::arrow::FileReader to read a parquet file consisting of multiple > row groups, > {code:c++} > reader->RowGroup(i)->Column(c)->Read > {code} > it will repeatedly read the data of the first row group. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8165) [Packaging] Make nightly wheels available
Krisztian Szucs created ARROW-8165: -- Summary: [Packaging] Make nightly wheels available Key: ARROW-8165 URL: https://issues.apache.org/jira/browse/ARROW-8165 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8142) [C++] Casting a chunked array with 0 chunks critical failure
[ https://issues.apache.org/jira/browse/ARROW-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8142: -- Labels: pull-request-available (was: ) > [C++] Casting a chunked array with 0 chunks critical failure > > > Key: ARROW-8142 > URL: https://issues.apache.org/jira/browse/ARROW-8142 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Florian Jetter >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > > When casting a schema of an empty table from dict encoded to non-dict encoded > type a critical error is raised and not handled causing the interpreter to > shut down. > This only happens after a parquet roundtrip > > {code:python} > import pyarrow as pa > import pandas as pd > import pyarrow.parquet as pq > df = pd.DataFrame({"col": ["a"]}).astype({"col": "category"}).iloc[:0] > table = pa.Table.from_pandas(df) > field = table.schema[0] > new_field = pa.field(field.name, field.type.value_type, field.nullable, > field.metadata) > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > reader = pa.BufferReader(buf.getvalue().to_pybytes()) > table = pq.read_table(reader) > schema = table.schema.remove(0).insert(0, new_field) > new_table = table.cast(schema) > assert new_table.schema == schema > {code} > > Output > {code:java} > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0318 09:55:14.266649 299722176 table.cc:47] Check failed: (chunks.size()) > > (0) cannot construct ChunkedArray from empty vector and omitted type {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns
[ https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-7480. - Resolution: Fixed Fixed by https://github.com/apache/arrow/pull/6625 > [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns > don't match the selected columns > > > Key: ARROW-7480 > URL: https://issues.apache.org/jira/browse/ARROW-7480 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Kyle McCarthy >Assignee: Andy Grove >Priority: Major > Fix For: 1.0.0 > > > There are two scenarios that cause problems but are related to the queries > with aggregate expressions and the SQL planner. The aggregate_test_100 > dataset is used for both of the queries. > At a high level, the issue is basically that queries containing aggregate > expressions may generate the wrong schema. > > *Scenario 1* > Columns are grouped by but not selected. > Query: > {code:java} > SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code} > Error: > {noformat} > ArrowError(InvalidArgumentError("number of columns must match number of > fields in schema")){noformat} > While the error is an ArrowError, it actually looks like it is because the > wrong schema is generated. In the src/sql/planner.rs file the impl for > SqlToRel is defined. In the sql_to_rel method, it checks if the query > contains aggregate expressions, and if it does it generates the schema from > the columns included in group expressions and aggregate expressions. 
> This in turn generates the following schema: > {code:java} > Schema { > fields: [ > Field { > name: "c1", > data_type: Utf8, > nullable: false, > }, > Field { > name: "c13", > data_type: Utf8, > nullable: false, > }, > Field { > name: "MIN", > data_type: Float64, > nullable: true, > }, > ], > metadata: {}, > }{code} > I am not super familiar with how DataFusion works under the hood, but I would > assume that this schema is actually correct for the Aggregate logical plan, > but isn't projecting the data correctly, thus resulting in the wrong query > result schema? > > *Scenario 2* > Columns are selected, but not grouped or part of an aggregate function. This > query actually will run, but the wrong schema is produced. > Query: > {code:java} > SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code} > Schema generated: > {code:java} > Schema { > fields: [ > Field { > name: "c0", > data_type: Utf8, > nullable: true, > }, > Field { > name: "c1", > data_type: Float64, > nullable: true, > }, > Field { > name: "c1", > data_type: Float64, > nullable: true, > }, > ], > metadata: {}, > } {code} > This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN, > Float64). > > > Schema 2 is questionable since some DBMS will run the query (e.g. MySQL) but > others (Postgres) will require that all the columns must be in the GROUP BY > to be used in an aggregate function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8142) [C++] Casting a chunked array with 0 chunks critical failure
[ https://issues.apache.org/jira/browse/ARROW-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-8142: --- Assignee: Ben Kietzman > [C++] Casting a chunked array with 0 chunks critical failure > > > Key: ARROW-8142 > URL: https://issues.apache.org/jira/browse/ARROW-8142 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Florian Jetter >Assignee: Ben Kietzman >Priority: Major > Fix For: 0.17.0 > > > When casting a schema of an empty table from dict encoded to non-dict encoded > type a critical error is raised and not handled causing the interpreter to > shut down. > This only happens after a parquet roundtrip > > {code:python} > import pyarrow as pa > import pandas as pd > import pyarrow.parquet as pq > df = pd.DataFrame({"col": ["a"]}).astype({"col": "category"}).iloc[:0] > table = pa.Table.from_pandas(df) > field = table.schema[0] > new_field = pa.field(field.name, field.type.value_type, field.nullable, > field.metadata) > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > reader = pa.BufferReader(buf.getvalue().to_pybytes()) > table = pq.read_table(reader) > schema = table.schema.remove(0).insert(0, new_field) > new_table = table.cast(schema) > assert new_table.schema == schema > {code} > > Output > {code:java} > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0318 09:55:14.266649 299722176 table.cc:47] Check failed: (chunks.size()) > > (0) cannot construct ChunkedArray from empty vector and omitted type {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8123) [Rust] [DataFusion] Create LogicalPlanBuilder
[ https://issues.apache.org/jira/browse/ARROW-8123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8123. --- Fix Version/s: (was: 1.0.0) 0.17.0 Resolution: Fixed Issue resolved by pull request 6625 [https://github.com/apache/arrow/pull/6625] > [Rust] [DataFusion] Create LogicalPlanBuilder > - > > Key: ARROW-8123 > URL: https://issues.apache.org/jira/browse/ARROW-8123 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Building logical plans is arduous and a builder would make this nicer. > Example: > {code:java} > let plan = LogicalPlanBuilder::new() > .scan( > "default", > "employee.csv", > _schema(), > Some(vec![0, 3]), > )? > .filter(col(1).eq(_str("CO")))? > .project(vec![col(0)])? > .build()?; {code} > Note that I am already working on this and will have a PR shortly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8159) [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype
[ https://issues.apache.org/jira/browse/ARROW-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-8159. -- Resolution: Fixed Issue resolved by pull request 6665 [https://github.com/apache/arrow/pull/6665] > [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype > -- > > Key: ARROW-8159 > URL: https://issues.apache.org/jira/browse/ARROW-8159 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7824) [C++][Dataset] Provide Dataset writing to IPC format
[ https://issues.apache.org/jira/browse/ARROW-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-7824. --- Resolution: Fixed Issue resolved by pull request 6449 [https://github.com/apache/arrow/pull/6449] > [C++][Dataset] Provide Dataset writing to IPC format > > > Key: ARROW-7824 > URL: https://issues.apache.org/jira/browse/ARROW-7824 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 3h > Remaining Estimate: 0h > > Begin with writing to IPC format since it is simpler than parquet and to > efficiently support the "locally cached extract" workflow. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8158) [Java] Getting length of data buffer and base variable width vector
[ https://issues.apache.org/jira/browse/ARROW-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062725#comment-17062725 ] Micah Kornfield commented on ARROW-8158: [~tianchen92] The issue is there isn't a clear way to get the length of an individual VarChar or Bytes element (one needs to go through the holder or access the offsets buffer directly). A similar issue exists for lists. > [Java] Getting length of data buffer and base variable width vector > --- > > Key: ARROW-8158 > URL: https://issues.apache.org/jira/browse/ARROW-8158 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Gaurangi Saxena >Assignee: Ji Liu >Priority: Minor > > For string data buffer and base variable width vector can we have a way to > get length of the data? > For instance, in ArrowColumnVector in StringAccessor we use > stringResult.start and stringResult.end, instead we would like to get length > of the data through an exposed function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8142) [C++] Casting a chunked array with 0 chunks critical failure
[ https://issues.apache.org/jira/browse/ARROW-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8142: - Summary: [C++] Casting a chunked array with 0 chunks critical failure (was: [Python/C++] Casting empty table from after parquet roundtrip causes critical failure) > [C++] Casting a chunked array with 0 chunks critical failure > > > Key: ARROW-8142 > URL: https://issues.apache.org/jira/browse/ARROW-8142 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Florian Jetter >Priority: Major > Fix For: 0.17.0 > > > When casting a schema of an empty table from dict encoded to non-dict encoded > type a critical error is raised and not handled causing the interpreter to > shut down. > This only happens after a parquet roundtrip > > {code:python} > import pyarrow as pa > import pandas as pd > import pyarrow.parquet as pq > df = pd.DataFrame({"col": ["a"]}).astype({"col": "category"}).iloc[:0] > table = pa.Table.from_pandas(df) > field = table.schema[0] > new_field = pa.field(field.name, field.type.value_type, field.nullable, > field.metadata) > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > reader = pa.BufferReader(buf.getvalue().to_pybytes()) > table = pq.read_table(reader) > schema = table.schema.remove(0).insert(0, new_field) > new_table = table.cast(schema) > assert new_table.schema == schema > {code} > > Output > {code:java} > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0318 09:55:14.266649 299722176 table.cc:47] Check failed: (chunks.size()) > > (0) cannot construct ChunkedArray from empty vector and omitted type {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
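The check that fires here is easy to model without pyarrow: a chunked array infers its element type from its first chunk, so with zero chunks the type must be supplied explicitly, and a cast must propagate its target type to the (possibly empty) result. A minimal Python sketch of that invariant (toy names, not the Arrow implementation):

```python
class Chunk:
    def __init__(self, values, type):
        self.values = values
        self.type = type

class ChunkedArray:
    def __init__(self, chunks, type=None):
        if type is None:
            if not chunks:
                # The check from table.cc: with zero chunks there is nothing
                # to infer the element type from.
                raise ValueError("cannot construct ChunkedArray from empty "
                                 "vector and omitted type")
            type = chunks[0].type
        self.chunks = chunks
        self.type = type

def cast_chunked(arr, target_type):
    # Passing the target type explicitly keeps the zero-chunk case valid,
    # which is what the repro above expects from casting an empty table.
    chunks = [Chunk(list(c.values), target_type) for c in arr.chunks]
    return ChunkedArray(chunks, type=target_type)
```

With this shape, casting a zero-chunk array succeeds because the cast carries its own type, while constructing a zero-chunk array with no type still fails loudly.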
[jira] [Commented] (ARROW-8142) [Python/C++] Casting empty table from after parquet roundtrip causes critical failure
[ https://issues.apache.org/jira/browse/ARROW-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062709#comment-17062709 ] Joris Van den Bossche commented on ARROW-8142: -- It's also not specific to dictionary, it fails for e.g. an int8 -> int16 cast as well. > [Python/C++] Casting empty table from after parquet roundtrip causes critical > failure > - > > Key: ARROW-8142 > URL: https://issues.apache.org/jira/browse/ARROW-8142 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Florian Jetter >Priority: Major > Fix For: 0.17.0 > > > When casting a schema of an empty table from dict encoded to non-dict encoded > type a critical error is raised and not handled causing the interpreter to > shut down. > This only happens after a parquet roundtrip > > {code:python} > import pyarrow as pa > import pandas as pd > import pyarrow.parquet as pq > df = pd.DataFrame({"col": ["a"]}).astype({"col": "category"}).iloc[:0] > table = pa.Table.from_pandas(df) > field = table.schema[0] > new_field = pa.field(field.name, field.type.value_type, field.nullable, > field.metadata) > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > reader = pa.BufferReader(buf.getvalue().to_pybytes()) > table = pq.read_table(reader) > schema = table.schema.remove(0).insert(0, new_field) > new_table = table.cast(schema) > assert new_table.schema == schema > {code} > > Output > {code:java} > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0318 09:55:14.266649 299722176 table.cc:47] Check failed: (chunks.size()) > > (0) cannot construct ChunkedArray from empty vector and omitted type {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8164) [C++][Dataset] Let datasets be viewable with non-identical schema
Ben Kietzman created ARROW-8164: --- Summary: [C++][Dataset] Let datasets be viewable with non-identical schema Key: ARROW-8164 URL: https://issues.apache.org/jira/browse/ARROW-8164 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Dataset Affects Versions: 0.16.0 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0 It would be useful to allow some schema unification capability after discovery has completed. For example, if a FileSystemDataset is being wrapped into a UnionDataset with another and their schemas are unifiable, then there is no reason we can't create the UnionDataset (rather than emitting an error because the schemas are not identical). I think this behavior will be most naturally expressed in C++ like so: {code} virtual Result<std::shared_ptr<Dataset>> Dataset::ReplaceSchema(std::shared_ptr<Schema> schema) const = 0; {code} which will raise an error if the provided schema is not unifiable with the current dataset schema. If this needs to be extended to non-trivial projections then this will probably warrant a separate class, {{ProjectedDataset}} or so. Definitely follow-up material (if desired) -- This message was sent by Atlassian Jira (v8.3.4#803005)
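The "unifiable" check described above can be sketched in a few lines. This is a hypothetical Python model, not the Arrow C++ implementation: schemas are name -> type mappings, and unification succeeds when every shared field agrees on type.

```python
def unify_schemas(left, right):
    """Merge two schemas (dicts of field name -> type name).

    Raises ValueError when the schemas disagree on a shared field,
    mirroring the error ReplaceSchema would report for a non-unifiable
    schema. Toy sketch: real schemas also carry nullability and metadata.
    """
    unified = dict(left)
    for name, typ in right.items():
        if unified.get(name, typ) != typ:
            raise ValueError(f"field {name!r}: {unified[name]} and {typ} "
                             "are not unifiable")
        unified[name] = typ
    return unified
```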
[jira] [Created] (ARROW-8163) [C++][Dataset] Allow FileSystemDataset's file list to be lazy
Ben Kietzman created ARROW-8163: --- Summary: [C++][Dataset] Allow FileSystemDataset's file list to be lazy Key: ARROW-8163 URL: https://issues.apache.org/jira/browse/ARROW-8163 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Dataset Affects Versions: 0.16.0 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0 A FileSystemDataset currently requires a full listing of files it contains on construction, so a scan cannot start until all files in the dataset are discovered. Instead it would be ideal if a large dataset could be constructed with a lazy file listing so that scans can start immediately. -- This message was sent by Atlassian Jira (v8.3.4#803005)
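The lazy listing described here maps naturally onto a generator: the scan can consume the first path while the directory walk is still in progress, instead of waiting for the complete file list. A Python sketch under assumed names (not the Arrow dataset API):

```python
import os

def discover_files(root):
    # Yield paths one at a time; nothing below `root` is listed until the
    # consumer asks for it, so a scan can start on the first discovered file.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

A scan would then pull from `discover_files(root)` incrementally; the eager equivalent is `list(discover_files(root))`, which is exactly the up-front full listing the issue wants to avoid.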
[jira] [Updated] (ARROW-8162) [Format][Python] Add serialization for CSF sparse tensors
[ https://issues.apache.org/jira/browse/ARROW-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8162: -- Labels: pull-request-available (was: ) > [Format][Python] Add serialization for CSF sparse tensors > - > > Key: ARROW-8162 > URL: https://issues.apache.org/jira/browse/ARROW-8162 > Project: Apache Arrow > Issue Type: Improvement > Components: Format, Python >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > Once [ARROW-7428|https://issues.apache.org/jira/browse/ARROW-7428] is > complete serialization for CSF sparse tensors should be enabled in Python too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8162) [Format][Python] Add serialization for CSF sparse tensors
[ https://issues.apache.org/jira/browse/ARROW-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc updated ARROW-8162: -- Description: Once [ARROW-7428|https://issues.apache.org/jira/browse/ARROW-7428] is complete serialization for CSF sparse tensors should be enabled in Python too. (was: Once [#ARROW-7428] is complete serialization for CSF sparse tensors should be enabled in Python too.) > [Format][Python] Add serialization for CSF sparse tensors > - > > Key: ARROW-8162 > URL: https://issues.apache.org/jira/browse/ARROW-8162 > Project: Apache Arrow > Issue Type: Improvement > Components: Format, Python >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Minor > Fix For: 1.0.0 > > > Once [ARROW-7428|https://issues.apache.org/jira/browse/ARROW-7428] is > complete serialization for CSF sparse tensors should be enabled in Python too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8162) [Format][Python] Add serialization for CSF sparse tensors
Rok Mihevc created ARROW-8162: - Summary: [Format][Python] Add serialization for CSF sparse tensors Key: ARROW-8162 URL: https://issues.apache.org/jira/browse/ARROW-8162 Project: Apache Arrow Issue Type: Improvement Components: Format, Python Reporter: Rok Mihevc Assignee: Rok Mihevc Fix For: 1.0.0 Once [#ARROW-7428] is complete serialization for CSF sparse tensors should be enabled in Python too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8161) [C++][Gandiva] Consolidate the data generation code for benchmark tests in gandiva into arrow/testing
Projjal Chanda created ARROW-8161: - Summary: [C++][Gandiva] Consolidate the data generation code for benchmark tests in gandiva into arrow/testing Key: ARROW-8161 URL: https://issues.apache.org/jira/browse/ARROW-8161 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Projjal Chanda Assignee: Projjal Chanda -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently
[ https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-7966. --- Fix Version/s: 0.17.0 Resolution: Fixed Issue resolved by pull request 6662 [https://github.com/apache/arrow/pull/6662] > [Integration][Flight][C++] Client should verify each batch independently > > > Key: ARROW-7966 > URL: https://issues.apache.org/jira/browse/ARROW-7966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC, Integration >Reporter: Bryan Cutler >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Currently the C++ Flight test client in {{test_integration_client.cc}} reads > all batches from JSON into a Table, reads all batches in the flight stream > from the server into a Table, then compares the Tables for equality. This is > potentially a problem because a record batch might have specific information > that is then lost in the conversion to a Table. For example, if the server > sends empty batches, the resulting Table would not be different from one with > no empty batches. > Instead, the client should check each record batch from the JSON file against > each record batch from the server independently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
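The empty-batch example can be made concrete with a toy model (plain Python lists standing in for record batches; not the Flight API): concatenating batches into a table erases batch boundaries, so only a per-batch comparison can detect a missing empty batch.

```python
def tables_equal(batches_a, batches_b):
    # Table-level check: flatten all batches into one sequence of rows.
    # An empty batch vanishes in the flattening, so this check misses it.
    flatten = lambda batches: [row for batch in batches for row in batch]
    return flatten(batches_a) == flatten(batches_b)

def batches_equal(batches_a, batches_b):
    # Batch-by-batch check: the batch count and each batch's contents must
    # both match, so an empty batch sent by the server is no longer invisible.
    return len(batches_a) == len(batches_b) and all(
        a == b for a, b in zip(batches_a, batches_b))
```

With `json_batches = [[1, 2], [3]]` and `server_batches = [[1, 2], [], [3]]`, `tables_equal` passes while `batches_equal` correctly flags the difference, which is the behavior the resolved issue asks the integration client to adopt.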
[jira] [Created] (ARROW-8160) [FlightRPC][C++] DoPutPayloadWriter doesn't always expose server error message
David Li created ARROW-8160: --- Summary: [FlightRPC][C++] DoPutPayloadWriter doesn't always expose server error message Key: ARROW-8160 URL: https://issues.apache.org/jira/browse/ARROW-8160 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Affects Versions: 0.16.0 Reporter: David Li {noformat} C:/projects/arrow/cpp/src/arrow/flight/flight_test.cc(1261): error: Value of: status.message() Expected: has substring "Invalid token" Actual: "Could not write record batch to stream: " [ FAILED ] TestBasicAuthHandler.FailUnauthenticatedCalls (17 ms) {noformat} This happens because {{Close()}} calls {{RecordBatchPayloadWriter::Close()}}, which calls {{CheckStarted}}, which in turn tries to write data. If the data gets flushed and the server responds in time, we'll see a failure during writing, causing us to never check the server status (which is the last part of {{DoPutPayloadWriter::Close}}). We need to reliably check and expose the gRPC status. -- This message was sent by Atlassian Jira (v8.3.4#803005)
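The control flow at fault can be sketched abstractly (hypothetical Python, not the C++ classes): when the trailing write in Close() fails, the writer must still fetch the transport's final status instead of returning the generic local error.

```python
class DoPutWriter:
    """Toy model of the Close() path: `server_status` stands in for the
    detailed gRPC status the server has already produced."""

    def __init__(self, server_status=None):
        self.server_status = server_status  # e.g. "Invalid token"

    def _write_pending(self):
        if self.server_status is not None:
            # The local stream only knows the write failed, not why.
            raise IOError("Could not write record batch to stream")

    def close(self):
        try:
            self._write_pending()
        except IOError as local_error:
            # The buggy flow surfaced `local_error` directly; the fix is to
            # check the transport status and prefer it when present.
            if self.server_status is not None:
                raise IOError(self.server_status) from local_error
            raise
```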
[jira] [Resolved] (ARROW-7927) [C++] Fix 'cpu_info.cc' compilation warning
[ https://issues.apache.org/jira/browse/ARROW-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-7927. - Resolution: Fixed Issue resolved by pull request 6610 [https://github.com/apache/arrow/pull/6610] > [C++] Fix 'cpu_info.cc' compilation warning > --- > > Key: ARROW-7927 > URL: https://issues.apache.org/jira/browse/ARROW-7927 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yuqi Gu >Assignee: Yuqi Gu >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Cpu_info compilation warning: > {code:java} > [100/424] Building CXX object > src/arrow/CMakeFiles/arrow_objlib.dir/util/cpu_info.cc.o > ../src/arrow/util/cpu_info.cc:79:16: warning: ‘int64_t > GetArm64CacheSize(const char*, int64_t)’ defined but not used > [-Wunused-function] > static int64_t GetArm64CacheSize(const char* filename, int64_t default_size > = -1) { > ^ > ../src/arrow/util/cpu_info.cc:77:20: warning: ‘kL3CacheSizeFile’ defined but > not used [-Wunused-variable] > static const char* kL3CacheSizeFile = > "/sys/devices/system/cpu/cpu0/cache/index3/size"; > ^~~~ > ../src/arrow/util/cpu_info.cc:76:20: warning: ‘kL2CacheSizeFile’ defined but > not used [-Wunused-variable] > static const char* kL2CacheSizeFile = > "/sys/devices/system/cpu/cpu0/cache/index2/size"; > ^~~~ > ../src/arrow/util/cpu_info.cc:75:20: warning: ‘kL1CacheSizeFile’ defined but > not used [-Wunused-variable] > static const char* kL1CacheSizeFile = > "/sys/devices/system/cpu/cpu0/cache/index0/size"; > ^~~~ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format
[ https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062505#comment-17062505 ] Francois Saint-Jacques commented on ARROW-7854: --- Which granularity would you like to see? A user can still create another LocalFilesystem without mmap. > [C++][Dataset] Option to memory map when reading IPC format > --- > > Key: ARROW-7854 > URL: https://issues.apache.org/jira/browse/ARROW-7854 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Francois Saint-Jacques >Priority: Major > > For the IPC format it would be interesting to be able to memory map the IPC > files? > cc [~fsaintjacques] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8146) [C++] Add per-filesystem facility to sanitize a path
[ https://issues.apache.org/jira/browse/ARROW-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-8146. --- Fix Version/s: 0.17.0 Resolution: Fixed Issue resolved by pull request 6657 [https://github.com/apache/arrow/pull/6657] > [C++] Add per-filesystem facility to sanitize a path > > > Key: ARROW-8146 > URL: https://issues.apache.org/jira/browse/ARROW-8146 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8158) [Java] Getting length of data buffer and base variable width vector
[ https://issues.apache.org/jira/browse/ARROW-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu reassigned ARROW-8158: - Assignee: Ji Liu > [Java] Getting length of data buffer and base variable width vector > --- > > Key: ARROW-8158 > URL: https://issues.apache.org/jira/browse/ARROW-8158 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Gaurangi Saxena >Assignee: Ji Liu >Priority: Minor > > For string data buffer and base variable width vector can we have a way to > get length of the data? > For instance, in ArrowColumnVector in StringAccessor we use > stringResult.start and stringResult.end, instead we would like to get length > of the data through an exposed function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8158) [Java] Getting length of data buffer and base variable width vector
[ https://issues.apache.org/jira/browse/ARROW-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062479#comment-17062479 ] Ji Liu commented on ARROW-8158: --- Hi, I think one could get valid data length by BaseVariableWidthVector#sizeOfValueBuffer. [https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java#L582] > [Java] Getting length of data buffer and base variable width vector > --- > > Key: ARROW-8158 > URL: https://issues.apache.org/jira/browse/ARROW-8158 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Gaurangi Saxena >Priority: Minor > > For string data buffer and base variable width vector can we have a way to > get length of the data? > For instance, in ArrowColumnVector in StringAccessor we use > stringResult.start and stringResult.end, instead we would like to get length > of the data through an exposed function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
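For readers following along, the offsets arithmetic behind both suggestions is tiny. A Python sketch of the computation (the Java vector performs the same reads against its offsets buffer; the function names here are illustrative):

```python
def value_length(offsets, index):
    # Length of element `index` of a variable-width vector: the distance
    # between consecutive entries of the offsets buffer.
    return offsets[index + 1] - offsets[index]

def size_of_value_buffer(offsets, value_count):
    # Total bytes occupied by the first `value_count` elements, i.e. the
    # quantity BaseVariableWidthVector#sizeOfValueBuffer reports.
    return offsets[value_count] - offsets[0]
```

For the strings ["ab", "", "cde"] the offsets buffer is [0, 2, 2, 5]: element 1 has length 0 and the value buffer holds 5 bytes in total.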
[jira] [Resolved] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas
[ https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-7365. --- Resolution: Fixed Issue resolved by pull request 6663 [https://github.com/apache/arrow/pull/6663] > [Python] Support FixedSizeList type in conversion to numpy/pandas > - > > Key: ARROW-7365 > URL: https://issues.apache.org/jira/browse/ARROW-7365 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Follow-up on ARROW-7261, still need to add support for FixedSizeListType in > the arrow -> python conversion (arrow_to_pandas.cc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8159) [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype
[ https://issues.apache.org/jira/browse/ARROW-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8159: -- Labels: pull-request-available (was: ) > [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype > -- > > Key: ARROW-8159 > URL: https://issues.apache.org/jira/browse/ARROW-8159 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8159) [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype
Uwe Korn created ARROW-8159: --- Summary: [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype Key: ARROW-8159 URL: https://issues.apache.org/jira/browse/ARROW-8159 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Uwe Korn Assignee: Uwe Korn Fix For: 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8158) [Java] Getting length of data buffer and base variable width vector
[ https://issues.apache.org/jira/browse/ARROW-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8158: - Summary: [Java] Getting length of data buffer and base variable width vector (was: Getting length of data buffer and base variable width vector) > [Java] Getting length of data buffer and base variable width vector > --- > > Key: ARROW-8158 > URL: https://issues.apache.org/jira/browse/ARROW-8158 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Gaurangi Saxena >Priority: Minor > > For string data buffer and base variable width vector can we have a way to > get length of the data? > For instance, in ArrowColumnVector in StringAccessor we use > stringResult.start and stringResult.end, instead we would like to get length > of the data through an exposed function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7857) [Python] Failing test with pandas master for extension type conversion
[ https://issues.apache.org/jira/browse/ARROW-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7857: -- Labels: pull-request-available (was: ) > [Python] Failing test with pandas master for extension type conversion > -- > > Key: ARROW-7857 > URL: https://issues.apache.org/jira/browse/ARROW-7857 > Project: Apache Arrow > Issue Type: Test > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > > The pandas master test build has one failure > {code} > ___ test_conversion_extensiontype_to_extensionarray > > monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcd6c580bd0> > def test_conversion_extensiontype_to_extensionarray(monkeypatch): > # converting extension type to linked pandas ExtensionDtype/Array > import pandas.core.internals as _int > > storage = pa.array([1, 2, 3, 4], pa.int64()) > arr = pa.ExtensionArray.from_storage(MyCustomIntegerType(), storage) > table = pa.table({'a': arr}) > > if LooseVersion(pd.__version__) < "0.26.0.dev": > # ensure pandas Int64Dtype has the protocol method (for older > pandas) > monkeypatch.setattr( > pd.Int64Dtype, '__from_arrow__', _Int64Dtype__from_arrow__, > raising=False) > > # extension type points to Int64Dtype, which knows how to create a > # pandas ExtensionArray > > result = table.to_pandas() > opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_pandas.py:3560: > > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > pyarrow/ipc.pxi:559: in pyarrow.lib.read_message > ??? > pyarrow/table.pxi:1369: in pyarrow.lib.Table._to_pandas > ??? 
> opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:764: > in table_to_blockmanager > blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes) > opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: > in _table_to_blocks > for item in result] > opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: > in > for item in result] > opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:723: > in _reconstruct_block > pd_ext_arr = pandas_dtype.__from_arrow__(arr) > opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/arrays/integer.py:108: > in __from_arrow__ > array = array.cast(pyarrow_type) > pyarrow/table.pxi:240: in pyarrow.lib.ChunkedArray.cast > ??? > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > > ??? > E pyarrow.lib.ArrowNotImplementedError: No cast implemented from > extension to int64 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)