[jira] [Updated] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8166:
--
Labels: pull-request-available  (was: )

> [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
> 
>
> Key: ARROW-8166
> URL: https://issues.apache.org/jira/browse/ARROW-8166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> cc [~frank.du]
> I have an i9-9960X AVX512-capable CPU but I see
> {code}
> /usr/bin/ccache /usr/bin/clang++-8  -DARROW_EXTRA_ERROR_CONTEXT -DARROW_HDFS 
> -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_NO_DEPRECATED_API 
> -DARROW_USE_SIMD -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 
> -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_ZLIB 
> -DARROW_WITH_ZSTD -DURI_STATIC_BUILD -Isrc -I../src -I../src/generated 
> -isystem ../thirdparty/flatbuffers/include -isystem 
> /home/wesm/cpp-toolchain/include -isystem jemalloc_ep-prefix/src -isystem 
> ../thirdparty/hadoop/include -Qunused-arguments -fcolor-diagnostics 
> -fuse-ld=gold -ggdb -O0  -Wall -Wextra -Wdocumentation -Wno-missing-braces 
> -Wno-unused-parameter -Wno-unknown-warning-option 
> -Wno-constant-logical-operand -Werror -Wno-unknown-warning-option 
> -march=skylake-avx512 -maltivec -fno-omit-frame-pointer -g -fPIE   -pthread 
> -std=gnu++11 -MD -MT 
> src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -MF 
> src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o.d -o 
> src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -c 
> ../src/arrow/util/rle_encoding_test.cc
> In file included from ../src/arrow/util/rle_encoding_test.cc:33:
> In file included from ../src/arrow/util/bit_stream_utils.h:28:
> ../src/arrow/util/bpacking.h:49:5: error: use of undeclared identifier 
> '__m512i_u'
>   *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
> ^
> ../src/arrow/util/bpacking.h:49:15: error: expected expression
>   *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
>   ^
> ../src/arrow/util/bpacking.h:55:5: error: use of undeclared identifier 
> '__m512i_u'
>   *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
> ^
> ../src/arrow/util/bpacking.h:55:15: error: expected expression
>   *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
>   ^
> 4 errors generated.
> {code}
> I tried compiling with gcc 8.3 instead of clang-8 and it worked. So it seems 
> that, because the base gcc toolchain on Ubuntu 18.04 is gcc 7.x, the 
> clang-* versions are using the gcc-7 toolchain headers. Evidently we will 
> need to make this more robust.
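A minimal sketch of a portable alternative to the failing line (an assumption on my part, not necessarily the fix Arrow adopted): instead of dereferencing a cast to `__m512i_u`, a typedef that lives inside gcc's avx512fintrin.h and that clang's bundled headers may not expose, the documented `*_storeu_*` intrinsics express the same unaligned store under both compilers. Shown here at SSE width so it runs on any x86-64; the AVX512 analogue is `_mm512_storeu_si512`:

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; baseline on x86-64

// Sketch (assumed, not the actual Arrow patch): funnel unaligned vector
// stores through the documented storeu intrinsic instead of the
// gcc-internal __m512i_u typedef, which clang's headers may lack.
// The AVX512-width equivalent would be _mm512_storeu_si512(out, v).
void store_unaligned_u32x4(uint32_t* out, __m128i v) {
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

Because both gcc and clang document `_mm_storeu_si128` and `_mm512_storeu_si512`, the store no longer depends on which toolchain's intrinsics headers clang happens to select.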



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063069#comment-17063069
 ] 

Wes McKinney commented on ARROW-8166:
-

OK, you have reproduced it.



[jira] [Assigned] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread Frank Du (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Du reassigned ARROW-8166:
---

Assignee: Frank Du



[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread Frank Du (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063049#comment-17063049
 ] 

Frank Du commented on ARROW-8166:
-

Reproduced it by passing clang as the compiler:

 -DCMAKE_C_COMPILER=clang-8 \ 

-DCMAKE_CXX_COMPILER=clang++-8 \


/mnt/arrow/cpp/src/arrow/util/bpacking.h:49:5: error: use of undeclared 
identifier '__m512i_u'
 *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);



[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread Frank Du (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063046#comment-17063046
 ] 

Frank Du commented on ARROW-8166:
-

It seems I'm still using gcc for the build; how can I switch to clang? Sorry, 
I'm not very familiar with this part.

cd /mnt/arrow/cpp/build/src/arrow/util && /usr/bin/c++ -DARROW_JEMALLOC 
-DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_USE_SIMD -DARROW_WITH_SNAPPY 
-DARROW_WITH_TIMING_TESTS -DGTEST_LINKED_AS_SHARED_LIBRARY=1 
-DURI_STATIC_BUILD -isystem /mnt/arrow/cpp/thirdparty/flatbuffers/include 
-isystem /mnt/arrow/cpp/build/boost_ep-prefix/src/boost_ep 
-isystem /mnt/arrow/cpp/build/snappy_ep/src/snappy_ep-install/include 
-isystem /mnt/arrow/cpp/build/gflags_ep-prefix/src/gflags_ep/include 
-isystem /mnt/arrow/cpp/build/thrift_ep-install/include 
-isystem /mnt/arrow/cpp/build/protobuf_ep-install/include 
-isystem /mnt/arrow/cpp/build/jemalloc_ep-prefix/src 
-isystem /mnt/arrow/cpp/build/googletest_ep-prefix/src/googletest_ep/include 
-isystem /mnt/arrow/cpp/build/gbenchmark_ep/src/gbenchmark_ep-install/include 
-isystem /mnt/arrow/cpp/build/rapidjson_ep/src/rapidjson_ep-install/include 
-isystem /mnt/arrow/cpp/build/re2_ep-install/include 
-isystem /mnt/arrow/cpp/thirdparty/hadoop/include -I/mnt/arrow/cpp/build/src 
-I/mnt/arrow/cpp/src -I/mnt/arrow/cpp/src/generated -Wno-noexcept-type 
-fdiagnostics-color=always -O3 -DNDEBUG -Wall -march=skylake-avx512 -O3 
-DNDEBUG -fPIE -pthread -std=gnu++11 
-o CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o 
-c /mnt/arrow/cpp/src/arrow/util/rle_encoding_test.cc
[ 49%] Linking CXX executable ../../../release/arrow-utility-test



[jira] [Updated] (ARROW-8169) [Java] Improve the performance of JDBC adapter by allocating memory proactively

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8169:
--
Labels: pull-request-available  (was: )

> [Java] Improve the performance of JDBC adapter by allocating memory 
> proactively
> ---
>
> Key: ARROW-8169
> URL: https://issues.apache.org/jira/browse/ARROW-8169
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> The current implementation uses {{setSafe}} methods to dynamically allocate 
> memory if necessary. For fixed-width vectors (which are frequently used in 
> JDBC), however, we can allocate memory proactively, since the vector size is 
> known as a configuration parameter. So for fixed-width vectors, we can use 
> {{set}} methods instead.
> This change brings two benefits:
> 1. When processing each value, we no longer have to check the vector capacity 
> and reallocate memory if needed, which improves performance.
> 2. If we let the memory expand automatically (doubling each time), the 
> amount of memory usually ends up being more than necessary. By allocating 
> memory according to the configuration parameter, we allocate exactly what is needed. 
> Benchmark results show notable performance improvements:
> Before:
> Benchmark   Mode  CntScore   Error  Units
> JdbcAdapterBenchmarks.consumeBenchmark  avgt5  521.700 ± 4.837  us/op
> After:
> Benchmark   Mode  CntScore   Error  Units
> JdbcAdapterBenchmarks.consumeBenchmark  avgt5  430.523 ± 9.932  us/op
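The trade-off described above can be sketched in C++ (illustrative only; the names below are hypothetical and not the Arrow Java API): pre-sizing a buffer to the known row count allocates once and skips the per-value capacity check, whereas grow-on-demand doubles capacity and can overshoot.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of proactive allocation (hypothetical helper, not
// Arrow code): with the row count known up front, reserve() allocates
// exactly once, so the fill loop never takes the reallocation path.
// This is the same idea as preferring set() over setSafe() for
// fixed-width vectors whose size is a known configuration parameter.
std::vector<int32_t> fill_preallocated(std::size_t known_rows) {
  std::vector<int32_t> values;
  values.reserve(known_rows);  // single up-front allocation
  for (std::size_t i = 0; i < known_rows; ++i) {
    values.push_back(static_cast<int32_t>(i));  // no capacity growth here
  }
  return values;
}
```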





[jira] [Created] (ARROW-8169) [Java] Improve the performance of JDBC adapter by allocating memory proactively

2020-03-19 Thread Liya Fan (Jira)
Liya Fan created ARROW-8169:
---

 Summary: [Java] Improve the performance of JDBC adapter by 
allocating memory proactively
 Key: ARROW-8169
 URL: https://issues.apache.org/jira/browse/ARROW-8169
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current implementation uses {{setSafe}} methods to dynamically allocate 
memory if necessary. For fixed-width vectors (which are frequently used in 
JDBC), however, we can allocate memory proactively, since the vector size is 
known as a configuration parameter. So for fixed-width vectors, we can use 
{{set}} methods instead.

This change brings two benefits:
1. When processing each value, we no longer have to check the vector capacity 
and reallocate memory if needed, which improves performance.
2. If we let the memory expand automatically (doubling each time), the amount 
of memory usually ends up being more than necessary. By allocating memory 
according to the configuration parameter, we allocate exactly what is needed. 

Benchmark results show notable performance improvements:

Before:

Benchmark   Mode  CntScore   Error  Units
JdbcAdapterBenchmarks.consumeBenchmark  avgt5  521.700 ± 4.837  us/op

After:

Benchmark   Mode  CntScore   Error  Units
JdbcAdapterBenchmarks.consumeBenchmark  avgt5  430.523 ± 9.932  us/op





[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063044#comment-17063044
 ] 

Wes McKinney commented on ARROW-8166:
-

I'll investigate some more and see if I can boil down what is different on my 
system.



[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread Frank Du (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063041#comment-17063041
 ] 

Frank Du commented on ARROW-8166:
-

root@9735cf0f4203:/usr# grep __m512i_u * -R
lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h:typedef long long __m512i_u __attribute__ ((__vector_size__ (64), __may_alias__, __aligned__ (1)));
lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h: return *(__m512i_u *)__P;
lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h: *(__m512i_u *)__P = __A;
lib/gcc/x86_64-linux-gnu/7.5.0/include/avx512fintrin.h:typedef long long __m512i_u __attribute__ ((__vector_size__ (64), __may_alias__, __aligned__ (1)));
lib/gcc/x86_64-linux-gnu/7.5.0/include/avx512fintrin.h: return *(__m512i_u *)__P;
lib/gcc/x86_64-linux-gnu/7.5.0/include/avx512fintrin.h: *(__m512i_u *)__P = __A;



[jira] [Commented] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread Frank Du (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063040#comment-17063040
 ] 

Frank Du commented on ARROW-8166:
-

I tried a quick build in a Docker container with the ubuntu:18.04 image, and 
the build succeeded. Below are the commands:

sudo docker run -it -v /home/pnp/arrow/:/mnt ubuntu:18.04
apt-get update

apt-get install llvm-8 cmake build-essential clang-8 autoconf libboost-dev 
libboost-filesystem-dev libboost-system-dev libboost-regex-dev libjemalloc-dev

cmake -DARROW_WITH_SNAPPY=ON \
    -DARROW_GANDIVA=ON \
    -DARROW_PARQUET=ON \
    -DARROW_BUILD_TESTS=ON \
    -DARROW_BUILD_BENCHMARKS=ON \
    -DARROW_SIMD_LEVEL=AVX512 \
    ..

make -j16

 

And below is the version info for GCC and clang:

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 
7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs 
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr 
--with-gcc-major-version-only --program-suffix=-7 
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id 
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix 
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-libstdcxx-time=yes 
--with-default-libstdcxx-abi=new --enable-gnu-unique-object 
--disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie 
--with-system-zlib --with-target-system-zlib --enable-objc-gc=auto 
--enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic 
--enable-offload-targets=nvptx-none --without-cuda-driver 
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu 
--target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

 

clang-8 -v
clang version 8.0.0-3~ubuntu18.04.2 (tags/RELEASE_800/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/8
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/8
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0
Candidate multilib: .;@m64
Selected multilib: .;@m64

 


[jira] [Updated] (ARROW-8138) [C++] parquet::arrow::FileReader cannot read multiple RowGroup

2020-03-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8138:

Summary: [C++] parquet::arrow::FileReader cannot read multiple RowGroup  
(was: parquet::arrow::FileReader cannot read multiple RowGroup)

> [C++] parquet::arrow::FileReader cannot read multiple RowGroup
> --
>
> Key: ARROW-8138
> URL: https://issues.apache.org/jira/browse/ARROW-8138
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0
> Environment: Centos 7
>Reporter: Feng Tian
>Priority: Major
> Attachments: bug.cpp, bug.parquet
>
>
> When using parquet::arrow::FileReader to read a Parquet file consisting of 
> multiple row groups,
> {code:c++}
> reader->RowGroup(i)->Column(c)->Read
> {code}
> it will repeatedly read data from the first row group.
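The bug pattern can be modeled in a few lines of plain Python (names are illustrative, not the parquet::arrow API): a correct reader must advance to row group i's starting offset, while the reported behavior keeps re-reading group 0:

```python
def row_group_slice(group_sizes, i):
    """Return (start, stop) row offsets for row group i."""
    start = sum(group_sizes[:i])
    return start, start + group_sizes[i]

data = list(range(10))   # a pretend column with 10 rows
sizes = [4, 3, 3]        # three row groups

def read_buggy(i):
    # Models the reported bug: the offset never advances past group 0.
    s, e = row_group_slice(sizes, 0)
    return data[s:e]

def read_ok(i):
    # Correct behavior: each group's slice starts where the previous ended.
    s, e = row_group_slice(sizes, i)
    return data[s:e]

assert read_buggy(1) == [0, 1, 2, 3]   # repeats group 0's rows
assert read_ok(1) == [4, 5, 6]
```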



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8138) [C++] parquet::arrow::FileReader cannot read multiple RowGroup

2020-03-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8138:

Fix Version/s: 0.17.0

> [C++] parquet::arrow::FileReader cannot read multiple RowGroup
> --
>
> Key: ARROW-8138
> URL: https://issues.apache.org/jira/browse/ARROW-8138
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0
> Environment: Centos 7
>Reporter: Feng Tian
>Priority: Major
> Fix For: 0.17.0
>
> Attachments: bug.cpp, bug.parquet
>
>
> When using parquet::arrow::FileReader to read a Parquet file consisting of 
> multiple row groups,
> {code:c++}
> reader->RowGroup(i)->Column(c)->Read
> {code}
> it will repeatedly read data from the first row group.





[jira] [Created] (ARROW-8168) Improve Java Plasma client off-heap memory usage

2020-03-19 Thread KunshangJi (Jira)
KunshangJi created ARROW-8168:
-

 Summary: Improve Java Plasma client off-heap memory usage
 Key: ARROW-8168
 URL: https://issues.apache.org/jira/browse/ARROW-8168
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: KunshangJi
 Fix For: 0.17.0


Currently, the Plasma Java client API uses byte[], which requires copying memory 
from the Java heap to off-heap storage (the mmap'd file). We can improve the 
create() and get() methods to return a ByteBuffer or DirectByteBuffer and avoid 
the unnecessary memory copy.
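As a rough illustration of the proposed change (plain Python standing in for the Java client; none of these names are Plasma API), compare a copying read with a zero-copy view over an mmap'd region:

```python
import mmap

def demo():
    # Anonymous mapping standing in for the Plasma store's mmap'd file.
    with mmap.mmap(-1, 16) as region:
        region[:5] = b"hello"
        copied = bytes(region[:5])      # byte[]-style path: extra heap copy
        view = memoryview(region)[:5]   # DirectByteBuffer-style path: no copy
        try:
            return copied, view.tobytes()
        finally:
            view.release()              # release before the mapping closes

copied, viewed = demo()
assert copied == viewed == b"hello"
```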





[jira] [Commented] (ARROW-8152) [C++] IO: split large coalesced reads into smaller ones

2020-03-19 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062941#comment-17062941
 ] 

David Li commented on ARROW-8152:
-

Yes, having an options struct for those parameters (and potentially others, 
e.g. if we want an AsyncContext) makes sense to me.

> [C++] IO: split large coalesced reads into smaller ones
> ---
>
> Key: ARROW-8152
> URL: https://issues.apache.org/jira/browse/ARROW-8152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Priority: Major
>
> We have a facility to coalesce small reads, but remote filesystems may also 
> benefit from splitting large reads to take advantage of concurrency.
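The splitting side can be sketched as a simple range partitioner (a hypothetical helper, not Arrow's API): one large (offset, length) read becomes consecutive sub-ranges capped at a maximum size, which can then be issued concurrently.

```python
def split_read(offset, length, max_size):
    """Split one large read into consecutive ranges of at most max_size bytes."""
    ranges = []
    while length > 0:
        n = min(length, max_size)
        ranges.append((offset, n))
        offset += n
        length -= n
    return ranges

# A 10-byte read capped at 4 bytes per request yields three sub-reads.
assert split_read(0, 10, 4) == [(0, 4), (4, 4), (8, 2)]
```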





[jira] [Commented] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup

2020-03-19 Thread Feng Tian (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062935#comment-17062935
 ] 

Feng Tian commented on ARROW-8138:
--

I attached a quick repro – bug.parquet is a data file with multiple row groups, 
where each row is an int, float pair. bug.cpp should reproduce the issue.

As a side note – I generally followed the cpp examples, but it seems none of 
the parquet examples cover the case of multiple row groups.

> parquet::arrow::FileReader cannot read multiple RowGroup
> 
>
> Key: ARROW-8138
> URL: https://issues.apache.org/jira/browse/ARROW-8138
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0
> Environment: Centos 7
>Reporter: Feng Tian
>Priority: Major
> Attachments: bug.cpp, bug.parquet
>
>
> When using parquet::arrow::FileReader to read a Parquet file consisting of 
> multiple row groups,
> {code:c++}
> reader->RowGroup(i)->Column(c)->Read
> {code}
> it will repeatedly read data from the first row group.





[jira] [Updated] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup

2020-03-19 Thread Feng Tian (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Tian updated ARROW-8138:
-
Attachment: bug.cpp

> parquet::arrow::FileReader cannot read multiple RowGroup
> 
>
> Key: ARROW-8138
> URL: https://issues.apache.org/jira/browse/ARROW-8138
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0
> Environment: Centos 7
>Reporter: Feng Tian
>Priority: Major
> Attachments: bug.cpp, bug.parquet
>
>
> When using parquet::arrow::FileReader to read a Parquet file consisting of 
> multiple row groups,
> {code:c++}
> reader->RowGroup(i)->Column(c)->Read
> {code}
> it will repeatedly read data from the first row group.





[jira] [Updated] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup

2020-03-19 Thread Feng Tian (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Tian updated ARROW-8138:
-
Attachment: bug.parquet

> parquet::arrow::FileReader cannot read multiple RowGroup
> 
>
> Key: ARROW-8138
> URL: https://issues.apache.org/jira/browse/ARROW-8138
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0
> Environment: Centos 7
>Reporter: Feng Tian
>Priority: Major
> Attachments: bug.cpp, bug.parquet
>
>
> When using parquet::arrow::FileReader to read a Parquet file consisting of 
> multiple row groups,
> {code:c++}
> reader->RowGroup(i)->Column(c)->Read
> {code}
> it will repeatedly read data from the first row group.





[jira] [Updated] (ARROW-8167) [CI] Add support for skipping builds with skip pattern in pull request title

2020-03-19 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8167:
---
Summary: [CI] Add support for skipping builds with skip pattern in pull 
request title  (was: [CI] Add support for skipping builds via commit messages)

> [CI] Add support for skipping builds with skip pattern in pull request title
> 
>
> Key: ARROW-8167
> URL: https://issues.apache.org/jira/browse/ARROW-8167
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> GitHub Actions doesn't support skipping builds marked as [skip ci] by default.
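A minimal sketch of such a title check (the exact pattern Arrow's workflows use may differ):

```python
import re

# Hypothetical skip-pattern check against a pull request title; the real CI
# configuration may use a different pattern or a different field.
SKIP_PATTERN = re.compile(r"\[skip\s+ci\]|\[ci\s+skip\]", re.IGNORECASE)

def should_skip(pr_title):
    """Return True if the PR title asks CI to skip the build."""
    return bool(SKIP_PATTERN.search(pr_title))

assert should_skip("ARROW-8167: [CI] Tweak docs [skip ci]")
assert not should_skip("ARROW-8167: [CI] Add skip support")
```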





[jira] [Assigned] (ARROW-8118) [R] dim method for FileSystemDataset

2020-03-19 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-8118:
---

Assignee: Sam Albers

> [R] dim method for FileSystemDataset
> 
>
> Key: ARROW-8118
> URL: https://issues.apache.org/jira/browse/ARROW-8118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Sam Albers
>Assignee: Sam Albers
>Priority: Minor
>  Labels: features, pull-request-available
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> I've been using this function enough that I wonder a) if it would be useful in 
> the package and b) whether this is something you think is worth working on. The 
> basic problem is that if you have a hierarchical file structure that 
> accommodates using open_dataset, it is definitely useful to know the amount 
> of data you are dealing with. My idea is that 'FileSystemDataset' would have 
> dim, nrow and ncol methods. Here is how I've been using it:
> {code:java}
> library(arrow)
> x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))
> dim_arrow <- function(x) {
>  rows <- sum(purrr::map_dbl(x$files, 
> ~ParquetFileReader$create(.x)$ReadTable()$num_rows))
>  cols <- x$schema$num_fields
>  
>  c(rows, cols)
> }
> dim_arrow(x)
> #> [1] 426929 7
> {code}
>  
> Ideally this would work on arrow_dplyr_query objects as well but I haven't 
> quite figured out how that filters based on the partitioning variables.





[jira] [Resolved] (ARROW-8118) [R] dim method for FileSystemDataset

2020-03-19 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-8118.
-
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6635
[https://github.com/apache/arrow/pull/6635]

> [R] dim method for FileSystemDataset
> 
>
> Key: ARROW-8118
> URL: https://issues.apache.org/jira/browse/ARROW-8118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Sam Albers
>Assignee: Sam Albers
>Priority: Minor
>  Labels: features, pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> I've been using this function enough that I wonder a) if it would be useful in 
> the package and b) whether this is something you think is worth working on. The 
> basic problem is that if you have a hierarchical file structure that 
> accommodates using open_dataset, it is definitely useful to know the amount 
> of data you are dealing with. My idea is that 'FileSystemDataset' would have 
> dim, nrow and ncol methods. Here is how I've been using it:
> {code:java}
> library(arrow)
> x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))
> dim_arrow <- function(x) {
>  rows <- sum(purrr::map_dbl(x$files, 
> ~ParquetFileReader$create(.x)$ReadTable()$num_rows))
>  cols <- x$schema$num_fields
>  
>  c(rows, cols)
> }
> dim_arrow(x)
> #> [1] 426929 7
> {code}
>  
> Ideally this would work on arrow_dplyr_query objects as well but I haven't 
> quite figured out how that filters based on the partitioning variables.





[jira] [Updated] (ARROW-8167) [CI] Add support for skipping builds via commit messages

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8167:
--
Labels: pull-request-available  (was: )

> [CI] Add support for skipping builds via commit messages
> 
>
> Key: ARROW-8167
> URL: https://issues.apache.org/jira/browse/ARROW-8167
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>
> GitHub Actions doesn't support skipping builds marked as [skip ci] by default.





[jira] [Created] (ARROW-8167) [CI] Add support for skipping builds via commit messages

2020-03-19 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8167:
--

 Summary: [CI] Add support for skipping builds via commit messages
 Key: ARROW-8167
 URL: https://issues.apache.org/jira/browse/ARROW-8167
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


GitHub Actions doesn't support skipping builds marked as [skip ci] by default.





[jira] [Commented] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format

2020-03-19 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062844#comment-17062844
 ] 

Wes McKinney commented on ARROW-7854:
-

Well, it seems like this detail should perhaps not be so visible to users. If 
an interface prefers memory mapping when it's available, then it can do so 
without leaking this configuration detail into some other part of the system.

> [C++][Dataset] Option to memory map when reading IPC format
> ---
>
> Key: ARROW-7854
> URL: https://issues.apache.org/jira/browse/ARROW-7854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> For the IPC format it would be interesting to be able to memory map the IPC 
> files?
> cc [~fsaintjacques] [~bkietz]





[jira] [Created] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8166:
---

 Summary: [C++] AVX512 intrinsics fail to compile with clang-8 on 
Ubuntu 18.04
 Key: ARROW-8166
 URL: https://issues.apache.org/jira/browse/ARROW-8166
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


cc [~frank.du]

I have an i9-9960X AVX512-capable process but I see

{code}
/usr/bin/ccache /usr/bin/clang++-8  -DARROW_EXTRA_ERROR_CONTEXT -DARROW_HDFS 
-DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_NO_DEPRECATED_API 
-DARROW_USE_SIMD -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 
-DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_ZLIB 
-DARROW_WITH_ZSTD -DURI_STATIC_BUILD -Isrc -I../src -I../src/generated -isystem 
../thirdparty/flatbuffers/include -isystem /home/wesm/cpp-toolchain/include 
-isystem jemalloc_ep-prefix/src -isystem ../thirdparty/hadoop/include 
-Qunused-arguments -fcolor-diagnostics -fuse-ld=gold -ggdb -O0  -Wall -Wextra 
-Wdocumentation -Wno-missing-braces -Wno-unused-parameter 
-Wno-unknown-warning-option -Wno-constant-logical-operand -Werror 
-Wno-unknown-warning-option -march=skylake-avx512 -maltivec 
-fno-omit-frame-pointer -g -fPIE   -pthread -std=gnu++11 -MD -MT 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -MF 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o.d -o 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -c 
../src/arrow/util/rle_encoding_test.cc
In file included from ../src/arrow/util/rle_encoding_test.cc:33:
In file included from ../src/arrow/util/bit_stream_utils.h:28:
../src/arrow/util/bpacking.h:49:5: error: use of undeclared identifier 
'__m512i_u'
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
^
../src/arrow/util/bpacking.h:49:15: error: expected expression
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
  ^
../src/arrow/util/bpacking.h:55:5: error: use of undeclared identifier 
'__m512i_u'
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
^
../src/arrow/util/bpacking.h:55:15: error: expected expression
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
  ^
4 errors generated.
{code}

I tried compiling with gcc 8.3 instead of clang-8 and it worked. Since the base 
gcc toolchain on Ubuntu 18.04 is gcc 7.x, it seems the clang-* versions are 
using the gcc-7 toolchain headers. Evidently we will need to make this more 
robust.





[jira] [Updated] (ARROW-8166) [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04

2020-03-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8166:

Description: 
cc [~frank.du]

I have an i9-9960X AVX512-capable CPU but I see

{code}
/usr/bin/ccache /usr/bin/clang++-8  -DARROW_EXTRA_ERROR_CONTEXT -DARROW_HDFS 
-DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_NO_DEPRECATED_API 
-DARROW_USE_SIMD -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 
-DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_ZLIB 
-DARROW_WITH_ZSTD -DURI_STATIC_BUILD -Isrc -I../src -I../src/generated -isystem 
../thirdparty/flatbuffers/include -isystem /home/wesm/cpp-toolchain/include 
-isystem jemalloc_ep-prefix/src -isystem ../thirdparty/hadoop/include 
-Qunused-arguments -fcolor-diagnostics -fuse-ld=gold -ggdb -O0  -Wall -Wextra 
-Wdocumentation -Wno-missing-braces -Wno-unused-parameter 
-Wno-unknown-warning-option -Wno-constant-logical-operand -Werror 
-Wno-unknown-warning-option -march=skylake-avx512 -maltivec 
-fno-omit-frame-pointer -g -fPIE   -pthread -std=gnu++11 -MD -MT 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -MF 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o.d -o 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -c 
../src/arrow/util/rle_encoding_test.cc
In file included from ../src/arrow/util/rle_encoding_test.cc:33:
In file included from ../src/arrow/util/bit_stream_utils.h:28:
../src/arrow/util/bpacking.h:49:5: error: use of undeclared identifier 
'__m512i_u'
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
^
../src/arrow/util/bpacking.h:49:15: error: expected expression
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
  ^
../src/arrow/util/bpacking.h:55:5: error: use of undeclared identifier 
'__m512i_u'
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
^
../src/arrow/util/bpacking.h:55:15: error: expected expression
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
  ^
4 errors generated.
{code}

I tried compiling with gcc 8.3 instead of clang-8 and it worked. Since the base 
gcc toolchain on Ubuntu 18.04 is gcc 7.x, it seems the clang-* versions are 
using the gcc-7 toolchain headers. Evidently we will need to make this more 
robust.

  was:
cc [~frank.du]

I have an i9-9960X AVX512-capable process but I see

{code}
/usr/bin/ccache /usr/bin/clang++-8  -DARROW_EXTRA_ERROR_CONTEXT -DARROW_HDFS 
-DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_NO_DEPRECATED_API 
-DARROW_USE_SIMD -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 
-DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_ZLIB 
-DARROW_WITH_ZSTD -DURI_STATIC_BUILD -Isrc -I../src -I../src/generated -isystem 
../thirdparty/flatbuffers/include -isystem /home/wesm/cpp-toolchain/include 
-isystem jemalloc_ep-prefix/src -isystem ../thirdparty/hadoop/include 
-Qunused-arguments -fcolor-diagnostics -fuse-ld=gold -ggdb -O0  -Wall -Wextra 
-Wdocumentation -Wno-missing-braces -Wno-unused-parameter 
-Wno-unknown-warning-option -Wno-constant-logical-operand -Werror 
-Wno-unknown-warning-option -march=skylake-avx512 -maltivec 
-fno-omit-frame-pointer -g -fPIE   -pthread -std=gnu++11 -MD -MT 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -MF 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o.d -o 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/rle_encoding_test.cc.o -c 
../src/arrow/util/rle_encoding_test.cc
In file included from ../src/arrow/util/rle_encoding_test.cc:33:
In file included from ../src/arrow/util/bit_stream_utils.h:28:
../src/arrow/util/bpacking.h:49:5: error: use of undeclared identifier 
'__m512i_u'
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
^
../src/arrow/util/bpacking.h:49:15: error: expected expression
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
  ^
../src/arrow/util/bpacking.h:55:5: error: use of undeclared identifier 
'__m512i_u'
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
^
../src/arrow/util/bpacking.h:55:15: error: expected expression
  *(__m512i_u*)out = _mm512_and_epi32(_mm512_srlv_epi32(inls, shifts), masks);
  ^
4 errors generated.
{code}

I tried compiling with gcc 8.3 instead of clang-8 and it worked. So it seems 
that because the base gcc toolchain on Ubuntu 18.04 is gcc 7.x that the clang-* 
versions are using the gcc-7 toolchain headers. Evidently we will need to make 
this more robust


> [C++] AVX512 intrinsics fail to compile with clang-8 on Ubuntu 18.04
> 
>
> Key: ARROW-8166
> URL: https://issues.apache.org/jira/browse/ARROW-8166
> 

[jira] [Updated] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8061:
--
Labels: pull-request-available  (was: )

> [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support 
> row groups)
> -
>
> Key: ARROW-8061
> URL: https://issues.apache.org/jira/browse/ARROW-8061
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>
> Specifically for parquet (not sure if it will be relevant for other file 
> formats as well; for IPC/feather, potentially the record batch), it would be 
> useful to target row groups instead of files as fragments.
> Quoting the original design documents: _"In datasets consisting of many 
> fragments, the dataset API must expose the granularity of fragments in a 
> public way to enable parallel processing, if desired. "._   
> And a comment from Wes on that: _"a single Parquet file can "export" one or 
> more fragments based on settings. The default might be to split fragments 
> based on row group"_
> Currently, the level on which fragments are defined (at least in the typical 
> partitioned parquet dataset) is "1 file == 1 fragment".
> Would it be possible or desirable to make this more fine grained, where you 
> could also opt to have a fragment per row group?   
> We could have a ParquetFragment that has this option, and a ParquetFileFormat 
> specific option to say what the granularity of a fragment is (file vs row 
> group)?
> cc [~fsaintjacques] [~bkietz]





[jira] [Updated] (ARROW-8165) [Packaging] Make nightly wheels available on a PyPI server

2020-03-19 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8165:
---
Summary: [Packaging] Make nightly wheels available on a PyPI server  (was: 
[Packaging] Make nightly wheels available)

> [Packaging] Make nightly wheels available on a PyPI server
> --
>
> Key: ARROW-8165
> URL: https://issues.apache.org/jira/browse/ARROW-8165
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>






[jira] [Updated] (ARROW-8165) [Packaging] Make nightly wheels available on a PyPI server

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8165:
--
Labels: pull-request-available  (was: )

> [Packaging] Make nightly wheels available on a PyPI server
> --
>
> Key: ARROW-8165
> URL: https://issues.apache.org/jira/browse/ARROW-8165
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup

2020-03-19 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062835#comment-17062835
 ] 

Francois Saint-Jacques commented on ARROW-8138:
---

Can you provide more information on the calling context? If this is true, we 
have a serious problem and this should be a blocker for 0.17.0.

> parquet::arrow::FileReader cannot read multiple RowGroup
> 
>
> Key: ARROW-8138
> URL: https://issues.apache.org/jira/browse/ARROW-8138
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0
> Environment: Centos 7
>Reporter: Feng Tian
>Priority: Major
>
> When using parquet::arrow::FileReader to read a Parquet file consisting of 
> multiple row groups,
> {code:c++}
> reader->RowGroup(i)->Column(c)->Read
> {code}
> it will repeatedly read data from the first row group.





[jira] [Created] (ARROW-8165) [Packaging] Make nightly wheels available

2020-03-19 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8165:
--

 Summary: [Packaging] Make nightly wheels available
 Key: ARROW-8165
 URL: https://issues.apache.org/jira/browse/ARROW-8165
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs








[jira] [Updated] (ARROW-8142) [C++] Casting a chunked array with 0 chunks critical failure

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8142:
--
Labels: pull-request-available  (was: )

> [C++] Casting a chunked array with 0 chunks critical failure
> 
>
> Key: ARROW-8142
> URL: https://issues.apache.org/jira/browse/ARROW-8142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Florian Jetter
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> When casting the schema of an empty table from a dict-encoded to a 
> non-dict-encoded type, a critical error is raised and not handled, causing the 
> interpreter to shut down.
> This only happens after a Parquet roundtrip.
>  
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({"col": ["a"]}).astype({"col": "category"}).iloc[:0]
> table = pa.Table.from_pandas(df)
> field = table.schema[0]
> new_field = pa.field(field.name, field.type.value_type, field.nullable, 
> field.metadata)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> table = pq.read_table(reader)
> schema = table.schema.remove(0).insert(0, new_field)
> new_table = table.cast(schema)
> assert new_table.schema == schema
>  {code}
>  
> Output
> {code:java}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0318 09:55:14.266649 299722176 table.cc:47] Check failed: (chunks.size()) > 
> (0) cannot construct ChunkedArray from empty vector and omitted type {code}
>  
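The fatal check can be sketched in miniature: the failure mode is a constructor that cannot infer a type from zero chunks. A hypothetical guard (illustrative Python, not Arrow's C++ API) shows the recoverable alternative to aborting:

```python
# Toy model, not Arrow's API: constructing a chunked column from zero chunks
# should require an explicit type instead of hitting a fatal check.
class ChunkedColumn:
    def __init__(self, chunks, dtype=None):
        if not chunks and dtype is None:
            # arrow::ChunkedArray aborted here ("cannot construct ChunkedArray
            # from empty vector and omitted type"); raising is recoverable.
            raise ValueError("empty chunk list requires an explicit dtype")
        self.chunks = chunks
        self.dtype = dtype

empty = ChunkedColumn([], dtype="string")   # OK: type given explicitly
assert empty.dtype == "string"
```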





[jira] [Closed] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

2020-03-19 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-7480.
-
Resolution: Fixed

Fixed by https://github.com/apache/arrow/pull/6625

> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> 
>
> Key: ARROW-7480
> URL: https://issues.apache.org/jira/browse/ARROW-7480
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Kyle McCarthy
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> There are two scenarios that cause problems but are related to the queries 
> with aggregate expressions and the SQL planner. The aggregate_test_100 
> dataset is used for both of the queries. 
> At a high level, the issue is basically that queries containing aggregate 
> expressions may generate the wrong schema.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> Query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
> Error:
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of 
> fields in schema")){noformat}
> While the error is an ArrowError, it actually looks like it is because the 
> wrong schema is generated. In the src/sql/planner.rs file the impl for 
> SqlToRel is defined. In the sql_to_rel method, it checks if the query 
> contains aggregate expressions, and if it does it generates the schema from 
> the columns included in group expressions and aggregate expressions.
> This in turn generates the following schema:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c1",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "c13",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "MIN",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> }{code}
> I am not super familiar with how DataFusion works under the hood, but I would 
> assume that this schema is actually correct for the Aggregate logical plan, 
> but isn't projecting the data correctly thus resulting in the wrong query 
> result schema? 
>  
> *Scenario 2*
> Columns are selected, but not grouped or part of an aggregate function. This 
> query actually will run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c0",
> data_type: Utf8,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN, 
> Float64).
>  
> 
> Schema 2 is questionable since some DBMS will run the query (e.g. MySQL) but 
> others (e.g. Postgres) require that every selected column either appear in 
> the GROUP BY or be used in an aggregate function.
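The reported mismatch can be modeled in a few lines (a toy sketch, not DataFusion's planner): the planner derives the schema from the group-by plus aggregate expressions, while the result schema should follow the SELECT list.

```python
# Query: SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1
select = ["c1", "c13", "MIN(c12)"]
group_by = ["c1"]
aggregates = ["MIN(c12)"]

# Schema the planner derived: group-by columns followed by aggregates.
planner_fields = group_by + aggregates

# Schema the query result should expose: one field per SELECT expression.
expected_fields = select

assert planner_fields == ["c1", "MIN(c12)"]
assert planner_fields != expected_fields   # the reported mismatch
```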





[jira] [Assigned] (ARROW-8142) [C++] Casting a chunked array with 0 chunks critical failure

2020-03-19 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-8142:
---

Assignee: Ben Kietzman

> [C++] Casting a chunked array with 0 chunks critical failure
> 
>
> Key: ARROW-8142
> URL: https://issues.apache.org/jira/browse/ARROW-8142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Florian Jetter
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 0.17.0
>
>
> When casting the schema of an empty table from a dict-encoded to a 
> non-dict-encoded type, a critical error is raised and not handled, causing the 
> interpreter to shut down.
> This only happens after a Parquet roundtrip.
>  
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({"col": ["a"]}).astype({"col": "category"}).iloc[:0]
> table = pa.Table.from_pandas(df)
> field = table.schema[0]
> new_field = pa.field(field.name, field.type.value_type, field.nullable, 
> field.metadata)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> table = pq.read_table(reader)
> schema = table.schema.remove(0).insert(0, new_field)
> new_table = table.cast(schema)
> assert new_table.schema == schema
>  {code}
>  
> Output
> {code:java}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0318 09:55:14.266649 299722176 table.cc:47] Check failed: (chunks.size()) > 
> (0) cannot construct ChunkedArray from empty vector and omitted type {code}
>  
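The crash above comes from constructing a ChunkedArray from an empty chunk vector without a type. A minimal plain-Python sketch of the guard the cast path needs (this is illustrative logic, not the actual Arrow C++ code):

```python
# Sketch of the missing guard: a chunked array with zero chunks cannot infer
# its type from the chunks, so the target type must be carried explicitly
# instead of hitting the DCHECK and aborting the process.

def cast_chunked(chunks, source_type, target_type, cast_fn):
    """Cast every chunk; an empty chunk list yields an empty, typed result."""
    if not chunks:
        # This is the case the Check failure turned into a crash: return an
        # empty result with an explicit type rather than aborting.
        return {"type": target_type, "chunks": []}
    return {"type": target_type, "chunks": [cast_fn(c, target_type) for c in chunks]}

result = cast_chunked([], "dictionary<utf8>", "utf8", lambda c, t: c)
# result == {"type": "utf8", "chunks": []}
```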





[jira] [Resolved] (ARROW-8123) [Rust] [DataFusion] Create LogicalPlanBuilder

2020-03-19 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-8123.
---
Fix Version/s: (was: 1.0.0)
   0.17.0
   Resolution: Fixed

Issue resolved by pull request 6625
[https://github.com/apache/arrow/pull/6625]

> [Rust] [DataFusion] Create LogicalPlanBuilder
> -
>
> Key: ARROW-8123
> URL: https://issues.apache.org/jira/browse/ARROW-8123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Building logical plans is arduous and a builder would make this nicer. 
> Example:
> {code:java}
> let plan = LogicalPlanBuilder::new()
> .scan(
> "default",
> "employee.csv",
> _schema(),
> Some(vec![0, 3]),
> )?
> .filter(col(1).eq(_str("CO")))?
> .project(vec![col(0)])?
> .build()?; {code}
> Note that I am already working on this and will have a PR shortly.





[jira] [Resolved] (ARROW-8159) [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype

2020-03-19 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-8159.
--
Resolution: Fixed

Issue resolved by pull request 6665
[https://github.com/apache/arrow/pull/6665]

> [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype
> --
>
> Key: ARROW-8159
> URL: https://issues.apache.org/jira/browse/ARROW-8159
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-7824) [C++][Dataset] Provide Dataset writing to IPC format

2020-03-19 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-7824.
---
Resolution: Fixed

Issue resolved by pull request 6449
[https://github.com/apache/arrow/pull/6449]

> [C++][Dataset] Provide Dataset writing to IPC format
> 
>
> Key: ARROW-7824
> URL: https://issues.apache.org/jira/browse/ARROW-7824
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Begin with writing to the IPC format, since it is simpler than Parquet, and to 
> efficiently support the "locally cached extract" workflow.





[jira] [Commented] (ARROW-8158) [Java] Getting length of data buffer and base variable width vector

2020-03-19 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062725#comment-17062725
 ] 

Micah Kornfield commented on ARROW-8158:


[~tianchen92] The issue is that there isn't a clear way to get the length of an 
individual VarChar or Bytes element (one needs to go through the holder or 
access the offsets buffer directly). A similar issue exists for lists.
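"Accessing the offsets buffer directly" amounts to the following arithmetic: element i of a variable-width vector spans [offsets[i], offsets[i+1]), so its length is the difference of adjacent offsets. A sketch with illustrative names, not the actual Java API:

```python
# Per-element length from a variable-width vector's offsets buffer.
# offsets has (value_count + 1) entries; element i occupies the byte range
# [offsets[i], offsets[i + 1]) in the data buffer.

def element_length(offsets, i):
    return offsets[i + 1] - offsets[i]

# Offsets for the values ["ab", "", "cde"]:
offsets = [0, 2, 2, 5]
lengths = [element_length(offsets, i) for i in range(len(offsets) - 1)]
# lengths == [2, 0, 3]
```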

> [Java] Getting length of data buffer and base variable width vector
> ---
>
> Key: ARROW-8158
> URL: https://issues.apache.org/jira/browse/ARROW-8158
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Gaurangi Saxena
>Assignee: Ji Liu
>Priority: Minor
>
> For a string data buffer and a base variable-width vector, can we have a way to 
> get the length of the data? 
> For instance, in ArrowColumnVector's StringAccessor we use 
> stringResult.start and stringResult.end; instead, we would like to get the 
> length of the data through an exposed function.





[jira] [Updated] (ARROW-8142) [C++] Casting a chunked array with 0 chunks critical failure

2020-03-19 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8142:
-
Summary: [C++] Casting a chunked array with 0 chunks critical failure  
(was: [Python/C++] Casting empty table from after parquet roundtrip causes 
critical failure)

> [C++] Casting a chunked array with 0 chunks critical failure
> 
>
> Key: ARROW-8142
> URL: https://issues.apache.org/jira/browse/ARROW-8142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Florian Jetter
>Priority: Major
> Fix For: 0.17.0
>
>
> When casting the schema of an empty table from a dictionary-encoded type to a 
> non-dictionary-encoded type, a critical error is raised and left unhandled, 
> causing the interpreter to shut down.
> This only happens after a Parquet roundtrip.
>  
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({"col": ["a"]}).astype({"col": "category"}).iloc[:0]
> table = pa.Table.from_pandas(df)
> field = table.schema[0]
> new_field = pa.field(field.name, field.type.value_type, field.nullable, 
> field.metadata)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> table = pq.read_table(reader)
> schema = table.schema.remove(0).insert(0, new_field)
> new_table = table.cast(schema)
> assert new_table.schema == schema
>  {code}
>  
> Output
> {code:java}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0318 09:55:14.266649 299722176 table.cc:47] Check failed: (chunks.size()) > 
> (0) cannot construct ChunkedArray from empty vector and omitted type {code}
>  





[jira] [Commented] (ARROW-8142) [Python/C++] Casting empty table from after parquet roundtrip causes critical failure

2020-03-19 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062709#comment-17062709
 ] 

Joris Van den Bossche commented on ARROW-8142:
--

It's also not specific to dictionary encoding; it fails for e.g. an int8 -> 
int16 cast as well.

> [Python/C++] Casting empty table from after parquet roundtrip causes critical 
> failure
> -
>
> Key: ARROW-8142
> URL: https://issues.apache.org/jira/browse/ARROW-8142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Florian Jetter
>Priority: Major
> Fix For: 0.17.0
>
>
> When casting the schema of an empty table from a dictionary-encoded type to a 
> non-dictionary-encoded type, a critical error is raised and left unhandled, 
> causing the interpreter to shut down.
> This only happens after a Parquet roundtrip.
>  
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({"col": ["a"]}).astype({"col": "category"}).iloc[:0]
> table = pa.Table.from_pandas(df)
> field = table.schema[0]
> new_field = pa.field(field.name, field.type.value_type, field.nullable, 
> field.metadata)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> table = pq.read_table(reader)
> schema = table.schema.remove(0).insert(0, new_field)
> new_table = table.cast(schema)
> assert new_table.schema == schema
>  {code}
>  
> Output
> {code:java}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0318 09:55:14.266649 299722176 table.cc:47] Check failed: (chunks.size()) > 
> (0) cannot construct ChunkedArray from empty vector and omitted type {code}
>  





[jira] [Created] (ARROW-8164) [C++][Dataset] Let datasets be viewable with non-identical schema

2020-03-19 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8164:
---

 Summary: [C++][Dataset] Let datasets be viewable with 
non-identical schema
 Key: ARROW-8164
 URL: https://issues.apache.org/jira/browse/ARROW-8164
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


It would be useful to allow some schema unification capability after discovery 
has completed. For example, if a FileSystemDataset is being wrapped into a 
UnionDataset with another dataset and their schemas are unifiable, then there is 
no reason we can't create the UnionDataset (rather than emitting an error 
because the schemas are not identical).

I think this behavior will be most naturally expressed in C++ like so:

{code}
virtual Result<std::shared_ptr<Dataset>> Dataset::ReplaceSchema(
    std::shared_ptr<Schema> schema) const = 0;
{code}

which will raise an error if the provided schema is not unifiable with the 
current dataset schema.
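The unifiability check ReplaceSchema implies can be sketched field-by-field: names must match, and each pair of types must either be equal or have one side be null (a field absent from some fragment). This is illustrative logic under those stated assumptions, not Arrow's actual unification rules:

```python
# Sketch of a field-wise schema unification check. Two fields unify when the
# names match and the types are equal, or one side is "null" (field missing
# in some fragment); otherwise ReplaceSchema would raise an error.

def unify_field(a, b):
    name_a, type_a = a
    name_b, type_b = b
    if name_a != name_b:
        return None
    if type_a == type_b:
        return a
    if type_a == "null":
        return b
    if type_b == "null":
        return a
    return None  # not unifiable

def unify_schema(s1, s2):
    if len(s1) != len(s2):
        return None
    out = [unify_field(a, b) for a, b in zip(s1, s2)]
    return None if None in out else out

unified = unify_schema([("x", "null"), ("y", "utf8")],
                       [("x", "int64"), ("y", "utf8")])
# unified == [("x", "int64"), ("y", "utf8")]
```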

If this needs to be extended to non-trivial projections then it will probably 
warrant a separate class, {{ProjectedDataset}} or so. Definitely follow-up 
material (if desired).





[jira] [Created] (ARROW-8163) [C++][Dataset] Allow FileSystemDataset's file list to be lazy

2020-03-19 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8163:
---

 Summary: [C++][Dataset] Allow FileSystemDataset's file list to be 
lazy
 Key: ARROW-8163
 URL: https://issues.apache.org/jira/browse/ARROW-8163
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


A FileSystemDataset currently requires a full listing of the files it contains 
at construction time, so a scan cannot start until all files in the dataset have 
been discovered. Instead, it would be ideal if a large dataset could be 
constructed with a lazy file listing so that scans can start immediately.
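The lazy-listing idea can be sketched with generators: the scanner pulls one path at a time, so the first file can be scanned before discovery finishes. Names here are illustrative, not the dataset API:

```python
# Sketch of lazy file listing: discovery is a generator, and the scanner
# consumes paths one at a time instead of waiting for the full listing.

def discover(dirs):
    for d in dirs:
        # In the real dataset this would be a (possibly slow) filesystem walk.
        for i in range(2):
            yield "{}/part-{}.arrow".format(d, i)

def scan(paths):
    for p in paths:  # pulls one path at a time; no upfront full listing
        yield "scanned " + p

# Scanning begins as soon as the first path is discovered:
first = next(scan(discover(["a", "b"])))
# first == "scanned a/part-0.arrow"
```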





[jira] [Updated] (ARROW-8162) [Format][Python] Add serialization for CSF sparse tensors

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8162:
--
Labels: pull-request-available  (was: )

> [Format][Python] Add serialization for CSF sparse tensors
> -
>
> Key: ARROW-8162
> URL: https://issues.apache.org/jira/browse/ARROW-8162
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format, Python
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Once [ARROW-7428|https://issues.apache.org/jira/browse/ARROW-7428] is 
> complete, serialization for CSF sparse tensors should be enabled in Python too.





[jira] [Updated] (ARROW-8162) [Format][Python] Add serialization for CSF sparse tensors

2020-03-19 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-8162:
--
Description: Once 
[ARROW-7428|https://issues.apache.org/jira/browse/ARROW-7428] is complete 
serialization for CSF sparse tensors should be enabled in Python too.  (was: 
Once [#ARROW-7428] is complete serialization for CSF sparse tensors should be 
enabled in Python too.)

> [Format][Python] Add serialization for CSF sparse tensors
> -
>
> Key: ARROW-8162
> URL: https://issues.apache.org/jira/browse/ARROW-8162
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format, Python
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Minor
> Fix For: 1.0.0
>
>
> Once [ARROW-7428|https://issues.apache.org/jira/browse/ARROW-7428] is 
> complete, serialization for CSF sparse tensors should be enabled in Python too.





[jira] [Created] (ARROW-8162) [Format][Python] Add serialization for CSF sparse tensors

2020-03-19 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-8162:
-

 Summary: [Format][Python] Add serialization for CSF sparse tensors
 Key: ARROW-8162
 URL: https://issues.apache.org/jira/browse/ARROW-8162
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format, Python
Reporter: Rok Mihevc
Assignee: Rok Mihevc
 Fix For: 1.0.0


Once [#ARROW-7428] is complete, serialization for CSF sparse tensors should be 
enabled in Python too.





[jira] [Created] (ARROW-8161) [C++][Gandiva] Consolidate the data generation code for benchmark tests in gandiva into arrow/testing

2020-03-19 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-8161:
-

 Summary: [C++][Gandiva] Consolidate the data generation code for 
benchmark tests in gandiva into arrow/testing
 Key: ARROW-8161
 URL: https://issues.apache.org/jira/browse/ARROW-8161
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Projjal Chanda
Assignee: Projjal Chanda








[jira] [Resolved] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently

2020-03-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7966.
---
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6662
[https://github.com/apache/arrow/pull/6662]

> [Integration][Flight][C++] Client should verify each batch independently
> 
>
> Key: ARROW-7966
> URL: https://issues.apache.org/jira/browse/ARROW-7966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Integration
>Reporter: Bryan Cutler
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently the C++ Flight test client in {{test_integration_client.cc}} reads 
> all batches from JSON into a Table, reads all batches in the flight stream 
> from the server into a Table, then compares the Tables for equality.  This is 
> potentially a problem because a record batch might have specific information 
> that is then lost in the conversion to a Table. For example, if the server 
> sends empty batches, the resulting Table would not be different from one with 
> no empty batches.
> Instead, the client should check each record batch from the JSON file against 
> each record batch from the server independently. 
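The proposed batch-by-batch check can be sketched as a pairwise comparison that also fails on mismatched batch counts, which is exactly what table-level comparison would hide (e.g. trailing empty batches). Illustrative stand-in for the integration client logic:

```python
from itertools import zip_longest

# Compare JSON batches and server batches pairwise, preserving batch
# boundaries. Concatenating both sides into tables first would make a stream
# with an extra empty batch compare equal; this check does not.

def batches_equal(json_batches, server_batches):
    for j, s in zip_longest(json_batches, server_batches):
        if j != s:  # also fails when one side has extra or missing batches
            return False
    return True

same = batches_equal([[1, 2], []], [[1, 2], []])       # equal streams
extra_empty = batches_equal([[1, 2], []], [[1, 2]])    # differs by an empty batch
```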





[jira] [Created] (ARROW-8160) [FlightRPC][C++] DoPutPayloadWriter doesn't always expose server error message

2020-03-19 Thread David Li (Jira)
David Li created ARROW-8160:
---

 Summary: [FlightRPC][C++] DoPutPayloadWriter doesn't always expose 
server error message
 Key: ARROW-8160
 URL: https://issues.apache.org/jira/browse/ARROW-8160
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Affects Versions: 0.16.0
Reporter: David Li


{noformat}
C:/projects/arrow/cpp/src/arrow/flight/flight_test.cc(1261): error: Value of: 
status.message()
Expected: has substring "Invalid token"
  Actual: "Could not write record batch to stream: "
[  FAILED  ] TestBasicAuthHandler.FailUnauthenticatedCalls (17 ms)
{noformat}

This happens because {{Close()}} calls {{RecordBatchPayloadWriter::Close()}}, 
which calls {{CheckStarted}}, which in turn tries to write data. If the data 
gets flushed and the server responds in time, we'll see a failure during 
writing, causing us to never check the server status (which is the last part of 
{{DoPutPayloadWriter::Close}}). We need to reliably check and expose the gRPC 
status.
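The fix direction described above can be sketched as an ordering rule in Close(): even when a write fails, still consult the final server status and prefer its message over the generic client-side write error. Function names here are illustrative, not the Flight C++ API:

```python
# Sketch: on Close(), a client-side write failure must not short-circuit the
# check of the server's final (gRPC) status, whose message (e.g. "Invalid
# token") is the one worth surfacing.

def close(write_remaining, get_server_status):
    write_error = None
    try:
        write_remaining()
    except IOError as e:
        write_error = e          # remember, but keep going
    server_status = get_server_status()  # always consult the server's status
    if server_status is not None:
        raise RuntimeError(server_status)
    if write_error is not None:
        raise write_error        # no server error: report the write failure
```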





[jira] [Resolved] (ARROW-7927) [C++] Fix 'cpu_info.cc' compilation warning

2020-03-19 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-7927.
-
Resolution: Fixed

Issue resolved by pull request 6610
[https://github.com/apache/arrow/pull/6610]

> [C++] Fix 'cpu_info.cc' compilation warning
> ---
>
> Key: ARROW-7927
> URL: https://issues.apache.org/jira/browse/ARROW-7927
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yuqi Gu
>Assignee: Yuqi Gu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Cpu_info compilation warning:
> {code:java}
> [100/424] Building CXX object 
> src/arrow/CMakeFiles/arrow_objlib.dir/util/cpu_info.cc.o
> ../src/arrow/util/cpu_info.cc:79:16: warning: ‘int64_t 
> GetArm64CacheSize(const char*, int64_t)’ defined but not used 
> [-Wunused-function]
>  static int64_t GetArm64CacheSize(const char* filename, int64_t default_size 
> = -1) {
> ^
> ../src/arrow/util/cpu_info.cc:77:20: warning: ‘kL3CacheSizeFile’ defined but 
> not used [-Wunused-variable]
>  static const char* kL3CacheSizeFile = 
> "/sys/devices/system/cpu/cpu0/cache/index3/size";
> ^~~~
> ../src/arrow/util/cpu_info.cc:76:20: warning: ‘kL2CacheSizeFile’ defined but 
> not used [-Wunused-variable]
>  static const char* kL2CacheSizeFile = 
> "/sys/devices/system/cpu/cpu0/cache/index2/size";
> ^~~~
> ../src/arrow/util/cpu_info.cc:75:20: warning: ‘kL1CacheSizeFile’ defined but 
> not used [-Wunused-variable]
>  static const char* kL1CacheSizeFile = 
> "/sys/devices/system/cpu/cpu0/cache/index0/size";
> ^~~~
> {code}





[jira] [Commented] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format

2020-03-19 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062505#comment-17062505
 ] 

Francois Saint-Jacques commented on ARROW-7854:
---

Which granularity would you like to see? A user can still create another 
LocalFilesystem without mmap.

> [C++][Dataset] Option to memory map when reading IPC format
> ---
>
> Key: ARROW-7854
> URL: https://issues.apache.org/jira/browse/ARROW-7854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> For the IPC format, it would be interesting to be able to memory-map the IPC 
> files.
> cc [~fsaintjacques] [~bkietz]





[jira] [Resolved] (ARROW-8146) [C++] Add per-filesystem facility to sanitize a path

2020-03-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-8146.
---
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6657
[https://github.com/apache/arrow/pull/6657]

> [C++] Add per-filesystem facility to sanitize a path
> 
>
> Key: ARROW-8146
> URL: https://issues.apache.org/jira/browse/ARROW-8146
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-8158) [Java] Getting length of data buffer and base variable width vector

2020-03-19 Thread Ji Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu reassigned ARROW-8158:
-

Assignee: Ji Liu

> [Java] Getting length of data buffer and base variable width vector
> ---
>
> Key: ARROW-8158
> URL: https://issues.apache.org/jira/browse/ARROW-8158
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Gaurangi Saxena
>Assignee: Ji Liu
>Priority: Minor
>
> For a string data buffer and a base variable-width vector, can we have a way to 
> get the length of the data? 
> For instance, in ArrowColumnVector's StringAccessor we use 
> stringResult.start and stringResult.end; instead, we would like to get the 
> length of the data through an exposed function.





[jira] [Commented] (ARROW-8158) [Java] Getting length of data buffer and base variable width vector

2020-03-19 Thread Ji Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062479#comment-17062479
 ] 

Ji Liu commented on ARROW-8158:
---

Hi, I think one could get the valid data length via 
BaseVariableWidthVector#sizeOfValueBuffer.

[https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java#L582]

> [Java] Getting length of data buffer and base variable width vector
> ---
>
> Key: ARROW-8158
> URL: https://issues.apache.org/jira/browse/ARROW-8158
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Gaurangi Saxena
>Priority: Minor
>
> For a string data buffer and a base variable-width vector, can we have a way to 
> get the length of the data? 
> For instance, in ArrowColumnVector's StringAccessor we use 
> stringResult.start and stringResult.end; instead, we would like to get the 
> length of the data through an exposed function.





[jira] [Resolved] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2020-03-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7365.
---
Resolution: Fixed

Issue resolved by pull request 6663
[https://github.com/apache/arrow/pull/6663]

> [Python] Support FixedSizeList type in conversion to numpy/pandas
> -
>
> Key: ARROW-7365
> URL: https://issues.apache.org/jira/browse/ARROW-7365
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Follow-up on ARROW-7261; we still need to add support for FixedSizeListType in 
> the arrow -> python conversion (arrow_to_pandas.cc).
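The conversion this issue adds relies on a property of FixedSizeList arrays: the child values are stored flat, so element i of a list of size k is values[i*k : (i+1)*k]. A plain-Python sketch of that slicing, not the arrow_to_pandas.cc implementation:

```python
# Sketch: a FixedSizeList array's flat child values buffer is sliced into
# fixed-length rows, which is the core of the numpy/pandas conversion.

def fixed_size_list_to_rows(values, list_size):
    assert len(values) % list_size == 0, "flat buffer must be a multiple of list_size"
    return [values[i:i + list_size] for i in range(0, len(values), list_size)]

rows = fixed_size_list_to_rows([1, 2, 3, 4, 5, 6], 3)
# rows == [[1, 2, 3], [4, 5, 6]]
```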





[jira] [Updated] (ARROW-8159) [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8159:
--
Labels: pull-request-available  (was: )

> [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype
> --
>
> Key: ARROW-8159
> URL: https://issues.apache.org/jira/browse/ARROW-8159
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>






[jira] [Created] (ARROW-8159) [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype

2020-03-19 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-8159:
---

 Summary: [Python] pyarrow.Schema.from_pandas doesn't support 
ExtensionDtype
 Key: ARROW-8159
 URL: https://issues.apache.org/jira/browse/ARROW-8159
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Uwe Korn
Assignee: Uwe Korn
 Fix For: 0.17.0








[jira] [Updated] (ARROW-8158) [Java] Getting length of data buffer and base variable width vector

2020-03-19 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8158:
-
Summary: [Java] Getting length of data buffer and base variable width 
vector  (was: Getting length of data buffer and base variable width vector)

> [Java] Getting length of data buffer and base variable width vector
> ---
>
> Key: ARROW-8158
> URL: https://issues.apache.org/jira/browse/ARROW-8158
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Gaurangi Saxena
>Priority: Minor
>
> For string data buffer and base variable width vector can we have a way to 
> get length of the data? 
> For instance, in ArrowColumnVector in StringAccessor we use 
> stringResult.start and stringResult.end, instead we would like to get length 
> of the data through an exposed function.





[jira] [Updated] (ARROW-7857) [Python] Failing test with pandas master for extension type conversion

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7857:
--
Labels: pull-request-available  (was: )

> [Python] Failing test with pandas master for extension type conversion
> --
>
> Key: ARROW-7857
> URL: https://issues.apache.org/jira/browse/ARROW-7857
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> The pandas master test build has one failure
> {code}
> ___ test_conversion_extensiontype_to_extensionarray 
> 
> monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcd6c580bd0>
> def test_conversion_extensiontype_to_extensionarray(monkeypatch):
> # converting extension type to linked pandas ExtensionDtype/Array
> import pandas.core.internals as _int
> 
> storage = pa.array([1, 2, 3, 4], pa.int64())
> arr = pa.ExtensionArray.from_storage(MyCustomIntegerType(), storage)
> table = pa.table({'a': arr})
> 
> if LooseVersion(pd.__version__) < "0.26.0.dev":
> # ensure pandas Int64Dtype has the protocol method (for older 
> pandas)
> monkeypatch.setattr(
> pd.Int64Dtype, '__from_arrow__', _Int64Dtype__from_arrow__,
> raising=False)
> 
> # extension type points to Int64Dtype, which knows how to create a
> # pandas ExtensionArray
> >   result = table.to_pandas()
> opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_pandas.py:3560:
>  
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> pyarrow/ipc.pxi:559: in pyarrow.lib.read_message
> ???
> pyarrow/table.pxi:1369: in pyarrow.lib.Table._to_pandas
> ???
> opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:764:
>  in table_to_blockmanager
> blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
> opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102:
>  in _table_to_blocks
> for item in result]
> opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102:
>  in 
> for item in result]
> opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:723:
>  in _reconstruct_block
> pd_ext_arr = pandas_dtype.__from_arrow__(arr)
> opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/arrays/integer.py:108:
>  in __from_arrow__
> array = array.cast(pyarrow_type)
> pyarrow/table.pxi:240: in pyarrow.lib.ChunkedArray.cast
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> >   ???
> E   pyarrow.lib.ArrowNotImplementedError: No cast implemented from 
> extension to int64
> {code}
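The protocol involved in this failure is pandas's `__from_arrow__` hook: the arrow-to-pandas conversion hands the dtype an Arrow array, and the traceback shows the dtype then casting an extension array to int64, which is unimplemented. A sketch with stand-in objects (the real participants are pandas's Int64Dtype and a pyarrow ExtensionArray); the fix direction shown, unwrapping to storage before conversion, is an assumption:

```python
# Stand-in for a pandas ExtensionDtype implementing the __from_arrow__
# protocol. A real dtype would build an IntegerArray from the Arrow array.

class FakeInt64Dtype:
    def __from_arrow__(self, arr):
        return list(arr)

def extension_to_pandas(ext_storage, pandas_dtype):
    # Hand __from_arrow__ the extension array's *storage* values, sidestepping
    # the unimplemented extension->int64 cast seen in the traceback.
    return pandas_dtype.__from_arrow__(ext_storage)

out = extension_to_pandas([1, 2, 3, 4], FakeInt64Dtype())
# out == [1, 2, 3, 4]
```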


