[jira] [Updated] (ARROW-5251) [C++][Parquet] Bad initialization in statistics computation

2019-05-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5251:
--
Component/s: parquet
 C++

> [C++][Parquet] Bad initialization in statistics computation
> ---
>
> Key: ARROW-5251
> URL: https://issues.apache.org/jira/browse/ARROW-5251
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, parquet
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The behavior of the following lines is undefined if the first element is null:
> https://github.com/apache/arrow/blob/250e97c70f497581bca412dfd2a654a1f9736064/cpp/src/parquet/statistics.cc#L159-L160



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5251) [C++][Parquet] Bad initialization in statistics computation

2019-05-02 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5251:
-

 Summary: [C++][Parquet] Bad initialization in statistics 
computation
 Key: ARROW-5251
 URL: https://issues.apache.org/jira/browse/ARROW-5251
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


The behavior of the following lines is undefined if the first element is null:

https://github.com/apache/arrow/blob/250e97c70f497581bca412dfd2a654a1f9736064/cpp/src/parquet/statistics.cc#L159-L160





[jira] [Created] (ARROW-5253) [C++] external Snappy fails on Alpine

2019-05-03 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5253:
-

 Summary: [C++] external Snappy fails on Alpine
 Key: ARROW-5253
 URL: https://issues.apache.org/jira/browse/ARROW-5253
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Francois Saint-Jacques
 Fix For: 0.14.0



{code:bash}
FAILED: debug/libarrow.so.14.0.0 
: && /usr/bin/c++ -fPIC -Wno-noexcept-type  -fdiagnostics-color=always -ggdb 
-O0  -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror 
-msse4.2  -g  
-Wl,--version-script=/buildbot/amd64-alpine-3_9-cpp/cpp/src/arrow/symbols.map 
-shared -Wl,-soname,libarrow.so.14 -o debug/libarrow.so.14.0.0 
...
c++: error: snappy_ep/src/snappy_ep-install/lib/libsnappy.a: No such file or 
directory
{code}






[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826967#comment-16826967
 ] 

Francois Saint-Jacques commented on ARROW-5130:
---

It's a component called crossbow; the gist of what you need is 
[here|https://github.com/apache/arrow/tree/master/dev/tasks/python-wheels].

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x in ?? ()
> (gdb) bt
> #0 0x in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=<optimized out>, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=<optimized out>) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 , 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=<optimized out>,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
> #12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at 
> dlopen.c:66
> #13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
> errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 
> , args=0x7ffc6d8bedd0) at dl-error.c:187
> #14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
> , args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
> #15 0x7f529ebf40c1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
> #16 0x00540859 in _PyImport_GetDynLoadFunc ()
> #17 0x0054024c in _PyImport_LoadDynamicModule ()
> #18 0x005f2bcb in ?? ()
> #19 0x004ca235 in PyEval_EvalFrameEx ()
> #20 0x004ca9c2 in PyEval_EvalFrameEx ()
> #21 0x004c8c39 in PyEval_EvalCodeEx ()
> #22 0x004c84e6 in PyEval_EvalCode ()
> #23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #24 0x004c3272 in ?? ()
> #25 0x004b19e2 in ?? ()
> #26 0x004b13d7 in ?? ()
> #27 0x004b42f6 in ?? ()
> #28 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #29 0x004ccdb3 in PyEval_EvalFrameEx ()
> #30 0x004c8c39 in PyEval_EvalCodeEx ()
> #31 0x004c84e6 in PyEval_EvalCode ()
> #32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #33 0x004c3272 in ?? ()
> #34 0x004b1d3f in ?? ()
> #35 0x004b6b2b in ?? ()
> #36 0x004b0d82 in ?? ()
> #37 0x004b42f6 in ?? ()
> #38 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
> It looks like the code changes that fixed the previous issue were recently 
> removed in 
> [https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].





[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826968#comment-16826968
 ] 

Francois Saint-Jacques commented on ARROW-5130:
---

You'll have to replicate 
https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/travis.linux.yml






[jira] [Commented] (ARROW-5214) [C++] Offline dependency downloader misses some libraries

2019-04-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826979#comment-16826979
 ] 

Francois Saint-Jacques commented on ARROW-5214:
---

The script exits silently, but with a non-zero error code; I'll fix this. 
The real issue is that this Snappy version no longer exists.

> [C++] Offline dependency downloader misses some libraries
> -
>
> Key: ARROW-5214
> URL: https://issues.apache.org/jira/browse/ARROW-5214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.14.0
>
>
> Not sure yet but maybe this was introduced by 
> https://github.com/apache/arrow/commit/f913d8f0adff71c288a10f6c1b0ad2d1ab3e9e32
> {code}
> $ thirdparty/download_dependencies.sh /home/wesm/arrow-thirdparty
> # Environment variables for offline Arrow build
> export ARROW_BOOST_URL=/home/wesm/arrow-thirdparty/boost-1.67.0.tar.gz
> export ARROW_BROTLI_URL=/home/wesm/arrow-thirdparty/brotli-v1.0.7.tar.gz
> export ARROW_CARES_URL=/home/wesm/arrow-thirdparty/cares-1.15.0.tar.gz
> export 
> ARROW_DOUBLE_CONVERSION_URL=/home/wesm/arrow-thirdparty/double-conversion-v3.1.4.tar.gz
> export 
> ARROW_FLATBUFFERS_URL=/home/wesm/arrow-thirdparty/flatbuffers-v1.10.0.tar.gz
> export 
> ARROW_GBENCHMARK_URL=/home/wesm/arrow-thirdparty/gbenchmark-v1.4.1.tar.gz
> export ARROW_GFLAGS_URL=/home/wesm/arrow-thirdparty/gflags-v2.2.0.tar.gz
> export ARROW_GLOG_URL=/home/wesm/arrow-thirdparty/glog-v0.3.5.tar.gz
> export ARROW_GRPC_URL=/home/wesm/arrow-thirdparty/grpc-v1.20.0.tar.gz
> export ARROW_GTEST_URL=/home/wesm/arrow-thirdparty/gtest-1.8.1.tar.gz
> export ARROW_LZ4_URL=/home/wesm/arrow-thirdparty/lz4-v1.8.3.tar.gz
> export ARROW_ORC_URL=/home/wesm/arrow-thirdparty/orc-1.5.5.tar.gz
> export ARROW_PROTOBUF_URL=/home/wesm/arrow-thirdparty/protobuf-v3.7.1.tar.gz
> export 
> ARROW_RAPIDJSON_URL=/home/wesm/arrow-thirdparty/rapidjson-2bbd33b33217ff4a73434ebf10cdac41e2ef5e34.tar.gz
> export ARROW_RE2_URL=/home/wesm/arrow-thirdparty/re2-2019-04-01.tar.gz
> {code}
> The 5 dependencies listed after RE2 are not downloaded





[jira] [Comment Edited] (ARROW-5214) [C++] Offline dependency downloader misses some libraries

2019-04-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826979#comment-16826979
 ] 

Francois Saint-Jacques edited comment on ARROW-5214 at 4/26/19 1:59 PM:


The script exits silently, but with a non-zero error code; I'll fix this. 
The real issue is that this Snappy version's URL (the path changed) no longer 
exists.


was (Author: fsaintjacques):
The script is exiting silently, but with a non-zero error code. I'll fix this. 
The real issue is that this snappy version does not exists anymore.






[jira] [Resolved] (ARROW-4187) [C++] file-benchmark uses <poll.h>

2019-07-05 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-4187.
---
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4809
[https://github.com/apache/arrow/pull/4809]

> [C++] file-benchmark uses <poll.h>
> --
>
> Key: ARROW-4187
> URL: https://issues.apache.org/jira/browse/ARROW-4187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> arrow/io/file-benchmark.cc includes poll.h, which causes the build to fail on 
> Windows





[jira] [Resolved] (ARROW-5849) Compiler warnings on mingw-w64

2019-07-05 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5849.
---
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4804
[https://github.com/apache/arrow/pull/4804]

> Compiler warnings on mingw-w64
> --
>
> Key: ARROW-5849
> URL: https://issues.apache.org/jira/browse/ARROW-5849
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Jeroen
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In mingw64 we see the following warnings:
> {code}
> [ 54%] Building CXX object 
> src/arrow/CMakeFiles/arrow_static.dir/util/io-util.cc.obj
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/decimal.cc:
>  In static member function 'static arrow::Status 
> arrow::Decimal128::FromString(const string_view&, arrow::Decimal128*, 
> int32_t*, int32_t*)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/decimal.cc:313:35:
>  warning: 'dec.arrow::{anonymous}::DecimalComponents::exponent' may be used 
> uninitialized in this function [-Wmaybe-uninitialized]
>*scale = -adjusted_exponent + len - 1;
> ~~~^~~~
> {code} 
> {code}
> [ 56%] Building CXX object 
> src/arrow/CMakeFiles/arrow_static.dir/util/string_builder.cc.obj
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:
>  In static member function 'static arrow::Status 
> arrow::internal::TemporaryDir::Make(const string&, 
> std::unique_ptr<TemporaryDir>*)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:897:3:
>  warning: 'created' may be used uninitialized in this function 
> [-Wmaybe-uninitialized]
>if (!created) {
>^~
> {code}
> And on mingw32 we also see these:
> {code}
> In file included from 
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/file.cc:25:
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/mman.h:
>  In function 'void* mmap(void*, size_t, int, int, int, off_t)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/mman.h:94:62:
>  warning: right shift count >= width of type [-Wshift-count-overflow]
>const DWORD dwMaxSizeHigh = static_cast<DWORD>((maxSize >> 32) & 
> 0xFFFFFFFFL);
>   ^~
> {code}
> {code}
> [ 54%] Building CXX object 
> src/arrow/CMakeFiles/arrow_static.dir/util/logging.cc.obj
> In file included from 
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:63:
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/mman.h:
>  In function 'void* mmap(void*, size_t, int, int, int, off_t)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/mman.h:94:62:
>  warning: right shift count >= width of type [-Wshift-count-overflow]
>const DWORD dwMaxSizeHigh = static_cast<DWORD>((maxSize >> 32) & 
> 0xFFFFFFFFL);
>   ^~
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:
>  In function 'arrow::Status arrow::internal::MemoryMapRemap(void*, size_t, 
> size_t, int, void**)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:568:55:
>  warning: right shift count >= width of type [-Wshift-count-overflow]
>LONG new_size_high = static_cast<LONG>((new_size >> 32) & 0xFFFFFFFFL);
>  
> {code}





[jira] [Resolved] (ARROW-5851) [C++] Compilation of reference benchmarks fails

2019-07-05 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5851.
---
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4808
[https://github.com/apache/arrow/pull/4808]

> [C++] Compilation of reference benchmarks fails
> ---
>
> Key: ARROW-5851
> URL: https://issues.apache.org/jira/browse/ARROW-5851
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code}
> ../src/arrow/util/compression-benchmark.cc: In function 'void 
> arrow::util::StreamingDecompression(arrow::Compression::type, const 
> std::vector&, benchmark::State&)':
> ../src/arrow/util/compression-benchmark.cc:172:5: error: 'ARROW_CHECK' was 
> not declared in this scope
>  ARROW_CHECK(decompressed_size == static_cast<int64_t>(data.size()));
>  ^~~
> ../src/arrow/util/compression-benchmark.cc:172:5: note: suggested 
> alternative: 'ARROW_CONCAT'
>  ARROW_CHECK(decompressed_size == static_cast<int64_t>(data.size()));
>  ^~~
>  ARROW_CONCAT
> {code}





[jira] [Commented] (ARROW-5759) Suspend CI builds for draft pull requests on GitHub

2019-06-27 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874302#comment-16874302
 ] 

Francois Saint-Jacques commented on ARROW-5759:
---

I don't agree with this one; the CI on a draft PR is often used precisely to 
ensure that CI passes. You can use the `[skip travis]` token in your commit 
message to achieve the same effect; see 
https://github.com/apache/arrow/blob/master/ci/detect-changes.py#L223.

> Suspend CI builds for draft pull requests on GitHub
> ---
>
> Key: ARROW-5759
> URL: https://issues.apache.org/jira/browse/ARROW-5759
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration
>Reporter: Prudhvi Porandla
>Priority: Trivial
>
> CI should be disabled for draft pull requests. 





[jira] [Resolved] (ARROW-5718) [R] auto splice data frames in record_batch() and table()

2019-06-27 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5718.
---
Resolution: Fixed

> [R] auto splice data frames in record_batch() and table()
> -
>
> Key: ARROW-5718
> URL: https://issues.apache.org/jira/browse/ARROW-5718
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain François
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> ARROW-3814 / 
> [https://github.com/apache/arrow/pull/3565/files#diff-95ad459e0128bfecf0d72ebd6d6ee8aaR94]
>  changed the API of `record_batch()` and `arrow::table()` such that you could 
> no longer pass in a data.frame to the function, not without [massaging it 
> yourself|https://github.com/apache/arrow/pull/3565/files#diff-09c05d1a6ff41bed094fbccfa76395a6R27].
>  That broke sparklyr integration tests with an opaque `cannot infer type from 
> data` error, and it's unfortunate that there's no longer a direct way to go 
> from a data.frame to a record batch, which sounds like a common need.
> In order to follow best practices (cf. the 
> [tibble|https://tibble.tidyverse.org/] package, for example), we should (1) 
> add an {{as_record_batch}} function, which the data.frame method is probably 
> just {{as_record_batch.data.frame <- function(x) record_batch(!!!x)}}; and 
> (2) if a user supplies a single, unnamed data.frame as the argument to 
> {{record_batch()}}, raise an error that says to use {{as_record_batch()}}. We 
> may later decide that we should automatically call as_record_batch(), but in 
> case that is too magical and prevents some legitimate use case, let's hold 
> off for now. It's easier to add magic than remove it.
> Once this function exists, sparklyr tests can try to use {{as_record_batch}}, 
> and if that function doesn't exist, fall back to {{record_batch}} (because 
> that means it has an older released version of arrow that doesn't have 
> as_record_batch, so record_batch(df) should work).
> cc [~javierluraschi]





[jira] [Resolved] (ARROW-3732) [R] Add functions to write RecordBatch or Schema to Message value, then read back

2019-06-27 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-3732.
---
   Resolution: Fixed
Fix Version/s: 0.14.0

> [R] Add functions to write RecordBatch or Schema to Message value, then read 
> back
> -
>
> Key: ARROW-3732
> URL: https://issues.apache.org/jira/browse/ARROW-3732
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Wes McKinney
>Assignee: Romain François
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Follow up work to ARROW-3499





[jira] [Resolved] (ARROW-5749) [Python] Add Python binding for Table::CombineChunks()

2019-06-27 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5749.
---
Resolution: Fixed

Issue resolved by pull request 4712
[https://github.com/apache/arrow/pull/4712]

> [Python] Add Python binding for Table::CombineChunks()
> --
>
> Key: ARROW-5749
> URL: https://issues.apache.org/jira/browse/ARROW-5749
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
> Fix For: 0.14.0
>
>






[jira] [Created] (ARROW-5739) [CI] Fix docker python build

2019-06-26 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5739:
-

 Summary: [CI] Fix docker python build
 Key: ARROW-5739
 URL: https://issues.apache.org/jira/browse/ARROW-5739
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Francois Saint-Jacques


The python docker image fails to clean the build directory, so artifacts 
installed by a previous invocation of `docker-compose run python` persist. This 
does not affect CI, which drops the `/build` mount, but only local users.





[jira] [Resolved] (ARROW-5045) [Rust] Code coverage silently failing in CI

2019-06-26 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5045.
---
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4700
[https://github.com/apache/arrow/pull/4700]

> [Rust] Code coverage silently failing in CI
> ---
>
> Key: ARROW-5045
> URL: https://issues.apache.org/jira/browse/ARROW-5045
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.13.0
>Reporter: Andy Grove
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
>  
> {code:java}
> error: could not execute process `target/kcov-master/build/src/kcov --verify 
> --include-path=/home/travis/build/apache/arrow/rust 
> /home/travis/build/apache/arrow/rust/target/kcov-arrow-f04240306dd653e9 
> /home/travis/build/apache/arrow/rust/target/debug/deps/arrow-f04240306dd653e9`
>  (never executed){code}





[jira] [Commented] (ARROW-5730) [CI] Dask integration tests are failing

2019-06-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873369#comment-16873369
 ] 

Francois Saint-Jacques commented on ARROW-5730:
---

Note that the local error I got is fixed by ARROW-5739.

> [CI] Dask integration tests are failing
> ---
>
> Key: ARROW-5730
> URL: https://issues.apache.org/jira/browse/ARROW-5730
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.14.0
>
>
> Have not investigated yet, build: 
> https://circleci.com/gh/ursa-labs/crossbow/387





[jira] [Commented] (ARROW-5745) [C++] properties of Map(Array|Type) are confusingly named

2019-06-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873499#comment-16873499
 ] 

Francois Saint-Jacques commented on ARROW-5745:
---

Just note that for PrimitiveArray, it doesn't even return the same type 
(Buffer* instead of Array*).

> [C++] properties of Map(Array|Type) are confusingly named
> -
>
> Key: ARROW-5745
> URL: https://issues.apache.org/jira/browse/ARROW-5745
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Major
>
> In the context of ListArrays, "values" indicates the elements in a slot of 
> the ListArray. Since MapArray is a ListArray, "values" indicates the same 
> thing and the elements are key-item pairs. This naming scheme is not 
> idiomatic; these *should* be called key-value pairs but that would require 
> propagating the renaming down to ListArray.





[jira] [Created] (ARROW-5779) [R][CI] R's docker image fails due to incompatibility

2019-06-28 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5779:
-

 Summary: [R][CI] R's docker image fails due to incompatibility
 Key: ARROW-5779
 URL: https://issues.apache.org/jira/browse/ARROW-5779
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Francois Saint-Jacques



{code:bash}
The downloaded source packages are in
'/tmp/RtmpLu0eiq/downloaded_packages'
v  checking for file 
'/tmp/RtmpLu0eiq/remotes1a8d7c759a55/romainfrancois-decor-6c5a5aa/DESCRIPTION' 
...
-  preparing 'decor':
v  checking DESCRIPTION meta-information ...
-  cleaning src
-  checking for LF line-endings in source and make files and shell scripts
-  checking for empty or unneeded directories
-  building 'decor_0.0.0.9001.tar.gz'
   
Installing package into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
ERROR: this R is version 3.4.4, package 'decor' requires R >= 3.5.0
Error: Failed to install 'decor' from GitHub:
  (converted from warning) installation of package 
'/tmp/RtmpLu0eiq/file1a8d6986708c/decor_0.0.0.9001.tar.gz' had non-zero exit 
status
Execution halted
ERROR: Service 'r' failed to build: The command '/bin/sh -c Rscript -e 
"install.packages('devtools', repos = 'http://cran.rstudio.com')" && 
Rscript -e "devtools::install_github('romainfrancois/decor')" && Rscript -e 
"install.packages(c( 'Rcpp', 'dplyr', 'stringr', 'glue', 'vctrs',   
  'purrr', 'assertthat', 'fs', 'tibble', 
'crayon', 'testthat', 'bit64', 'hms', 
'lubridate'), repos = 'https://cran.rstudio.com')"' returned a non-zero 
code: 1
Makefile.docker:49: recipe for target 'build-r' failed

{code}

I'm not sure if the fix is just to bump R's version in the image, or avoid the 
failing package. cc [~romainfrancois]






[jira] [Created] (ARROW-5914) [CI] Build bundled dependencies in docker build step

2019-07-11 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5914:
-

 Summary: [CI] Build bundled dependencies in docker build step
 Key: ARROW-5914
 URL: https://issues.apache.org/jira/browse/ARROW-5914
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Francois Saint-Jacques
 Fix For: 1.0.0


In the recently introduced ARROW-5803, some heavy dependencies (thrift, 
protobuf, flatbuffers, grpc) are built at each invocation of docker-compose 
build (thus on each Travis test).

We should aim to build the third-party dependencies in the docker build phase 
instead, to exploit caching and docker-compose pull, so that the CI step doesn't 
need to rebuild said dependencies each time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5923) [C++] Fix int96 comment

2019-07-12 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5923.
---
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4858
[https://github.com/apache/arrow/pull/4858]

> [C++] Fix int96 comment
> ---
>
> Key: ARROW-5923
> URL: https://issues.apache.org/jira/browse/ARROW-5923
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Micah Kornfield
>Priority: Trivial
> Fix For: 1.0.0
>
>






[jira] [Created] (ARROW-5923) [C++] Fix int96 comment

2019-07-12 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5923:
-

 Summary: [C++] Fix int96 comment
 Key: ARROW-5923
 URL: https://issues.apache.org/jira/browse/ARROW-5923
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Francois Saint-Jacques
Assignee: Micah Kornfield








[jira] [Resolved] (ARROW-5588) [C++] Better support for building UnionArrays

2019-07-12 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5588.
---
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4781
[https://github.com/apache/arrow/pull/4781]

> [C++] Better support for building UnionArrays
> -
>
> Key: ARROW-5588
> URL: https://issues.apache.org/jira/browse/ARROW-5588
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Benjamin Kietzman
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> UnionBuilders (for both sparse and dense mode unions) are not currently 
> supported by MakeBuilder or ArrayFromJSON. This increases friction when 
> working with and testing against union arrays, and support should be added to 
> both. For ArrayFromJSON each entry must be specified with a (type code, 
> value) pair:
> {code}
> ArrayFromJSON(union_({field("lint", list(int32())), field("str", utf8())}), 
> R"([
>   [0, null],
>   [1, "hello"],
>   [0, [1, 2]],
>   [1, "world"]
> ])");
> {code}
> DenseUnionBuilder currently requires the user to explicitly input offsets, 
> but if it were modified to hold pointers to its child builders (as ListBuilder 
> does, for example) then those offsets could be derived from the lengths of the 
> child builders, which is much more user-friendly.





[jira] [Assigned] (ARROW-5588) [C++] Better support for building UnionArrays

2019-07-12 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-5588:
-

Assignee: Benjamin Kietzman

> [C++] Better support for building UnionArrays
> -
>
> Key: ARROW-5588
> URL: https://issues.apache.org/jira/browse/ARROW-5588
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> UnionBuilders (for both sparse and dense mode unions) are not currently 
> supported by MakeBuilder or ArrayFromJSON. This increases friction when 
> working with and testing against union arrays, and support should be added to 
> both. For ArrayFromJSON each entry must be specified with a (type code, 
> value) pair:
> {code}
> ArrayFromJSON(union_({field("lint", list(int32())), field("str", utf8())}), 
> R"([
>   [0, null],
>   [1, "hello"],
>   [0, [1, 2]],
>   [1, "world"]
> ])");
> {code}
> DenseUnionBuilder currently requires the user to explicitly input offsets, 
> but if it were modified to hold pointers to its child builders (as ListBuilder 
> does, for example) then those offsets could be derived from the lengths of the 
> child builders, which is much more user-friendly.





[jira] [Updated] (ARROW-5921) [C++][Fuzzing] Missing nullptr checks in IPC

2019-07-12 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5921:
--
Fix Version/s: 0.14.1

> [C++][Fuzzing] Missing nullptr checks in IPC
> 
>
> Key: ARROW-5921
> URL: https://issues.apache.org/jira/browse/ARROW-5921
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Minor
>  Labels: fuzzer, pull-request-available
> Fix For: 0.14.1
>
> Attachments: crash-09f72ba2a52b80366ab676364abec850fc668168, 
> crash-607e9caa76863a97f2694a769a1ae2fb83c55e02, 
> crash-cb8cedb6ff8a6f164210c497d91069812ef5d6f8, 
> crash-f37e71777ad0324b55b99224f2c7ffb0107bdfa2, 
> crash-fd237566879dc60fff4d956d5fe3533d74a367f3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{arrow-ipc-fuzzing-test}} found the attached crashes. Reproduce with
> {code}
> arrow-ipc-fuzzing-test crash-xxx
> {code}
> The attached crashes all have distinct sources and are all related to 
> missing nullptr checks. I have a fix basically ready.





[jira] [Updated] (ARROW-5921) [C++][Fuzzing] Missing nullptr checks in IPC

2019-07-12 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5921:
--
Fix Version/s: 1.0.0

> [C++][Fuzzing] Missing nullptr checks in IPC
> 
>
> Key: ARROW-5921
> URL: https://issues.apache.org/jira/browse/ARROW-5921
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Minor
>  Labels: fuzzer, pull-request-available
> Fix For: 1.0.0, 0.14.1
>
> Attachments: crash-09f72ba2a52b80366ab676364abec850fc668168, 
> crash-607e9caa76863a97f2694a769a1ae2fb83c55e02, 
> crash-cb8cedb6ff8a6f164210c497d91069812ef5d6f8, 
> crash-f37e71777ad0324b55b99224f2c7ffb0107bdfa2, 
> crash-fd237566879dc60fff4d956d5fe3533d74a367f3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{arrow-ipc-fuzzing-test}} found the attached crashes. Reproduce with
> {code}
> arrow-ipc-fuzzing-test crash-xxx
> {code}
> The attached crashes all have distinct sources and are all related to 
> missing nullptr checks. I have a fix basically ready.





[jira] [Resolved] (ARROW-5781) [Archery] Ensure benchmark clone accepts remotes in revision

2019-06-28 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5781.
---
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4741
[https://github.com/apache/arrow/pull/4741]

> [Archery] Ensure benchmark clone accepts remotes in revision
> 
>
> Key: ARROW-5781
> URL: https://issues.apache.org/jira/browse/ARROW-5781
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Affects Versions: 0.13.0
>Reporter: Francois Saint-Jacques
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Found that ursabot would always compare the PR tip commit with itself, see 
> https://github.com/apache/arrow/pull/4739#issuecomment-506819250. This is 
> due to buildbot's GitHub behavior of doing a local git-reset --hard, which 
> changes the `master` rev to this new state. 





[jira] [Assigned] (ARROW-5781) [Archery] Ensure benchmark clone accepts remotes in revision

2019-06-28 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-5781:
-

Assignee: Francois Saint-Jacques

> [Archery] Ensure benchmark clone accepts remotes in revision
> 
>
> Key: ARROW-5781
> URL: https://issues.apache.org/jira/browse/ARROW-5781
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Affects Versions: 0.13.0
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Found that ursabot would always compare the PR tip commit with itself, see 
> https://github.com/apache/arrow/pull/4739#issuecomment-506819250. This is 
> due to buildbot's GitHub behavior of doing a local git-reset --hard, which 
> changes the `master` rev to this new state. 





[jira] [Updated] (ARROW-5527) [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data

2019-07-08 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5527:
--
Description: The current implementation uses `std::vector` and 
`std::string` with unbounded size. The refactor would take a memory pool in the 
constructor for buffer management and would get rid of vectors. This will have 
the side effect of propagating Status to some calls (notably insert due to 
Upsize failing to resize).  (was: The current implementation uses `std::vector` 
and `std::string` with unbounded size. The refactor would take a memory pool in 
the constructor for buffer management and would get rid of vectors.

This will have the side effect of propagating Status to some calls (notably 
insert due to Upsize failing to resize).)

> [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data
> ---
>
> Key: ARROW-5527
> URL: https://issues.apache.org/jira/browse/ARROW-5527
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> The current implementation uses `std::vector` and `std::string` with 
> unbounded size. The refactor would take a memory pool in the constructor for 
> buffer management and would get rid of vectors. This will have the side 
> effect of propagating Status to some calls (notably insert due to Upsize 
> failing to resize).





[jira] [Updated] (ARROW-5527) [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data

2019-07-08 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5527:
--
Description: 
The current implementation uses `std::vector` and `std::string` with unbounded 
size. The refactor would take a memory pool in the constructor for buffer 
management and would get rid of vectors. This will have the side effect of 
propagating Status to some calls (notably insert due to Upsize failing to 
resize).

* MemoTable constructor needs to take a MemoryPool as input
* GetOrInsert must return Status/Result
* MemoTable should use a TypeBufferBuilder instead of std::vector
* BinaryMemoTable should use a BinaryBuilder instead of (std::vector, 
std::string) pair.

  was:The current implementation uses `std::vector` and `std::string` with 
unbounded size. The refactor would take a memory pool in the constructor for 
buffer management and would get rid of vectors. This will have the side effect 
of propagating Status to some calls (notably insert due to Upsize failing to 
resize).


> [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data
> ---
>
> Key: ARROW-5527
> URL: https://issues.apache.org/jira/browse/ARROW-5527
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> The current implementation uses `std::vector` and `std::string` with 
> unbounded size. The refactor would take a memory pool in the constructor for 
> buffer management and would get rid of vectors. This will have the side 
> effect of propagating Status to some calls (notably insert due to Upsize 
> failing to resize).
> * MemoTable constructor needs to take a MemoryPool as input
> * GetOrInsert must return Status/Result
> * MemoTable should use a TypeBufferBuilder instead of std::vector
> * BinaryMemoTable should use a BinaryBuilder instead of 
> (std::vector, std::string) pair.





[jira] [Updated] (ARROW-5202) [C++] Test and benchmark libraries library search path subtly affected by installation

2019-04-23 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5202:
--
Description: 
Test and benchmark binaries should always favor the local non-installed 
libarrow and libarrow_testing.

{code:bash}
$ cmake -GNinja -DARROW_BUILD_TESTS=ON ..
$ ldd release/arrow-array-test
libarrow_testing.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
(0x7f8f2b79e000)
libarrow.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
(0x7f8f2b063000)

$ ninja install
$ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON ..
$ ldd release/arrow-array-test 
libarrow_testing.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
(0x7f75d2bda000)
libarrow.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
(0x7f75d249f000)
 {code}

Component/s: C++
 Issue Type: Bug  (was: Improvement)
Summary: [C++] Test and benchmark libraries library search path subtly 
affected by installation  (was: [C++)

> [C++] Test and benchmark libraries library search path subtly affected by 
> installation
> --
>
> Key: ARROW-5202
> URL: https://issues.apache.org/jira/browse/ARROW-5202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> Test and benchmark binaries should always favor the local non-installed 
> libarrow and libarrow_testing.
> {code:bash}
> $ cmake -GNinja -DARROW_BUILD_TESTS=ON ..
> $ ldd release/arrow-array-test
> libarrow_testing.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
> (0x7f8f2b79e000)
> libarrow.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
> (0x7f8f2b063000)
> $ ninja install
> $ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON ..
> $ ldd release/arrow-array-test 
> libarrow_testing.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
> (0x7f75d2bda000)
> libarrow.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
> (0x7f75d249f000)
>  {code}





[jira] [Updated] (ARROW-5202) [C++] Test and benchmark libraries library search path subtly affected by installation

2019-04-23 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5202:
--
Priority: Minor  (was: Major)

> [C++] Test and benchmark libraries library search path subtly affected by 
> installation
> --
>
> Key: ARROW-5202
> URL: https://issues.apache.org/jira/browse/ARROW-5202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
>
> Test and benchmark binaries should always favor the local non-installed 
> libarrow and libarrow_testing.
> {code:bash}
> $ cmake -GNinja -DARROW_BUILD_TESTS=ON ..
> $ ldd release/arrow-array-test
> libarrow_testing.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
> (0x7f8f2b79e000)
> libarrow.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
> (0x7f8f2b063000)
> $ ninja install
> $ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON ..
> $ ldd release/arrow-array-test 
> libarrow_testing.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
> (0x7f75d2bda000)
> libarrow.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
> (0x7f75d249f000)
>  {code}





[jira] [Updated] (ARROW-5202) [C++] Test and benchmark libraries library search path subtly affected by installation

2019-04-23 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5202:
--
Fix Version/s: 0.14.0

> [C++] Test and benchmark libraries library search path subtly affected by 
> installation
> --
>
> Key: ARROW-5202
> URL: https://issues.apache.org/jira/browse/ARROW-5202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 0.14.0
>
>
> Test and benchmark binaries should always favor the local non-installed 
> libarrow and libarrow_testing.
> {code:bash}
> $ cmake -GNinja -DARROW_BUILD_TESTS=ON ..
> $ ldd release/arrow-array-test
> libarrow_testing.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
> (0x7f8f2b79e000)
> libarrow.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
> (0x7f8f2b063000)
> $ ninja install
> $ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON ..
> $ ldd release/arrow-array-test 
> libarrow_testing.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
> (0x7f75d2bda000)
> libarrow.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
> (0x7f75d249f000)
>  {code}





[jira] [Updated] (ARROW-5196) [C++] Uniform usage of Google cpu_features library across the codebase

2019-04-23 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5196:
--
Summary: [C++] Uniform usage of Google cpu_features library across the 
codebase  (was: [CPP] Uniform usage of Google cpu_features library across the 
codebase)

> [C++] Uniform usage of Google cpu_features library across the codebase
> ---
>
> Key: ARROW-5196
> URL: https://issues.apache.org/jira/browse/ARROW-5196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Areg Melik-Adamyan
>Assignee: Areg Melik-Adamyan
>Priority: Minor
> Fix For: 0.14.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Is there any objection to using Google's standard cpu_features library 
> [https://github.com/google/cpu_features] instead of 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/cpu-info.h]? 
> So far the latter has been used in only 3 places, which can be easily changed.





[jira] [Created] (ARROW-5202) [C++

2019-04-23 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5202:
-

 Summary: [C++
 Key: ARROW-5202
 URL: https://issues.apache.org/jira/browse/ARROW-5202
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques








[jira] [Updated] (ARROW-5202) [C++] Test and benchmark libraries library search path subtly affected by installation

2019-04-23 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5202:
--
Description: 
Test and benchmark binaries should always favor the local non-installed 
libarrow and libarrow_testing.

{code:bash}
$ cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test
libarrow_testing.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
(0x7f8f2b79e000)
libarrow.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
(0x7f8f2b063000)

$ ninja install
$ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test 
libarrow_testing.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
(0x7f75d2bda000)
libarrow.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
(0x7f75d249f000)
$ readelf -d release/arrow-array-test |grep RPATH
 0x000f (RPATH)  Library rpath: 
[/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib:/home/fsaintjacques/src/db/arrow/cpp/build/release:/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib]
 
{code}


  was:
Test and benchmark binaries should always favor the local non-installed 
libarrow and libarrow_testing.

{code:bash}
$ cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test
libarrow_testing.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
(0x7f8f2b79e000)
libarrow.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
(0x7f8f2b063000)

$ ninja install
$ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test 
libarrow_testing.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
(0x7f75d2bda000)
libarrow.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
(0x7f75d249f000)
 {code}



> [C++] Test and benchmark libraries library search path subtly affected by 
> installation
> --
>
> Key: ARROW-5202
> URL: https://issues.apache.org/jira/browse/ARROW-5202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 0.14.0
>
>
> Test and benchmark binaries should always favor the local non-installed 
> libarrow and libarrow_testing.
> {code:bash}
> $ cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
> $ ldd release/arrow-array-test
> libarrow_testing.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
> (0x7f8f2b79e000)
> libarrow.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
> (0x7f8f2b063000)
> $ ninja install
> $ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
> $ ldd release/arrow-array-test 
> libarrow_testing.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
> (0x7f75d2bda000)
> libarrow.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
> (0x7f75d249f000)
> $ readelf -d release/arrow-array-test |grep RPATH
>  0x000f (RPATH)  Library rpath: 
> [/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib:/home/fsaintjacques/src/db/arrow/cpp/build/release:/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib]
>  
> {code}





[jira] [Updated] (ARROW-5202) [C++] Test and benchmark libraries library search path subtly affected by installation

2019-04-23 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5202:
--
Description: 
Test and benchmark binaries should always favor the local non-installed 
libarrow and libarrow_testing.

{code:bash}
$ cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test
libarrow_testing.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
(0x7f8f2b79e000)
libarrow.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
(0x7f8f2b063000)

$ ninja install
$ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test 
libarrow_testing.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
(0x7f75d2bda000)
libarrow.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
(0x7f75d249f000)
$ readelf -d release/arrow-array-test |grep RPATH
 0x000f (RPATH)  Library rpath: 
[/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib:/home/fsaintjacques/src/db/arrow/cpp/build/release:/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib]
 

# actual invocation
[1/1] : && /usr/bin/ccache 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++
  -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 
-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
-fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -O3 -DNDEBUG  
-Wall -msse4.2  -O3 -DNDEBUG  -Wl,-O2 -Wl,--sort-common -Wl,--as-needed 
-Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections   -rdynamic 
src/arrow/CMakeFiles/arrow-array-test.dir/array-test.cc.o 
src/arrow/CMakeFiles/arrow-array-test.dir/array-binary-test.cc.o 
src/arrow/CMakeFiles/arrow-array-test.dir/array-dict-test.cc.o 
src/arrow/CMakeFiles/arrow-array-test.dir/array-list-test.cc.o 
src/arrow/CMakeFiles/arrow-array-test.dir/array-struct-test.cc.o 
src/arrow/CMakeFiles/arrow-array-test.dir/array-union-test.cc.o  -o 
release/arrow-array-test  
-Wl,-rpath,/home/fsaintjacques/src/db/arrow/cpp/build/release:/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib
 release/libarrow_testing.so.14.0.0 release/libarrow.so.14.0.0 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libdouble-conversion.a 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libbrotlienc.so 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libbrotlidec.so 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libbrotlicommon.so 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libglog.so -ldl 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libdouble-conversion.a 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libboost_system.so 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libboost_filesystem.so 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libboost_regex.so 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libgtest_main.so 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libgtest.so 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libgmock.so -ldl 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt -pthread 
-Wl,-rpath-link,/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib && :
{code}


  was:
Test and benchmark binaries should always favor the local non-installed 
libarrow and libarrow_testing.

{code:bash}
$ cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test
libarrow_testing.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
(0x7f8f2b79e000)
libarrow.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
(0x7f8f2b063000)

$ ninja install
$ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test 
libarrow_testing.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
(0x7f75d2bda000)
libarrow.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
(0x7f75d249f000)
$ readelf -d release/arrow-array-test |grep RPATH
 0x000f (RPATH)  Library rpath: 
[/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib:/home/fsaintjacques/src/db/arrow/cpp/build/release:/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib]
 
{code}



> [C++] Test and benchmark libraries library search path subtly affected by 
> installation
> --
>
> Key: ARROW-5202
> URL: https://issues.apache.org/jira/browse/ARROW-5202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>

[jira] [Updated] (ARROW-5202) [C++] Test and benchmark libraries library search path subtly affected by installation

2019-04-23 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5202:
--
Description: 
Test and benchmark binaries should always favor the local non-installed 
libarrow and libarrow_testing.

{code:bash}
$ cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test
libarrow_testing.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
(0x7f8f2b79e000)
libarrow.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
(0x7f8f2b063000)

$ ninja install
$ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
$ ldd release/arrow-array-test 
libarrow_testing.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
(0x7f75d2bda000)
libarrow.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
(0x7f75d249f000)
 {code}


  was:
Test and benchmark binaries should always favor the local non-installed 
libarrow and libarrow_testing.

{code:bash}
$ cmake -GNinja -DARROW_BUILD_TESTS=ON ..
$ ldd release/arrow-array-test
libarrow_testing.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
(0x7f8f2b79e000)
libarrow.so.14 => 
/home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
(0x7f8f2b063000)

$ ninja install
$ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON ..
$ ldd release/arrow-array-test 
libarrow_testing.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
(0x7f75d2bda000)
libarrow.so.14 => 
/home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
(0x7f75d249f000)
 {code}



> [C++] Test and benchmark libraries' library search path subtly affected by 
> installation
> --
>
> Key: ARROW-5202
> URL: https://issues.apache.org/jira/browse/ARROW-5202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 0.14.0
>
>
> Test and benchmark binaries should always favor the local non-installed 
> libarrow and libarrow_testing.
> {code:bash}
> $ cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
> $ ldd release/arrow-array-test
> libarrow_testing.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow_testing.so.14 
> (0x7f8f2b79e000)
> libarrow.so.14 => 
> /home/fsaintjacques/src/db/arrow/cpp/build/release/libarrow.so.14 
> (0x7f8f2b063000)
> $ ninja install
> $ rm -rf * && cmake -GNinja -DARROW_BUILD_TESTS=ON .. && ninja
> $ ldd release/arrow-array-test 
> libarrow_testing.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow_testing.so.14 
> (0x7f75d2bda000)
> libarrow.so.14 => 
> /home/fsaintjacques/miniconda/envs/pyarrow-dev/lib/libarrow.so.14 
> (0x7f75d249f000)
>  {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5071) [Benchmarking] Performs a benchmark run with archery

2019-04-24 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5071:
--
Description: 
Run all regression benchmarks, consume the output, and re-format it according 
to the format required by the dev/benchmarking specification and/or push it to 
the upstream database.

This would be implemented as `archery benchmark run`. Provide a facility to 
save/load results as a StaticRunner (such that it can be re-used in a 
comparison without running the benchmark again).

  was:
Run all regression benchmarks, consume output and re-format according to the 
format required by dev/benchmarking specification.

This would be implemented as `archery benchmark run`. Provide facility to 
save/load results as a StaticRunner (such that it can be re-used in comparison 
without running the benchmark again).


> [Benchmarking] Performs a benchmark run with archery
> 
>
> Key: ARROW-5071
> URL: https://issues.apache.org/jira/browse/ARROW-5071
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Run all regression benchmarks, consume the output, and re-format it according 
> to the format required by the dev/benchmarking specification and/or push it 
> to the upstream database.
> This would be implemented as `archery benchmark run`. Provide a facility to 
> save/load results as a StaticRunner (such that it can be re-used in a 
> comparison without running the benchmark again).
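A minimal sketch of the save/load round-trip behind the StaticRunner idea (the class shape and JSON layout here are illustrative assumptions, not archery's actual API):

```python
import json

# Hypothetical StaticRunner sketch: persist one benchmark run so a later
# comparison can reuse it without re-running the benchmarks.
class StaticRunner:
    def __init__(self, results):
        self.results = results

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.results, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(json.load(f))

run = StaticRunner({"BM_BuildArray/1024": {"time_ns": 1200}})
run.save("baseline.json")
print(StaticRunner.load("baseline.json").results)
```

A comparison command could then load the saved baseline instead of re-running one side of the comparison.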





[jira] [Assigned] (ARROW-5214) [C++] Offline dependency downloader misses some libraries

2019-04-25 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-5214:
-

Assignee: Francois Saint-Jacques

> [C++] Offline dependency downloader misses some libraries
> -
>
> Key: ARROW-5214
> URL: https://issues.apache.org/jira/browse/ARROW-5214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.14.0
>
>
> Not sure yet but maybe this was introduced by 
> https://github.com/apache/arrow/commit/f913d8f0adff71c288a10f6c1b0ad2d1ab3e9e32
> {code}
> $ thirdparty/download_dependencies.sh /home/wesm/arrow-thirdparty
> # Environment variables for offline Arrow build
> export ARROW_BOOST_URL=/home/wesm/arrow-thirdparty/boost-1.67.0.tar.gz
> export ARROW_BROTLI_URL=/home/wesm/arrow-thirdparty/brotli-v1.0.7.tar.gz
> export ARROW_CARES_URL=/home/wesm/arrow-thirdparty/cares-1.15.0.tar.gz
> export 
> ARROW_DOUBLE_CONVERSION_URL=/home/wesm/arrow-thirdparty/double-conversion-v3.1.4.tar.gz
> export 
> ARROW_FLATBUFFERS_URL=/home/wesm/arrow-thirdparty/flatbuffers-v1.10.0.tar.gz
> export 
> ARROW_GBENCHMARK_URL=/home/wesm/arrow-thirdparty/gbenchmark-v1.4.1.tar.gz
> export ARROW_GFLAGS_URL=/home/wesm/arrow-thirdparty/gflags-v2.2.0.tar.gz
> export ARROW_GLOG_URL=/home/wesm/arrow-thirdparty/glog-v0.3.5.tar.gz
> export ARROW_GRPC_URL=/home/wesm/arrow-thirdparty/grpc-v1.20.0.tar.gz
> export ARROW_GTEST_URL=/home/wesm/arrow-thirdparty/gtest-1.8.1.tar.gz
> export ARROW_LZ4_URL=/home/wesm/arrow-thirdparty/lz4-v1.8.3.tar.gz
> export ARROW_ORC_URL=/home/wesm/arrow-thirdparty/orc-1.5.5.tar.gz
> export ARROW_PROTOBUF_URL=/home/wesm/arrow-thirdparty/protobuf-v3.7.1.tar.gz
> export 
> ARROW_RAPIDJSON_URL=/home/wesm/arrow-thirdparty/rapidjson-2bbd33b33217ff4a73434ebf10cdac41e2ef5e34.tar.gz
> export ARROW_RE2_URL=/home/wesm/arrow-thirdparty/re2-2019-04-01.tar.gz
> {code}
> The 5 dependencies listed after RE2 are not downloaded





[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-25 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826333#comment-16826333
 ] 

Francois Saint-Jacques commented on ARROW-5130:
---

This is not fixed in master; I fetched the latest wheel from crossbow and can 
still trigger the segfault. I think that propagating the fix from ARROW-2796 to 
all libraries (parquet, plasma, gandiva) will do the trick.

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x in ?? ()
> (gdb) bt
> #0 0x in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=<optimized out>, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=<optimized out>) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 , 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=<optimized out>,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
> #12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at 
> dlopen.c:66
> #13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
> errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 
> , args=0x7ffc6d8bedd0) at dl-error.c:187
> #14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
> , args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
> #15 0x7f529ebf40c1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
> #16 0x00540859 in _PyImport_GetDynLoadFunc ()
> #17 0x0054024c in _PyImport_LoadDynamicModule ()
> #18 0x005f2bcb in ?? ()
> #19 0x004ca235 in PyEval_EvalFrameEx ()
> #20 0x004ca9c2 in PyEval_EvalFrameEx ()
> #21 0x004c8c39 in PyEval_EvalCodeEx ()
> #22 0x004c84e6 in PyEval_EvalCode ()
> #23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #24 0x004c3272 in ?? ()
> #25 0x004b19e2 in ?? ()
> #26 0x004b13d7 in ?? ()
> #27 0x004b42f6 in ?? ()
> #28 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #29 0x004ccdb3 in PyEval_EvalFrameEx ()
> #30 0x004c8c39 in PyEval_EvalCodeEx ()
> #31 0x004c84e6 in PyEval_EvalCode ()
> #32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #33 0x004c3272 in ?? ()
> #34 0x004b1d3f in ?? ()
> #35 0x004b6b2b in ?? ()
> #36 0x004b0d82 in ?? ()
> #37 0x004b42f6 in ?? ()
> #38 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
> It looks like the code changes that fixed the previous issue were recently 
> removed in 
> [https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].




[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-25 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826335#comment-16826335
 ] 

Francois Saint-Jacques commented on ARROW-5130:
---

Also note that I can't trigger this from a local build; I suspect that this is 
related to how we build/package the wheel.

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x in ?? ()
> (gdb) bt
> #0 0x in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=<optimized out>, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=<optimized out>) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 , 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=<optimized out>,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
> #12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at 
> dlopen.c:66
> #13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
> errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 
> , args=0x7ffc6d8bedd0) at dl-error.c:187
> #14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
> , args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
> #15 0x7f529ebf40c1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
> #16 0x00540859 in _PyImport_GetDynLoadFunc ()
> #17 0x0054024c in _PyImport_LoadDynamicModule ()
> #18 0x005f2bcb in ?? ()
> #19 0x004ca235 in PyEval_EvalFrameEx ()
> #20 0x004ca9c2 in PyEval_EvalFrameEx ()
> #21 0x004c8c39 in PyEval_EvalCodeEx ()
> #22 0x004c84e6 in PyEval_EvalCode ()
> #23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #24 0x004c3272 in ?? ()
> #25 0x004b19e2 in ?? ()
> #26 0x004b13d7 in ?? ()
> #27 0x004b42f6 in ?? ()
> #28 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #29 0x004ccdb3 in PyEval_EvalFrameEx ()
> #30 0x004c8c39 in PyEval_EvalCodeEx ()
> #31 0x004c84e6 in PyEval_EvalCode ()
> #32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #33 0x004c3272 in ?? ()
> #34 0x004b1d3f in ?? ()
> #35 0x004b6b2b in ?? ()
> #36 0x004b0d82 in ?? ()
> #37 0x004b42f6 in ?? ()
> #38 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
> It looks like the code changes that fixed the previous issue were recently 
> removed in 
> [https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].





[jira] [Updated] (ARROW-5071) [Benchmarking] Performs a benchmark run with archery

2019-04-23 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5071:
--
Description: 
Run all regression benchmarks, consume the output, and re-format it according 
to the format required by the dev/benchmarking specification.

This would be implemented as `archery benchmark run`. Provide a facility to 
save/load results as a StaticRunner (such that it can be re-used in a 
comparison without running the benchmark again).

  was:
Run all regression benchmarks, consume output and re-format according to the 
format required by dev/benchmarking specification.

This would be implemented as `archery benchmark run`


> [Benchmarking] Performs a benchmark run with archery
> 
>
> Key: ARROW-5071
> URL: https://issues.apache.org/jira/browse/ARROW-5071
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Run all regression benchmarks, consume the output, and re-format it according 
> to the format required by the dev/benchmarking specification.
> This would be implemented as `archery benchmark run`. Provide a facility to 
> save/load results as a StaticRunner (such that it can be re-used in a 
> comparison without running the benchmark again).





[jira] [Created] (ARROW-5781) [Archery] Ensure benchmark clone accepts remotes in revision

2019-06-28 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5781:
-

 Summary: [Archery] Ensure benchmark clone accepts remotes in 
revision
 Key: ARROW-5781
 URL: https://issues.apache.org/jira/browse/ARROW-5781
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Affects Versions: 0.13.0
Reporter: Francois Saint-Jacques


We found that ursabot would always compare the PR tip commit with itself; see 
https://github.com/apache/arrow/pull/4739#issuecomment-506819250 . This is due 
to buildbot's GitHub behavior of performing a local `git reset --hard`, which 
changes the `master` rev to this new state.





[jira] [Resolved] (ARROW-5780) [C++] Add benchmark for Decimal128 operations

2019-06-28 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5780.
---
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4740
[https://github.com/apache/arrow/pull/4740]

> [C++] Add benchmark for Decimal128 operations
> -
>
> Key: ARROW-5780
> URL: https://issues.apache.org/jira/browse/ARROW-5780
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-5803) [C++] Dockerize C++ with clang 7 Travis CI unit test logic

2019-07-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-5803:
-

Assignee: Francois Saint-Jacques

> [C++] Dockerize C++ with clang 7 Travis CI unit test logic
> --
>
> Key: ARROW-5803
> URL: https://issues.apache.org/jira/browse/ARROW-5803
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Convert to docker-compose (or use one of the current Dockerfiles under cpp/)





[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-31 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897238#comment-16897238
 ] 

Francois Saint-Jacques commented on ARROW-6004:
---

I'd expect the empty lines to be skipped; if one wants nulls, the line should 
contain the exact number of commas.
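To illustrate the distinction with Python's stdlib csv module (standing in here for Arrow's reader, whose behavior is what the issue debates):

```python
import csv
import io

# An empty line versus a line containing only field separators.
data = "a,b,c\n\n,,\n1,2,3\n"
rows = list(csv.reader(io.StringIO(data)))

print(rows[1])  # [] -- the empty line carries no fields at all
print(rows[2])  # ['', '', ''] -- ",," parses to three empty, null-able fields
```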

> [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
> -
>
> Key: ARROW-6004
> URL: https://issues.apache.org/jira/browse/ARROW-6004
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: csv, pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Followup to https://issues.apache.org/jira/browse/ARROW-5747. If 
> {{ignore_empty_lines}} is false and there are empty lines, it fails to parse 
> (again, with {{Invalid: Empty CSV file}}).
> Correct behavior should be to fill those empty lines with missing data for 
> all columns.





[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-31 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897264#comment-16897264
 ] 

Francois Saint-Jacques commented on ARROW-6004:
---

I was agreeing with keeping the (current) default of skipping instead of adding 
nulls. That said, I think it's worth having as an option.

> [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
> -
>
> Key: ARROW-6004
> URL: https://issues.apache.org/jira/browse/ARROW-6004
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: csv, pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Followup to https://issues.apache.org/jira/browse/ARROW-5747. If 
> {{ignore_empty_lines}} is false and there are empty lines, it fails to parse 
> (again, with {{Invalid: Empty CSV file}}).
> Correct behavior should be to fill those empty lines with missing data for 
> all columns.





[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internally

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Affects Version/s: 0.15.0

> [C++] IsIn kernel should not materialize the output internally
> 
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.





[jira] [Created] (ARROW-6123) [C++] IsIn kernel should not materialize the output internally

2019-08-02 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6123:
-

 Summary: [C++] IsIn kernel should not materialize the output 
internally
 Key: ARROW-6123
 URL: https://issues.apache.org/jira/browse/ARROW-6123
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


It should use the helpers since the output size is known.





[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internally

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Labels: ana  (was: )

> [C++] IsIn kernel should not materialize the output internally
> 
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: ana
>
> It should use the helpers since the output size is known.





[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internally

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Component/s: C++

> [C++] IsIn kernel should not materialize the output internally
> 
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.





[jira] [Created] (ARROW-6121) [Tools] Improve merge tool CLI ergonomics

2019-08-02 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6121:
-

 Summary: [Tools] Improve merge tool CLI ergonomics
 Key: ARROW-6121
 URL: https://issues.apache.org/jira/browse/ARROW-6121
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Francois Saint-Jacques
Assignee: Francois Saint-Jacques


* Accepts the pull-request number as an optional (first) parameter to the script
* Supports reading the JIRA username/password from a file





[jira] [Created] (ARROW-6122) [C++] IsIn kernel must support FixedSizeBinary

2019-08-02 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6122:
-

 Summary: [C++] IsIn kernel must support FixedSizeBinary
 Key: ARROW-6122
 URL: https://issues.apache.org/jira/browse/ARROW-6122
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.0
Reporter: Francois Saint-Jacques








[jira] [Commented] (ARROW-5932) undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'

2019-08-02 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899167#comment-16899167
 ] 

Francois Saint-Jacques commented on ARROW-5932:
---

How did you install Arrow? From source?

> undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'
> ---
>
> Key: ARROW-5932
> URL: https://issues.apache.org/jira/browse/ARROW-5932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
> Environment: Linux Mint 19.1 Tessa
> g++-6
>Reporter: Cong Ding
>Priority: Critical
>
> I was installing Apache Arrow on my Linux Mint 19.1 Tessa server. I followed 
> the instructions on the official Arrow website (using the Ubuntu 18.04 
> method). However, when I tried to compile the examples, the g++ compiler 
> threw some errors.
> I have updated my g++ to g++-6, updated my libstdc++ library, and used the 
> -lstdc++ flag, but it still didn't work.
>  
> {code:java}
> // code placeholder
> g++-6 -std=c++11 -larrow -lparquet main.cpp -lstdc++ 
> {code}
> The error message:
> /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to 
> `__cxa_init_primary_exception@CXXABI_1.3.11'
> /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to 
> `std::__exception_ptr::exception_ptr::exception_ptr(void*)@CXXABI_1.3.11'
> collect2: error: ld returned 1 exit status.
>  
> I do not know what to do at this moment. Can anyone help me?





[jira] [Created] (ARROW-6124) [C++] IsIn kernel should sort in a single pass (with nulls)

2019-08-02 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6124:
-

 Summary: [C++] IsIn kernel should sort in a single pass (with 
nulls)
 Key: ARROW-6124
 URL: https://issues.apache.org/jira/browse/ARROW-6124
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.0
Reporter: Francois Saint-Jacques


There's a good chance that merge sort must be implemented (spill to disk, 
ChunkedArray, ...)





[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internally

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Labels:   (was: ana)

> [C++] IsIn kernel should not materialize the output internally
> 
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.





[jira] [Updated] (ARROW-6122) [C++] ArgSort kernel must support FixedSizeBinary

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6122:
--
Summary: [C++] ArgSort kernel must support FixedSizeBinary  (was: [C++] 
IsIn kernel must support FixedSizeBinary)

> [C++] ArgSort kernel must support FixedSizeBinary
> -
>
> Key: ARROW-6122
> URL: https://issues.apache.org/jira/browse/ARROW-6122
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>






[jira] [Updated] (ARROW-6123) [C++] ArgSort kernel should not materialize the output internally

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Summary: [C++] ArgSort kernel should not materialize the output internally  
(was: [C++] IsIn kernel should not materialize the output internally)

> [C++] ArgSort kernel should not materialize the output internally
> ---
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.





[jira] [Resolved] (ARROW-1566) [C++] Implement non-materializing sort kernels

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-1566.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4861
[https://github.com/apache/arrow/pull/4861]

> [C++] Implement non-materializing sort kernels
> --
>
> Key: ARROW-1566
> URL: https://issues.apache.org/jira/browse/ARROW-1566
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> The output of such an operator would be a permutation vector that, if applied 
> to a column, would result in the data being sorted as requested. This is 
> similar to numpy's argsort functionality.
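The permutation-vector idea can be sketched in plain Python (a conceptual illustration, not the Arrow kernel itself):

```python
# Argsort-style kernel: return a permutation vector instead of a
# materialized sorted copy (nulls ignored for brevity).
def arg_sort(values):
    return sorted(range(len(values)), key=lambda i: values[i])

col = [30, 10, 20]
perm = arg_sort(col)
print(perm)                    # [1, 2, 0]
print([col[i] for i in perm])  # [10, 20, 30] -- applying the permutation sorts
```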





[jira] [Assigned] (ARROW-1566) [C++] Implement non-materializing sort kernels

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-1566:
-

Assignee: Artem Alekseev

> [C++] Implement non-materializing sort kernels
> --
>
> Key: ARROW-1566
> URL: https://issues.apache.org/jira/browse/ARROW-1566
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Artem Alekseev
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> The output of such an operator would be a permutation vector that, if applied 
> to a column, would result in the data being sorted as requested. This is 
> similar to numpy's argsort functionality.





[jira] [Created] (ARROW-6244) [C++] Implement Partition DataSource

2019-08-14 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6244:
-

 Summary: [C++] Implement Partition DataSource
 Key: ARROW-6244
 URL: https://issues.apache.org/jira/browse/ARROW-6244
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Francois Saint-Jacques


This is a DataSource that also has partition metadata. The end goal is to 
support filtering with a DataSelector/Filter expression. The initial 
implementation should not deal with PartitionScheme yet.





[jira] [Assigned] (ARROW-6242) [C++] Implements basic Dataset/Scanner/ScannerBuilder

2019-08-14 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6242:
-

Assignee: Francois Saint-Jacques

> [C++] Implements basic Dataset/Scanner/ScannerBuilder
> -
>
> Key: ARROW-6242
> URL: https://issues.apache.org/jira/browse/ARROW-6242
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: datasets
>
> The goal of this would be to iterate over a Dataset and generate a 
> "flattened" stream of RecordBatches from the union of data sources and data 
> fragments. This should not bother with filtering yet.
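As a rough illustration (hypothetical names, plain Python, not the actual datasets API), the flattening described above amounts to:

```python
from itertools import chain

# Hypothetical stand-in for the scanner: each data source yields data
# fragments, each fragment yields record batches; the scanner flattens
# the nested iteration into a single stream of batches.
def scan(sources):
    fragments = chain.from_iterable(sources)   # sources -> fragments
    return chain.from_iterable(fragments)      # fragments -> batches

# Two sources; each fragment holds "batches" (plain strings here).
sources = [
    [["batch-a", "batch-b"], ["batch-c"]],   # source 1: two fragments
    [["batch-d"]],                           # source 2: one fragment
]
assert list(scan(sources)) == ["batch-a", "batch-b", "batch-c", "batch-d"]
```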





[jira] [Created] (ARROW-6243) [C++] Implement basic Filter expression classes

2019-08-14 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6243:
-

 Summary: [C++] Implement basic Filter expression classes
 Key: ARROW-6243
 URL: https://issues.apache.org/jira/browse/ARROW-6243
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Francois Saint-Jacques
Assignee: Benjamin Kietzman


This will draft the basic classes for creating boolean expressions that are 
passed to the DataSources/DataFragments for predicate push-down.
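A minimal sketch of what such expression classes might look like (hypothetical names, not the eventual Arrow classes; evaluated here against a plain dict rather than Arrow data):

```python
# A tiny boolean-expression tree that a data source could inspect for
# predicate push-down, or evaluate as a fallback.
class Expr:
    def __and__(self, other):
        return And(self, other)

class Equals(Expr):
    def __init__(self, field, value):
        self.field, self.value = field, value
    def evaluate(self, row):
        return row[self.field] == self.value

class And(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self, row):
        return self.left.evaluate(row) and self.right.evaluate(row)

expr = Equals("year", 2017) & Equals("country", "CA")
assert expr.evaluate({"year": 2017, "country": "CA"})
assert not expr.evaluate({"year": 2016, "country": "CA"})
```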





[jira] [Created] (ARROW-6242) [C++] Implements basic Dataset/Scanner/ScannerBuilder

2019-08-14 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6242:
-

 Summary: [C++] Implements basic Dataset/Scanner/ScannerBuilder
 Key: ARROW-6242
 URL: https://issues.apache.org/jira/browse/ARROW-6242
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Francois Saint-Jacques


The goal of this would be to iterate over a Dataset and generate a "flattened" 
stream of RecordBatches from the union of data sources and data fragments. This 
should not bother with filtering yet.





[jira] [Comment Edited] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909248#comment-16909248
 ] 

Francois Saint-Jacques edited comment on ARROW-6278 at 8/16/19 5:29 PM:


There's the BufferReader in C++

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/memory.h#L131-L168

which seems to be referenced/reachable in the R bindings:

https://github.com/apache/arrow/blob/master/r/src/io.cpp#L137-L141
https://github.com/apache/arrow/blob/master/r/R/io.R#L233-L245


was (Author: fsaintjacques):
There's the BufferReader in C++

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/memory.h#L131-L168

which seems to be referenced/reachable in the R bindings:

https://github.com/apache/arrow/blob/master/r/src/io.cpp#L137-L141

> [R] Handle raw vector from read_parquet 
> 
>
> Key: ARROW-6278
> URL: https://issues.apache.org/jira/browse/ARROW-6278
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Brendan Hogan
>Priority: Major
>
> {{read_parquet}} currently handles a path to a local file or an Arrow input 
> stream.  Would it be possible to add support for a raw vector containing the 
> contents of a parquet file?
> Apologies if there is already a way to do this.  I have tried populating a 
> buffer and passing that as input, but that is unsupported as well.  An 
> example of how to work using an input stream would be useful as well.





[jira] [Commented] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909248#comment-16909248
 ] 

Francois Saint-Jacques commented on ARROW-6278:
---

There's the BufferReader in C++

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/memory.h#L131-L168

which seems to be referenced/reachable in the R bindings:

https://github.com/apache/arrow/blob/master/r/src/io.cpp#L137-L141

> [R] Handle raw vector from read_parquet 
> 
>
> Key: ARROW-6278
> URL: https://issues.apache.org/jira/browse/ARROW-6278
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Brendan Hogan
>Priority: Major
>
> {{read_parquet}} currently handles a path to a local file or an Arrow input 
> stream.  Would it be possible to add support for a raw vector containing the 
> contents of a parquet file?
> Apologies if there is already a way to do this.  I have tried populating a 
> buffer and passing that as input, but that is unsupported as well.  An 
> example of how to work using an input stream would be useful as well.





[jira] [Resolved] (ARROW-6258) [R] Add macOS build scripts

2019-08-19 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6258.
---
Resolution: Fixed

Issue resolved by pull request 5095
[https://github.com/apache/arrow/pull/5095]

> [R] Add macOS build scripts
> ---
>
> Key: ARROW-6258
> URL: https://issues.apache.org/jira/browse/ARROW-6258
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> CRAN builds binary packages for Windows and macOS. It generally does this by 
> building on its servers and bundling all dependencies in the R package. This 
> has been accomplished by having separate processes for building and hosting 
> system dependencies, and then downloading and bundling those with scripts 
> that get executed at install time (and then create the binary package as a 
> side effect).
> ARROW-3758 added the Windows PKGBUILD and related packaging scripts and ran 
> them on our Appveyor. This ticket is to do the same for the macOS scripts.
> The purpose of these tickets is to bring the whole build pipeline under our 
> version control and CI so that we can address any C++ build and dependency 
> changes as they arise and not be surprised when it comes time to cut a 
> release. A side benefit is that they also enable us to offer a nightly binary 
> package repository with minimal additional effort.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6238) [C++] Implement SimpleDataSource

2019-08-14 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6238:
-

 Summary: [C++] Implement SimpleDataSource
 Key: ARROW-6238
 URL: https://issues.apache.org/jira/browse/ARROW-6238
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Francois Saint-Jacques
Assignee: Francois Saint-Jacques








[jira] [Updated] (ARROW-6238) [C++] Implement SimpleDataSource/SimpleDataFragment

2019-08-14 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6238:
--
Summary: [C++] Implement SimpleDataSource/SimpleDataFragment  (was: [C++] 
Implement SimpleDataSource)

> [C++] Implement SimpleDataSource/SimpleDataFragment
> ---
>
> Key: ARROW-6238
> URL: https://issues.apache.org/jira/browse/ARROW-6238
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: datasets
>






[jira] [Updated] (ARROW-3705) [Python] Add "nrows" argument to parquet.read_table to read indicated number of rows from file instead of whole file

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-3705:
--
Labels: dataset datasets parquet  (was: datasets parquet)

> [Python] Add "nrows" argument to parquet.read_table to read indicated number 
> of rows from file instead of whole file
> -
>
> Key: ARROW-3705
> URL: https://issues.apache.org/jira/browse/ARROW-3705
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, datasets, parquet
>
> This mirrors {{nrows}} in {{pandas.read_csv}}
> inspired by 
> https://stackoverflow.com/questions/53152671/how-to-read-sample-records-parquet-file-in-s3
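The requested semantics mirror pandas: stop after n rows instead of consuming the whole file. A stdlib sketch of that behavior (illustrative only, not the pyarrow API):

```python
from itertools import islice

# An "nrows" argument, sketched over a plain row iterator: take only the
# first n rows; nrows=None reads everything.
def read_rows(rows, nrows=None):
    return list(islice(rows, nrows))

rows = iter(range(1_000_000))
assert read_rows(rows, nrows=3) == [0, 1, 2]   # only the first 3 consumed
```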





[jira] [Updated] (ARROW-3379) [C++] Implement regex/multichar delimiter tokenizer

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-3379:
--
Labels: csv dataset datasets  (was: csv datasets)

> [C++] Implement regex/multichar delimiter tokenizer
> ---
>
> Key: ARROW-3379
> URL: https://issues.apache.org/jira/browse/ARROW-3379
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, dataset, datasets
> Fix For: 1.0.0
>
>






[jira] [Updated] (ARROW-2801) [Python] Implement split_row_groups for ParquetDataset

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-2801:
--
Labels: dataset datasets parquet pull-request-available  (was: datasets 
parquet pull-request-available)

> [Python] Implement split_row_groups for ParquetDataset
> -
>
> Key: ARROW-2801
> URL: https://issues.apache.org/jira/browse/ARROW-2801
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Robbie Gruener
>Assignee: Robbie Gruener
>Priority: Minor
>  Labels: dataset, datasets, parquet, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently the split_row_groups argument in ParquetDataset yields a not 
> implemented error. An easy and efficient way to implement this is by using 
> the summary metadata file instead of opening every footer file





[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-3538:
--
Labels: dataset datasets features parquet pull-request-available  (was: 
datasets features parquet pull-request-available)

> [Python] ability to override the automated assignment of uuid for filenames 
> when writing datasets
> -
>
> Key: ARROW-3538
> URL: https://issues.apache.org/jira/browse/ARROW-3538
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Ji Xu
>Assignee: Thomas Elvey
>Priority: Major
>  Labels: dataset, datasets, features, parquet, 
> pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Say I have a pandas DataFrame {{df}} that I would like to store on disk as 
> dataset using pyarrow parquet, I would do this:
> {code:java}
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_to_dataset(table, root_path=some_path, 
> partition_cols=['a',]){code}
> On disk the dataset would look something like this:
> {noformat}
> some_path
> ├── a=1
> │   └── 4498704937d84fe5abebb3f06515ab2d.parquet
> └── a=2
>     └── 8bcfaed8986c4bdba587aaaee532370c.parquet
> {noformat}
> *Wished Feature:* It'd be great if I can override the auto-assignment of the 
> long UUID as filename somehow during the *dataset* writing. My purpose is to 
> be able to overwrite the dataset on disk when I have a new version of {{df}}. 
> Currently if I try to write the dataset again, another new uniquely named 
> [UUID].parquet file will be placed next to the old one, with the same, 
> redundant data.
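The naming behavior in question, and the requested override, can be sketched as follows (the `part_filename` helper is hypothetical, not a pyarrow API):

```python
import uuid

# Roughly how the auto-assigned part filename is generated today: a fresh
# UUID per write, so rewriting the dataset adds a new file next to the old
# one instead of replacing it.
auto_name = "{}.parquet".format(uuid.uuid4().hex)

# The wished-for override: a caller-supplied, stable basename so a rewrite
# lands on the same path (hypothetical sketch).
def part_filename(basename=None):
    return "{}.parquet".format(basename or uuid.uuid4().hex)

assert part_filename("part-0") == "part-0.parquet"   # deterministic, overwritable
assert part_filename().endswith(".parquet")          # falls back to a UUID name
```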





[jira] [Updated] (ARROW-6238) [C++] Implement SimpleDataSource/SimpleDataFragment

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6238:
--
Labels: dataset datasets pull-request-available  (was: datasets 
pull-request-available)

> [C++] Implement SimpleDataSource/SimpleDataFragment
> ---
>
> Key: ARROW-6238
> URL: https://issues.apache.org/jira/browse/ARROW-6238
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, datasets, pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-6242) [C++] Implements basic Dataset/Scanner/ScannerBuilder

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6242:
--
Labels: dataset datasets  (was: datasets)

> [C++] Implements basic Dataset/Scanner/ScannerBuilder
> -
>
> Key: ARROW-6242
> URL: https://issues.apache.org/jira/browse/ARROW-6242
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, datasets
>
> The goal of this would be to iterate over a Dataset and generate a 
> "flattened" stream of RecordBatches from the union of data sources and data 
> fragments. This should not bother with filtering yet.





[jira] [Updated] (ARROW-3764) [C++] Port Python "ParquetDataset" business logic to C++

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-3764:
--
Labels: dataset datasets parquet  (was: datasets parquet)

> [C++] Port Python "ParquetDataset" business logic to C++
> 
>
> Key: ARROW-3764
> URL: https://issues.apache.org/jira/browse/ARROW-3764
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, datasets, parquet
> Fix For: 1.0.0
>
>
> Along with defining appropriate abstractions for dealing with generic 
> filesystems in C++, we should implement the machinery for reading multiple 
> Parquet files in C++ so that it can reused in GLib, R, and Ruby. Otherwise 
> these languages will have to reimplement things, and this would surely result 
> in inconsistent features, bugs in some implementations but not others





[jira] [Updated] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6161:
--
Labels: dataset datasets pull-request-available  (was: datasets 
pull-request-available)

> [C++] Implements dataset::ParquetFile and associated Scan structures
> 
>
> Key: ARROW-6161
> URL: https://issues.apache.org/jira/browse/ARROW-6161
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, datasets, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> This is first baby step in supporting datasets. The initial implementation 
> will be minimal and trivial, no parallel, no schema adaptation.





[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-3408:
--
Labels: csv dataset datasets  (was: csv datasets)

> [C++] Add option to CSV reader to dictionary encode individual columns or all 
> string / binary columns
> -
>
> Key: ARROW-3408
> URL: https://issues.apache.org/jira/browse/ARROW-3408
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, dataset, datasets
> Fix For: 1.0.0
>
>
> For many datasets, dictionary encoding everything can result in drastically 
> lower memory usage and subsequently better performance in doing analytics
> One difficulty of dictionary encoding in multithreaded conversions is that 
> ideally you end up with one dictionary at the end. So you have two options:
> * Implement a concurrent hashing scheme -- for low cardinality dictionaries, 
> the overhead associated with mutex contention will not be meaningful, for 
> high cardinality it can be more of a problem
> * Hash each chunk separately, then normalize at the end
> My guess is that a crude concurrent hash table with a mutex to protect 
> mutations and resizes is going to outperform the latter
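The second option above ("hash each chunk separately, then normalize at the end") can be sketched in plain Python (illustrative only; the real kernels work on Arrow arrays and hash tables):

```python
# Dictionary-encode one chunk independently (what each thread would do).
def encode_chunk(values):
    dictionary, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

# Normalize: merge per-chunk dictionaries into one and remap each chunk's
# indices into the merged dictionary.
def normalize(encoded_chunks):
    merged, seen, remapped = [], {}, []
    for dictionary, indices in encoded_chunks:
        mapping = []
        for v in dictionary:
            if v not in seen:
                seen[v] = len(merged)
                merged.append(v)
            mapping.append(seen[v])
        remapped.append([mapping[i] for i in indices])
    return merged, remapped

chunks = [["a", "b", "a"], ["b", "c"]]
merged, remapped = normalize([encode_chunk(c) for c in chunks])
assert merged == ["a", "b", "c"]
assert remapped == [[0, 1, 0], [1, 2]]
```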





[jira] [Updated] (ARROW-2882) [C++][Python] Support AWS Firehose partition_scheme implementation for Parquet datasets

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-2882:
--
Labels: dataset datasets parquet  (was: datasets parquet)

> [C++][Python] Support AWS Firehose partition_scheme implementation for 
> Parquet datasets
> ---
>
> Key: ARROW-2882
> URL: https://issues.apache.org/jira/browse/ARROW-2882
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Pablo Javier Takara
>Priority: Major
>  Labels: dataset, datasets, parquet
>
> I'd like to be able to read a ParquetDataset generated by AWS Firehose.
> The only implementation at the time of writing was the partition scheme 
> created by hive (year=2018/month=01/day=11).
> AWS Firehose partition scheme is a little bit different (2018/01/11).
>  
> Thanks
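The two partition path styles can be contrasted with a small sketch (hypothetical helpers; note that the Firehose style needs the field names supplied out-of-band, since the path itself doesn't carry them):

```python
# Hive-style: "year=2018/month=01/day=11" embeds the field names.
def parse_hive(path):
    return dict(part.split("=", 1) for part in path.split("/"))

# Firehose-style: "2018/01/11" carries only values; the caller must say
# which fields the path segments correspond to.
def parse_firehose(path, fields):
    return dict(zip(fields, path.split("/")))

assert parse_hive("year=2018/month=01/day=11") == \
       parse_firehose("2018/01/11", ["year", "month", "day"]) == \
       {"year": "2018", "month": "01", "day": "11"}
```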





[jira] [Updated] (ARROW-6244) [C++] Implement Partition DataSource

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6244:
--
Labels: dataset datasets  (was: datasets)

> [C++] Implement Partition DataSource
> 
>
> Key: ARROW-6244
> URL: https://issues.apache.org/jira/browse/ARROW-6244
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, datasets
>
> This is a DataSource that also has partition metadata. The end goal is to 
> support filtering with a DataSelector/Filter expression. The initial 
> implementation should not deal with PartitionScheme yet.





[jira] [Updated] (ARROW-4076) [Python] schema validation and filters

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-4076:
--
Labels: dataset datasets easyfix parquet pull-request-available  (was: 
datasets easyfix parquet pull-request-available)

> [Python] schema validation and filters
> --
>
> Key: ARROW-4076
> URL: https://issues.apache.org/jira/browse/ARROW-4076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: George Sakkis
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, datasets, easyfix, parquet, 
> pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently [schema 
> validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900]
>  of {{ParquetDataset}} takes place before filtering. This may raise a 
> {{ValueError}} if the schema is different in some dataset pieces, even if 
> these pieces would be subsequently filtered out. I think validation should 
> happen after filtering to prevent such spurious errors:
> {noformat}
> --- a/pyarrow/parquet.py  
> +++ b/pyarrow/parquet.py  
> @@ -878,13 +878,13 @@
>  if split_row_groups:
>  raise NotImplementedError("split_row_groups not yet implemented")
>  
> -if validate_schema:
> -self.validate_schemas()
> -
>  if filters is not None:
>  filters = _check_filters(filters)
>  self._filter(filters)
>  
> +if validate_schema:
> +self.validate_schemas()
> +
>  def validate_schemas(self):
>  open_file = self._get_open_file_func()
> {noformat}





[jira] [Assigned] (ARROW-6214) [R] Sanitizer errors triggered via R bindings

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6214:
-

Assignee: Francois Saint-Jacques

> [R] Sanitizer errors triggered via R bindings
> -
>
> Key: ARROW-6214
> URL: https://issues.apache.org/jira/browse/ARROW-6214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
> Environment: Linux
>Reporter: Jeroen
>Assignee: Francois Saint-Jacques
>Priority: Critical
> Fix For: 0.15.0
>
>
> When we run the examples of the R package through the sanitizers, several 
> errors show up. These could be related to the segfaults we saw on the macos 
> builder on CRAN.
> We use the docker container provided by Winston Chang to test this: 
> https://github.com/wch/r-debug
> Steps to reproduce + example outputs at: 
> https://gist.github.com/jeroen/111901c351a4089a9effa90691a1dd81





[jira] [Updated] (ARROW-4470) [Python] Pyarrow using considerable more memory when reading partitioned Parquet file

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-4470:
--
Labels: dataset datasets parquet  (was: datasets parquet)

> [Python] Pyarrow using considerable more memory when reading partitioned 
> Parquet file
> -
>
> Key: ARROW-4470
> URL: https://issues.apache.org/jira/browse/ARROW-4470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
>Reporter: Ivan SPM
>Priority: Major
>  Labels: dataset, datasets, parquet
> Fix For: 1.0.0
>
>
> Hi,
> I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, 
> with the following structure:
> {{/data/myparquettable/year=2016}}
> {{/data/myparquettable/year=2016/myfile_1.prt}}
> {{/data/myparquettable/year=2016/myfile_2.prt}}
> {{/data/myparquettable/year=2016/myfile_3.prt}}
> {{/data/myparquettable/year=2017}}
> {{/data/myparquettable/year=2017/myfile_1.prt}}
> {{/data/myparquettable/year=2017/myfile_2.prt}}
> {{/data/myparquettable/year=2017/myfile_3.prt}}
> and so on. I need to work with one partition, so I copied one partition to a 
> local filesystem:
> {{hdfs fs -get /data/myparquettable/year=2017 /local/}}
> so now I have some data on the local disk:
> {{/local/year=2017/myfile_1.prt}}
> {{/local/year=2017/myfile_2.prt}}
> etc. I tried to read it using Pyarrow:
> {{import pyarrow.parquet as pq}}
> {{pq.read_parquet('/local/year=2017')}}
> and it starts reading. The problem is that the local Parquet files are around 
> 15GB total, and I blew up my machine memory a couple of times because when 
> reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure 
> how much it will take because it never finishes. Is this expected? Is there a 
> workaround?
>  





[jira] [Updated] (ARROW-1036) [C++] Define abstract API for filtering Arrow streams (e.g. predicate evaluation)

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-1036:
--
Labels: dataset datasets  (was: datasets)

> [C++] Define abstract API for filtering Arrow streams (e.g. predicate 
> evaluation)
> -
>
> Key: ARROW-1036
> URL: https://issues.apache.org/jira/browse/ARROW-1036
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, datasets
> Fix For: 1.0.0
>
>
> It would be useful to be able to apply analytic predicates to an Arrow stream 
> in a composable way. As soon as we are able to compute some simple predicates 
> on in-memory Arrow data, we could define our first version of this





[jira] [Updated] (ARROW-1089) [C++/Python] Add API to write an Arrow stream into either the stream or file formats on disk

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-1089:
--
Labels: dataset datasets  (was: datasets)

> [C++/Python] Add API to write an Arrow stream into either the stream or file 
> formats on disk
> 
>
> Key: ARROW-1089
> URL: https://issues.apache.org/jira/browse/ARROW-1089
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, datasets
> Fix For: 1.0.0
>
>
> For Arrow streams with unknown size, it would be useful to be able to write 
> the data to disk either as a stream or as the file format (for random access) 
> with minimal overhead; i.e. we would avoid record batch IPC loading and write 
> the raw messages directly to disk





[jira] [Updated] (ARROW-6243) [C++] Implement basic Filter expression classes

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6243:
--
Labels: dataset datasets  (was: datasets)

> [C++] Implement basic Filter expression classes
> ---
>
> Key: ARROW-6243
> URL: https://issues.apache.org/jira/browse/ARROW-6243
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: dataset, datasets
>
> This will draft the basic classes for creating boolean expressions that are 
> passed to the DataSources/DataFragments for predicate push-down.





[jira] [Updated] (ARROW-2366) [Python] Support reading Parquet files having a permutation of column order

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-2366:
--
Labels: dataset datasets parquet  (was: datasets parquet)

> [Python] Support reading Parquet files having a permutation of column order
> ---
>
> Key: ARROW-2366
> URL: https://issues.apache.org/jira/browse/ARROW-2366
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, datasets, parquet
> Fix For: 1.0.0
>
>
> See discussion in https://github.com/dask/fastparquet/issues/320





[jira] [Updated] (ARROW-3424) [Python] Improved workflow for loading an arbitrary collection of Parquet files

2019-08-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-3424:
--
Labels: dataset datasets parquet  (was: datasets parquet)

> [Python] Improved workflow for loading an arbitrary collection of Parquet 
> files
> ---
>
> Key: ARROW-3424
> URL: https://issues.apache.org/jira/browse/ARROW-3424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, datasets, parquet
> Fix For: 1.0.0
>
>
> See SO question for use case: 
> https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis





[jira] [Resolved] (ARROW-5992) [C++] Array::View fails for string/utf8 as binary

2019-08-20 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5992.
---
Resolution: Fixed

Issue resolved by pull request 5125
[https://github.com/apache/arrow/pull/5125]

> [C++] Array::View fails for string/utf8 as binary
> -
>
> Key: ARROW-5992
> URL: https://issues.apache.org/jira/browse/ARROW-5992
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I encountered this
> {code}
> -- Arrow Fatal Error --
> Invalid: Can't view array of type string as binary: not enough buffers for 
> view type
> In ../src/arrow/array.cc, line 1049, code: CheckInputAvailable()
> In ../src/arrow/array.cc, line 1100, code: impl.MakeDataView(out_field, 
> &out_data)
> {code}
> when trying to add a {{BinaryWithRepeats}} function to 
> {{RandomArrayGenerator}}
> {code}
>   std::shared_ptr<Array> out;
>   auto strings = StringWithRepeats(size, unique, min_length, max_length,
>null_probability);
>   ABORT_NOT_OK(strings->View(binary(), &out));
>   return out;
> {code}
> It looks like utf8 <-> binary views simply aren't tested in array-view-test





[jira] [Resolved] (ARROW-6183) [R] Document that you don't have to use tidyselect if you don't want

2019-08-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6183.
---
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5144
[https://github.com/apache/arrow/pull/5144]

> [R] Document that you don't have to use tidyselect if you don't want
> 
>
> Key: ARROW-6183
> URL: https://issues.apache.org/jira/browse/ARROW-6183
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Reporter: James Lamb
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I noticed tonight that several functions from the *tidyselect* package are 
> re-exported by *arrow*. Why is this necessary? In my opinion, the *arrow* R 
> package should strive to have as few dependencies as possible and should have 
> no opinion about which parts of the R ecosystem ("tidy" or otherwise) are 
> used with it.
> I think it would be valuable to cut the *tidyselect* re-exports, and to make 
> *feather::read_feather()*'s argument *col_select* take a character vector of 
> column names instead of a "*tidyselect::vars_select()"* object. I think that 
> would be more natural and would be intuitive for a broader group of R users.
> Would you be open to removing *tidyselect* and changing 
> *feather::read_feather()* this way?





[jira] [Commented] (ARROW-6214) [R] Sanitizer errors triggered via R bindings

2019-08-22 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913446#comment-16913446
 ] 

Francois Saint-Jacques commented on ARROW-6214:
---

See attached files for full stack traces of the reported errors. Some of them 
look legitimate (array__to_vector.cc and array__from_vector.cc) and some look 
related to an upstream package (Rcpp).

 [^RDcsan.failures]  [^RDsan.failures] 

> [R] Sanitizer errors triggered via R bindings
> -
>
> Key: ARROW-6214
> URL: https://issues.apache.org/jira/browse/ARROW-6214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
> Environment: Linux
>Reporter: Jeroen
>Assignee: Francois Saint-Jacques
>Priority: Critical
> Fix For: 0.15.0
>
> Attachments: RDcsan.failures, RDsan.failures
>
>
> When we run the examples of the R package through the sanitizers, several 
> errors show up. These could be related to the segfaults we saw on the macos 
> builder on CRAN.
> We use the docker container provided by Winston Chang to test this: 
> https://github.com/wch/r-debug
> Steps to reproduce + example outputs at: 
> https://gist.github.com/jeroen/111901c351a4089a9effa90691a1dd81





[jira] [Updated] (ARROW-6214) [R] Sanitizer errors triggered via R bindings

2019-08-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6214:
--
Attachment: RDsan.failures
RDcsan.failures

> [R] Sanitizer errors triggered via R bindings
> -
>
> Key: ARROW-6214
> URL: https://issues.apache.org/jira/browse/ARROW-6214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
> Environment: Linux
>Reporter: Jeroen
>Assignee: Francois Saint-Jacques
>Priority: Critical
> Fix For: 0.15.0
>
> Attachments: RDcsan.failures, RDsan.failures
>
>
> When we run the examples of the R package through the sanitizers, several 
> errors show up. These could be related to the segfaults we saw on the macos 
> builder on CRAN.
> We use the docker container provided by Winston Chang to test this: 
> https://github.com/wch/r-debug
> Steps to reproduce + example outputs at: 
> https://gist.github.com/jeroen/111901c351a4089a9effa90691a1dd81





[jira] [Resolved] (ARROW-5966) [Python] Capacity error when converting large UTF32 numpy array to arrow array

2019-08-20 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-5966.
---
Resolution: Fixed

Issue resolved by pull request 5122
[https://github.com/apache/arrow/pull/5122]

> [Python] Capacity error when converting large UTF32 numpy array to arrow array
> --
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
> li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  





[jira] [Resolved] (ARROW-6048) [C++] Add ChunkedArray::View which calls to Array::View

2019-08-20 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6048.
---
Resolution: Fixed

Issue resolved by pull request 5127
[https://github.com/apache/arrow/pull/5127]

> [C++] Add ChunkedArray::View which calls to Array::View
> ---
>
> Key: ARROW-6048
> URL: https://issues.apache.org/jira/browse/ARROW-6048
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This convenience will help with zero-copy casting from one compatible type to 
> another
> I implemented a workaround for this in ARROW-3772





[jira] [Assigned] (ARROW-5141) [C++] Share more of the IPC testing utils with the rest of Arrow

2019-08-20 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-5141:
-

Assignee: (was: Francois Saint-Jacques)

> [C++] Share more of the IPC testing utils with the rest of Arrow
> 
>
> Key: ARROW-5141
> URL: https://issues.apache.org/jira/browse/ARROW-5141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Minor
>
> Some APIs in {{arrow/ipc/test-common.h}} aren't really IPC-specific. 
> Furthermore, {{arrow/ipc/test-common.h}} is already included in non-IPC 
> tests. Those APIs should be moved to the Arrow-wide testing utilities.





[jira] [Assigned] (ARROW-5082) [Python][Packaging] Reduce size of macOS and manylinux1 wheels

2019-08-20 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-5082:
-

Assignee: (was: Francois Saint-Jacques)

> [Python][Packaging] Reduce size of macOS and manylinux1 wheels
> --
>
> Key: ARROW-5082
> URL: https://issues.apache.org/jira/browse/ARROW-5082
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> The wheels more than tripled in size from 0.12.0 to 0.13.0. I think this is 
> mostly because of LLVM but we should take a closer look to see if the size 
> can be reduced





[jira] [Assigned] (ARROW-5630) [Python] Table of nested arrays doesn't round trip

2019-08-20 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-5630:
-

Assignee: (was: Francois Saint-Jacques)

> [Python] Table of nested arrays doesn't round trip
> --
>
> Key: ARROW-5630
> URL: https://issues.apache.org/jira/browse/ARROW-5630
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: pyarrow 0.13, Windows 10
>Reporter: Philip Felton
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> def make_table(num_rows):
> typ = pa.list_(pa.field("item", pa.float32(), False))
> return pa.Table.from_arrays([
> pa.array([[0] * (i%10) for i in range(0, num_rows)], type=typ),
> pa.array([[0] * ((i+5)%10) for i in range(0, num_rows)], type=typ)
> ], ['a', 'b'])
> pq.write_table(make_table(100), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pq.read_table('full.parquet')
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, 
> columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>1150 return fs.read_parquet(path, columns=columns,
>1151use_threads=use_threads, 
> metadata=metadata,
> -> 1152
> use_pandas_metadata=use_pandas_metadata)
>1153 
>1154 pf = ParquetFile(source, metadata=metadata)
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, 
> path, columns, metadata, schema, use_threads, use_pandas_metadata)
> 179  filesystem=self)
> 180 return dataset.read(columns=columns, use_threads=use_threads,
> --> 181 use_pandas_metadata=use_pandas_metadata)
> 182 
> 183 def open(self, path, mode='rb'):
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
>1012 table = piece.read(columns=columns, 
> use_threads=use_threads,
>1013partitions=self.partitions,
> -> 1014
> use_pandas_metadata=use_pandas_metadata)
>1015 tables.append(table)
>1016 
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, partitions, open_file_func, file, use_pandas_metadata)
> 562 table = reader.read_row_group(self.row_group, **options)
> 563 else:
> --> 564 table = reader.read(**options)
> 565 
> 566 if len(self.partition_keys) > 0:
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
> 212 columns, use_pandas_metadata=use_pandas_metadata)
> 213 return self.reader.read_all(column_indices=column_indices,
> --> 214 use_threads=use_threads)
> 215 
> 216 def scan_contents(self, columns=None, batch_size=65536):
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}





[jira] [Resolved] (ARROW-6046) [C++] Slice RecordBatch of String array with offset 0 returns whole batch

2019-08-20 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6046.
---
Resolution: Fixed

Issue resolved by pull request 5126
[https://github.com/apache/arrow/pull/5126]

> [C++] Slice RecordBatch of String array with offset 0 returns whole batch
> -
>
> Key: ARROW-6046
> URL: https://issues.apache.org/jira/browse/ARROW-6046
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.14.1
>Reporter: Sascha Hofmann
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We are seeing a bug very similar to ARROW-809, just for a RecordBatch of 
> strings. A slice of a RecordBatch with a string column and offset = 0 returns 
> the whole batch instead.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({ 'b': ['test' for x in range(1000_000)]})
> tbl = pa.Table.from_pandas(df)
> batch = tbl.to_batches()[0]
> batch.slice(0,2).serialize().size
> # 4000232
> batch.slice(1,2).serialize().size
> # 240
> {code}
>  





[jira] [Commented] (ARROW-6362) [C++] S3: more flexible credential options

2019-08-26 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916011#comment-16916011
 ] 

Francois Saint-Jacques commented on ARROW-6362:
---

I think the exposed (high-level) interface is either:
- Static credentials
- A default behavior that follows the [aws cli 
defaults](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html); 
this can be achieved with 
[DefaultAWSCredentialsProviderChain](http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_auth_1_1_a_w_s_credentials_provider.html).

In the code, this would translate to storing the AWSCredentialsProviderChain in 
the Options struct. The end goal is that users don't need to change their 
existing credentials setup.

{code:python}
fs = s3fs(..., auth=None)
# the default which resolves to DefaultChain
fs = s3fs(..., auth=(None, None))
# AnonymousAWSCredentialsProvider
fs = s3fs(..., auth=(access, secret))
# SimpleAWSCredentialsProvider
{code}
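The dispatch sketched above can be written out explicitly; the provider names are the AWS C++ SDK's, while the `resolve_provider` helper and the shape of the `auth` argument are hypothetical:

```python
# Hypothetical dispatch from a user-facing `auth` argument to an AWS
# credentials provider, mirroring the three cases above.
def resolve_provider(auth=None):
    if auth is None:
        # Follow the AWS CLI lookup order (env vars, config files, IAM role).
        return "DefaultAWSCredentialsProviderChain"
    access, secret = auth
    if access is None and secret is None:
        # Explicitly anonymous access, e.g. for public buckets.
        return "AnonymousAWSCredentialsProvider"
    # A static (access key, secret key) pair.
    return "SimpleAWSCredentialsProvider"


print(resolve_provider())  # DefaultAWSCredentialsProviderChain
```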

> [C++] S3: more flexible credential options
> --
>
> Key: ARROW-6362
> URL: https://issues.apache.org/jira/browse/ARROW-6362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> We should perhaps allow passing an optional {{AWSCredentialsProvider}} to 
> {{S3FileSystem::Make}}, all the while keeping an option for a (access key, 
> secret key) pair.
> http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_auth_1_1_a_w_s_credentials_provider.html





<    1   2   3   4   5   6   7   8   9   10   >