[jira] [Updated] (ARROW-5217) [Rust] [CI] DataFusion test failure

2019-04-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5217:
--
Labels: pull-request-available  (was: )

> [Rust] [CI] DataFusion test failure
> ---
>
> Key: ARROW-5217
> URL: https://issues.apache.org/jira/browse/ARROW-5217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Rust, Rust - DataFusion
>Reporter: Antoine Pitrou
>Assignee: Andy Grove
>Priority: Blocker
>  Labels: pull-request-available
>
> Travis-CI Rust jobs have started failing consistently with a DataFusion test 
> failure.
> Example here:
> https://travis-ci.org/apache/arrow/jobs/524542965
> {code}
>  
> execution::aggregate::tests::test_min_max_sum_count_avg_f64_group_by_uint32 
> stdout 
> thread 
> 'execution::aggregate::tests::test_min_max_sum_count_avg_f64_group_by_uint32' 
> panicked at 'assertion failed: `(left == right)`
>   left: `2`,
>  right: `5`', datafusion/src/execution/aggregate.rs:1437:9
> note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5176) [Python] Automate formatting of python files

2019-04-26 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827299#comment-16827299
 ] 

Neal Richardson commented on ARROW-5176:


FTR running black the first time would touch a lot of files, so this would 
require a bit of work resolving merge conflicts with some open PRs: 
{code:java}
All done! ✨ 🍰 ✨
62 files reformatted, 11 files left unchanged.
{code}

> [Python] Automate formatting of python files
> 
>
> Key: ARROW-5176
> URL: https://issues.apache.org/jira/browse/ARROW-5176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Benjamin Kietzman
>Priority: Minor
>
> [Black](https://github.com/ambv/black) is a tool for automatically formatting 
> Python code in ways that flake8 and our other linters approve of. Adding it 
> to the project will give us more reliably formatted Python code and fill a 
> role similar to that of {{clang-format}} for C++ and {{cmake-format}} for CMake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4694) [CI] detect-changes.py is inconsistent

2019-04-26 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827271#comment-16827271
 ] 

Neal Richardson commented on ARROW-4694:


I encountered this too. [This PR|https://github.com/apache/arrow/pull/4210] 
altered two files in the `python` directory; however, all Travis builds ran, 
and the Rust build failed. According to the [Rust job 
log|https://travis-ci.org/apache/arrow/jobs/524779520], it thought that two 
additional files had been modified as well:
{code:java}
$ eval `python $TRAVIS_BUILD_DIR/ci/detect-changes.py`
Affected files: [u'ci/conda_env_sphinx.yml', 
u'cpp/cmake_modules/ThirdpartyToolchain.cmake', u'python/pyarrow/error.pxi', 
u'python/pyarrow/tests/test_csv.py']
Affected topics:
{'c_glib': True,
'cpp': True,
'csharp': True,
'dev': True,
'docs': True,
'go': True,
'integration': True,
'java': True,
'js': True,
'python': True,
'r': True,
'ruby': True,
'rust': True,
'site': True}
{code}
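
The script's job essentially boils down to mapping the affected file paths onto 
the set of build topics to trigger. A simplified, illustrative sketch of that 
idea is below; this is not the actual {{ci/detect-changes.py}} code, and the 
prefix table is hypothetical and abbreviated. The inconsistency discussed in 
this issue concerns how the real script obtains the changed-file list from the 
git commit range, not this mapping step.
{code:python}
# Illustration only: map changed file paths to affected build topics.
PREFIX_TO_TOPICS = {          # hypothetical, abbreviated table
    "python/": {"python"},
    "rust/": {"rust"},
    "js/": {"js"},
    "cpp/": {"cpp", "python", "c_glib", "ruby", "r"},
}

def affected_topics(changed_files):
    topics = set()
    for path in changed_files:
        for prefix, deps in PREFIX_TO_TOPICS.items():
            if path.startswith(prefix):
                topics |= deps
    return topics

# With the correct file list for the PR above, only "python" would trigger:
print(affected_topics(["python/pyarrow/error.pxi",
                       "python/pyarrow/tests/test_csv.py"]))
{code}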

> [CI] detect-changes.py is inconsistent
> --
>
> Key: ARROW-4694
> URL: https://issues.apache.org/jira/browse/ARROW-4694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Affects Versions: 0.12.1
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: travis-ci
> Fix For: 0.14.0
>
>
> Some examples of pull-requests with wrong affected files:
>    - [pr-3762|https://github.com/apache/arrow/pull/3762/files] shouldn't 
> trigger [javascript|https://travis-ci.org/apache/arrow/jobs/498805479#L217]
>    - [pr-3767|https://github.com/apache/arrow/pull/3767/files] shouldn't 
> affect files found in 
> [rust|https://travis-ci.org/apache/arrow/jobs/499122044] and 
> [javascript|https://travis-ci.org/apache/arrow/jobs/499122041#L217]
> In 
> [get_travis_commit_range|https://github.com/apache/arrow/blob/master/ci/detect-changes.py#L63-L67], 
> it references the following 
> [comment|https://github.com/travis-ci/travis-ci/issues/4596#issuecomment-139811122]. 
> If you read further down in the 
> [thread|https://github.com/travis-ci/travis-ci/issues/4596#issuecomment-434532772], 
> you'll note that it can go bonkers due to shallow clones and the 
> branch-creation commit. I'm not sure if this is the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5176) [Python] Automate formatting of python files

2019-04-26 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827264#comment-16827264
 ] 

Neal Richardson commented on ARROW-5176:


+1 for this, and I'll go further and propose a pre-commit hook to run `black` 
so that developers don't have to waste energy thinking about linting. At a 
minimum we should add a pre-commit hook that runs flake8 (per the [dev 
instructions|https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#coding-style]).
 I got a Travis failure for linting, and IMO that should never happen.
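
A minimal sketch of what such a hook could look like (hypothetical; nothing 
like this exists in the repo today, and it assumes {{black}} is installed in 
the developer's environment). It would be saved as {{.git/hooks/pre-commit}} 
and made executable:
{code:python}
#!/usr/bin/env python
# Hypothetical pre-commit hook sketch: run black --check on staged Python
# files and block the commit if any of them would be reformatted.
import subprocess
import sys

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()
py_files = [f for f in staged if f.endswith(".py")]

if py_files and subprocess.run(["black", "--check"] + py_files).returncode != 0:
    sys.exit("black would reformat the files listed above; run black and re-stage them.")
{code}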

> [Python] Automate formatting of python files
> 
>
> Key: ARROW-5176
> URL: https://issues.apache.org/jira/browse/ARROW-5176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Benjamin Kietzman
>Priority: Minor
>
> [Black](https://github.com/ambv/black) is a tool for automatically formatting 
> Python code in ways that flake8 and our other linters approve of. Adding it 
> to the project will give us more reliably formatted Python code and fill a 
> role similar to that of {{clang-format}} for C++ and {{cmake-format}} for CMake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4963) [C++] MSVC build invokes CMake repeatedly

2019-04-26 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4963:

Fix Version/s: (was: 0.14.0)

> [C++] MSVC build invokes CMake repeatedly
> -
>
> Key: ARROW-4963
> URL: https://issues.apache.org/jira/browse/ARROW-4963
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> I'm doing a pretty vanilla out-of-source build with Visual Studio 2015 and I 
> am finding that it re-runs CMake many times throughout the build. I will 
> try to produce a complete log to illustrate when I can. I am using these 
> commands:
> {code}
>cmake -G "Visual Studio 14 2015 Win64" ^
>  -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
>  -DARROW_CXXFLAGS="/WX /MP" ^
>  -DARROW_GANDIVA=on ^
>  -DARROW_ORC=on ^
>  -DARROW_PARQUET=on ^
>  -DARROW_PYTHON=on ..
>cmake --build . --target INSTALL --config Release
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4993) [C++] Display summary at the end of CMake configuration

2019-04-26 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827257#comment-16827257
 ] 

Wes McKinney commented on ARROW-4993:
-

Here's how they implement it:

https://github.com/apache/thrift/blob/71afec0ea3fc700d5f0d1c46512723963bf1e2f7/build/cmake/DefineOptions.cmake#L145

https://github.com/apache/thrift/blob/master/CMakeLists.txt#L130



> [C++] Display summary at the end of CMake configuration
> ---
>
> Key: ARROW-4993
> URL: https://issues.apache.org/jira/browse/ARROW-4993
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 0.14.0
>
>
> Some third-party projects like Thrift display a nice and useful summary of 
> the build configuration at the end of the CMake configuration run:
> https://ci.appveyor.com/project/pitrou/arrow/build/job/mgi68rvk0u5jf2s4?fullLog=true#L2325
> It may be good to have a similar thing in Arrow as well. Bonus points if, for 
> each configuration item, it says which CMake variable can be used to 
> influence it.
> Something like:
> {code}
> -- Build ZSTD support: ON  [change using ARROW_WITH_ZSTD]
> -- Build BZ2 support:  OFF [change using ARROW_WITH_BZ2]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5085) [Python/C++] Conversion of dict encoded null column fails in parquet writing when using RowGroups

2019-04-26 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5085:
-
Labels: parquet  (was: )

> [Python/C++] Conversion of dict encoded null column fails in parquet writing 
> when using RowGroups
> -
>
> Key: ARROW-5085
> URL: https://issues.apache.org/jira/browse/ARROW-5085
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Florian Jetter
>Priority: Minor
>  Labels: parquet
>
> Conversion of dict encoded null column fails in parquet writing when using 
> RowGroups
> {code:python}
> import pyarrow.parquet as pq
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({"col": [None] * 100, "int": [1.0] * 100})
> df = df.astype({"col": "category"})
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(
> table,
> buf,
> version="2.0",
> chunk_size=10,
> )
> {code}
> fails with 
> {{pyarrow.lib.ArrowIOError: Column 2 had 100 while previous column had 10}}
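> (A hedged aside added for illustration, not from the original report: per the 
> title, the failure appears only when row groups are used, so writing the same 
> table without {{chunk_size}}, i.e. as a single row group, may be a useful 
> point of comparison.)
> {code:python}
> # Comparison sketch: the same table written without chunk_size (one row group).
> buf2 = pa.BufferOutputStream()
> pq.write_table(table, buf2, version="2.0")
> {code}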



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5089) [C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size

2019-04-26 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5089:
-
Labels: parquet performance  (was: performance)

> [C++/Python] Writing dictionary encoded columns to parquet is extremely slow 
> when using chunk size
> --
>
> Key: ARROW-5089
> URL: https://issues.apache.org/jira/browse/ARROW-5089
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Florian Jetter
>Priority: Major
>  Labels: parquet, performance
>
> Currently, a workaround is in place for writing dict encoded columns to 
> parquet: the dict encoded array is converted to its plain version before 
> writing. This is painfully slow, since the entire array is converted again 
> for every row group.
> The following example is orders of magnitude slower than the non-dict encoded 
> version:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> df = pd.DataFrame({"col": ["A", "B"] * 10}).astype("category")
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(
> table,
> buf,
> chunk_size=100,
> )
>  {code}
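> For comparison, here is a sketch of the "non-dict encoded version" mentioned 
> above, i.e. the same write without the {{.astype("category")}} cast (added 
> for illustration; not part of the original report):
> {code:python}
> # Same data without the categorical cast, as a timing comparison.
> df_plain = pd.DataFrame({"col": ["A", "B"] * 10})
> table_plain = pa.Table.from_pandas(df_plain)
> buf_plain = pa.BufferOutputStream()
> pq.write_table(table_plain, buf_plain, chunk_size=100)
> {code}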



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5222) [Python] Issues with installing pyarrow for development on MacOS

2019-04-26 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-5222:
--

 Summary: [Python] Issues with installing pyarrow for development 
on MacOS
 Key: ARROW-5222
 URL: https://issues.apache.org/jira/browse/ARROW-5222
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Reporter: Neal Richardson
 Fix For: 0.14.0


I tried following the 
[instructions|https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst]
 for installing pyarrow for developers on macOS, and I ran into quite a bit of 
difficulty. I'm hoping we can improve our documentation and/or tooling to make 
this a smoother process. 

I know we can't anticipate every quirk of everyone's dev environment, but in my 
case I was getting set up on a new machine, so this was from a clean slate. I'm 
also new to contributing to the project, so I'm a "clean slate" in that regard 
too; my ignorance may be exposing other assumptions in the docs.
 # The instructions recommend using conda, but as this [Stack Overflow 
question|https://stackoverflow.com/questions/55798166/cmake-fails-with-when-attempting-to-compile-simple-test-program]
 notes, cmake fails. Uwe helpfully suggested installing an older macOS SDK from 
[here|https://github.com/phracker/MacOSX-SDKs/releases]. That may work, but I'm 
personally wary of installing binaries from an unofficial GitHub account, let 
alone recording that in our docs as an official recommendation. We should 
update the docs either to note this requirement or to recommend against 
installing with conda on macOS.
 # After that, I tried the Homebrew path. Ultimately this did succeed, but it 
was rough. I had to `brew install` a lot of packages that weren't included in 
the arrow/python/Brewfile (i.e. try `cmake`, see which missing dependency it 
failed on, `brew install` it, retry `cmake`, and repeat). Among the libs I 
installed this way were double-conversion, snappy, brotli, protobuf, gtest, 
rapidjson, flatbuffers, lz4, zstd, c-ares, and boost. It's not clear how many 
of these extra dependencies were needed only because I'd installed the Xcode 
command-line tools and not the full Xcode from the App Store; regardless, the 
Brewfile should be complete if we want to use it.
 # In searching Jira for the double-conversion issue (the first one I hit), I 
found [this issue/PR|https://github.com/apache/arrow/pull/4132/files], which 
added double-conversion to a different Brewfile, in c_glib. So I tried `brew 
bundle` installing that Brewfile. It would probably be good to have a common 
Brewfile for the C++ setup, which the python and glib ones could load and then 
add any other extra dependencies, if necessary. That way, there's one place to 
add common dependencies.
 # I got close here but still had issues with `BOOST_HOME` not being found, 
even though I had brew-installed it. From the console output, it appeared that 
even though I was not using conda and did not have an active conda environment 
(I'd even done `conda env remove --name pyarrow-dev`), the cmake configuration 
script detected that conda existed and decided to use conda to resolve 
dependencies. I tried setting lots of different environment variables to tell 
cmake not to use conda, but ultimately I was only able to get past this by 
deleting conda from my system entirely.
 # This let me get to the point of being able to `import pyarrow`. But then 
running tests failed because the `hypothesis` package was not installed. I see 
that it is included in requirements-test.txt and setup.py under tests_require, 
but I followed the installation instructions and this package did not end up in 
my virtualenv. `pip install hypothesis` resolved it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5212) [Go] Array BinaryBuilder in Go library has no access to resize the values buffer

2019-04-26 Thread Sebastien Binet (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastien Binet resolved ARROW-5212.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4204
[https://github.com/apache/arrow/pull/4204]

> [Go] Array BinaryBuilder in Go library has no access to resize the values 
> buffer
> 
>
> Key: ARROW-5212
> URL: https://issues.apache.org/jira/browse/ARROW-5212
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Jonathan A Sternberg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When you are dealing with a binary builder, there are three buffers: the null 
> bitmap, the offset indexes, and the values buffer which contains the actual 
> data.
> When {{Reserve}} or {{Resize}} are used, the null bitmap and the offsets are 
> modified to allow additional appends to function. This seems correct to me: 
> with just the number of values alone, there's no way to know how much the 
> values buffer should be resized until the values are actually appended.
> But when you are then appending a bunch of string values, there's no 
> additional API to preallocate the size of that last buffer. That means that 
> batch appending a large number of strings will constantly allocate even if 
> you know the size ahead of time.
> There should be some additional API to modify this last buffer, such as 
> {{ReserveBytes}} and {{ResizeBytes}} methods that would correspond to 
> {{Reserve}} and {{Resize}} but would relate to the values buffer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5214) [C++] Offline dependency downloader misses some libraries

2019-04-26 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5214.
-
Resolution: Fixed

Issue resolved by pull request 4214
[https://github.com/apache/arrow/pull/4214]

> [C++] Offline dependency downloader misses some libraries
> -
>
> Key: ARROW-5214
> URL: https://issues.apache.org/jira/browse/ARROW-5214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Not sure yet but maybe this was introduced by 
> https://github.com/apache/arrow/commit/f913d8f0adff71c288a10f6c1b0ad2d1ab3e9e32
> {code}
> $ thirdparty/download_dependencies.sh /home/wesm/arrow-thirdparty
> # Environment variables for offline Arrow build
> export ARROW_BOOST_URL=/home/wesm/arrow-thirdparty/boost-1.67.0.tar.gz
> export ARROW_BROTLI_URL=/home/wesm/arrow-thirdparty/brotli-v1.0.7.tar.gz
> export ARROW_CARES_URL=/home/wesm/arrow-thirdparty/cares-1.15.0.tar.gz
> export 
> ARROW_DOUBLE_CONVERSION_URL=/home/wesm/arrow-thirdparty/double-conversion-v3.1.4.tar.gz
> export 
> ARROW_FLATBUFFERS_URL=/home/wesm/arrow-thirdparty/flatbuffers-v1.10.0.tar.gz
> export 
> ARROW_GBENCHMARK_URL=/home/wesm/arrow-thirdparty/gbenchmark-v1.4.1.tar.gz
> export ARROW_GFLAGS_URL=/home/wesm/arrow-thirdparty/gflags-v2.2.0.tar.gz
> export ARROW_GLOG_URL=/home/wesm/arrow-thirdparty/glog-v0.3.5.tar.gz
> export ARROW_GRPC_URL=/home/wesm/arrow-thirdparty/grpc-v1.20.0.tar.gz
> export ARROW_GTEST_URL=/home/wesm/arrow-thirdparty/gtest-1.8.1.tar.gz
> export ARROW_LZ4_URL=/home/wesm/arrow-thirdparty/lz4-v1.8.3.tar.gz
> export ARROW_ORC_URL=/home/wesm/arrow-thirdparty/orc-1.5.5.tar.gz
> export ARROW_PROTOBUF_URL=/home/wesm/arrow-thirdparty/protobuf-v3.7.1.tar.gz
> export 
> ARROW_RAPIDJSON_URL=/home/wesm/arrow-thirdparty/rapidjson-2bbd33b33217ff4a73434ebf10cdac41e2ef5e34.tar.gz
> export ARROW_RE2_URL=/home/wesm/arrow-thirdparty/re2-2019-04-01.tar.gz
> {code}
> The 5 dependencies listed after RE2 are not downloaded



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-26 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826997#comment-16826997
 ] 

Wes McKinney commented on ARROW-5130:
-

See https://github.com/apache/arrow/tree/master/python/manylinux1

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x in ?? ()
> (gdb) bt
> #0 0x in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 , 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
> #12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at 
> dlopen.c:66
> #13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
> errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 
> , args=0x7ffc6d8bedd0) at dl-error.c:187
> #14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
> , args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
> #15 0x7f529ebf40c1 in __dlopen (file=, mode= out>) at dlopen.c:87
> #16 0x00540859 in _PyImport_GetDynLoadFunc ()
> #17 0x0054024c in _PyImport_LoadDynamicModule ()
> #18 0x005f2bcb in ?? ()
> #19 0x004ca235 in PyEval_EvalFrameEx ()
> #20 0x004ca9c2 in PyEval_EvalFrameEx ()
> #21 0x004c8c39 in PyEval_EvalCodeEx ()
> #22 0x004c84e6 in PyEval_EvalCode ()
> #23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #24 0x004c3272 in ?? ()
> #25 0x004b19e2 in ?? ()
> #26 0x004b13d7 in ?? ()
> #27 0x004b42f6 in ?? ()
> #28 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #29 0x004ccdb3 in PyEval_EvalFrameEx ()
> #30 0x004c8c39 in PyEval_EvalCodeEx ()
> #31 0x004c84e6 in PyEval_EvalCode ()
> #32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #33 0x004c3272 in ?? ()
> #34 0x004b1d3f in ?? ()
> #35 0x004b6b2b in ?? ()
> #36 0x004b0d82 in ?? ()
> #37 0x004b42f6 in ?? ()
> #38 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
> It looks like the code changes that fixed the previous issue were recently 
> removed in 
> [https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5214) [C++] Offline dependency downloader misses some libraries

2019-04-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5214:
--
Labels: pull-request-available  (was: )

> [C++] Offline dependency downloader misses some libraries
> -
>
> Key: ARROW-5214
> URL: https://issues.apache.org/jira/browse/ARROW-5214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Not sure yet but maybe this was introduced by 
> https://github.com/apache/arrow/commit/f913d8f0adff71c288a10f6c1b0ad2d1ab3e9e32
> {code}
> $ thirdparty/download_dependencies.sh /home/wesm/arrow-thirdparty
> # Environment variables for offline Arrow build
> export ARROW_BOOST_URL=/home/wesm/arrow-thirdparty/boost-1.67.0.tar.gz
> export ARROW_BROTLI_URL=/home/wesm/arrow-thirdparty/brotli-v1.0.7.tar.gz
> export ARROW_CARES_URL=/home/wesm/arrow-thirdparty/cares-1.15.0.tar.gz
> export 
> ARROW_DOUBLE_CONVERSION_URL=/home/wesm/arrow-thirdparty/double-conversion-v3.1.4.tar.gz
> export 
> ARROW_FLATBUFFERS_URL=/home/wesm/arrow-thirdparty/flatbuffers-v1.10.0.tar.gz
> export 
> ARROW_GBENCHMARK_URL=/home/wesm/arrow-thirdparty/gbenchmark-v1.4.1.tar.gz
> export ARROW_GFLAGS_URL=/home/wesm/arrow-thirdparty/gflags-v2.2.0.tar.gz
> export ARROW_GLOG_URL=/home/wesm/arrow-thirdparty/glog-v0.3.5.tar.gz
> export ARROW_GRPC_URL=/home/wesm/arrow-thirdparty/grpc-v1.20.0.tar.gz
> export ARROW_GTEST_URL=/home/wesm/arrow-thirdparty/gtest-1.8.1.tar.gz
> export ARROW_LZ4_URL=/home/wesm/arrow-thirdparty/lz4-v1.8.3.tar.gz
> export ARROW_ORC_URL=/home/wesm/arrow-thirdparty/orc-1.5.5.tar.gz
> export ARROW_PROTOBUF_URL=/home/wesm/arrow-thirdparty/protobuf-v3.7.1.tar.gz
> export 
> ARROW_RAPIDJSON_URL=/home/wesm/arrow-thirdparty/rapidjson-2bbd33b33217ff4a73434ebf10cdac41e2ef5e34.tar.gz
> export ARROW_RE2_URL=/home/wesm/arrow-thirdparty/re2-2019-04-01.tar.gz
> {code}
> The 5 dependencies listed after RE2 are not downloaded



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-5214) [C++] Offline dependency downloader misses some libraries

2019-04-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826979#comment-16826979
 ] 

Francois Saint-Jacques edited comment on ARROW-5214 at 4/26/19 1:59 PM:


The script is exiting silently, but with a non-zero error code; I'll fix this. 
The real issue is that the URL for this snappy version (the path changed) does 
not exist anymore.


was (Author: fsaintjacques):
The script is exiting silently, but with a non-zero error code. I'll fix this. 
The real issue is that this snappy version does not exist anymore.

> [C++] Offline dependency downloader misses some libraries
> -
>
> Key: ARROW-5214
> URL: https://issues.apache.org/jira/browse/ARROW-5214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.14.0
>
>
> Not sure yet but maybe this was introduced by 
> https://github.com/apache/arrow/commit/f913d8f0adff71c288a10f6c1b0ad2d1ab3e9e32
> {code}
> $ thirdparty/download_dependencies.sh /home/wesm/arrow-thirdparty
> # Environment variables for offline Arrow build
> export ARROW_BOOST_URL=/home/wesm/arrow-thirdparty/boost-1.67.0.tar.gz
> export ARROW_BROTLI_URL=/home/wesm/arrow-thirdparty/brotli-v1.0.7.tar.gz
> export ARROW_CARES_URL=/home/wesm/arrow-thirdparty/cares-1.15.0.tar.gz
> export 
> ARROW_DOUBLE_CONVERSION_URL=/home/wesm/arrow-thirdparty/double-conversion-v3.1.4.tar.gz
> export 
> ARROW_FLATBUFFERS_URL=/home/wesm/arrow-thirdparty/flatbuffers-v1.10.0.tar.gz
> export 
> ARROW_GBENCHMARK_URL=/home/wesm/arrow-thirdparty/gbenchmark-v1.4.1.tar.gz
> export ARROW_GFLAGS_URL=/home/wesm/arrow-thirdparty/gflags-v2.2.0.tar.gz
> export ARROW_GLOG_URL=/home/wesm/arrow-thirdparty/glog-v0.3.5.tar.gz
> export ARROW_GRPC_URL=/home/wesm/arrow-thirdparty/grpc-v1.20.0.tar.gz
> export ARROW_GTEST_URL=/home/wesm/arrow-thirdparty/gtest-1.8.1.tar.gz
> export ARROW_LZ4_URL=/home/wesm/arrow-thirdparty/lz4-v1.8.3.tar.gz
> export ARROW_ORC_URL=/home/wesm/arrow-thirdparty/orc-1.5.5.tar.gz
> export ARROW_PROTOBUF_URL=/home/wesm/arrow-thirdparty/protobuf-v3.7.1.tar.gz
> export 
> ARROW_RAPIDJSON_URL=/home/wesm/arrow-thirdparty/rapidjson-2bbd33b33217ff4a73434ebf10cdac41e2ef5e34.tar.gz
> export ARROW_RE2_URL=/home/wesm/arrow-thirdparty/re2-2019-04-01.tar.gz
> {code}
> The 5 dependencies listed after RE2 are not downloaded



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5214) [C++] Offline dependency downloader misses some libraries

2019-04-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826979#comment-16826979
 ] 

Francois Saint-Jacques commented on ARROW-5214:
---

The script is exiting silently, but with a non-zero error code. I'll fix this. 
The real issue is that this snappy version does not exist anymore.

> [C++] Offline dependency downloader misses some libraries
> -
>
> Key: ARROW-5214
> URL: https://issues.apache.org/jira/browse/ARROW-5214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.14.0
>
>
> Not sure yet but maybe this was introduced by 
> https://github.com/apache/arrow/commit/f913d8f0adff71c288a10f6c1b0ad2d1ab3e9e32
> {code}
> $ thirdparty/download_dependencies.sh /home/wesm/arrow-thirdparty
> # Environment variables for offline Arrow build
> export ARROW_BOOST_URL=/home/wesm/arrow-thirdparty/boost-1.67.0.tar.gz
> export ARROW_BROTLI_URL=/home/wesm/arrow-thirdparty/brotli-v1.0.7.tar.gz
> export ARROW_CARES_URL=/home/wesm/arrow-thirdparty/cares-1.15.0.tar.gz
> export 
> ARROW_DOUBLE_CONVERSION_URL=/home/wesm/arrow-thirdparty/double-conversion-v3.1.4.tar.gz
> export 
> ARROW_FLATBUFFERS_URL=/home/wesm/arrow-thirdparty/flatbuffers-v1.10.0.tar.gz
> export 
> ARROW_GBENCHMARK_URL=/home/wesm/arrow-thirdparty/gbenchmark-v1.4.1.tar.gz
> export ARROW_GFLAGS_URL=/home/wesm/arrow-thirdparty/gflags-v2.2.0.tar.gz
> export ARROW_GLOG_URL=/home/wesm/arrow-thirdparty/glog-v0.3.5.tar.gz
> export ARROW_GRPC_URL=/home/wesm/arrow-thirdparty/grpc-v1.20.0.tar.gz
> export ARROW_GTEST_URL=/home/wesm/arrow-thirdparty/gtest-1.8.1.tar.gz
> export ARROW_LZ4_URL=/home/wesm/arrow-thirdparty/lz4-v1.8.3.tar.gz
> export ARROW_ORC_URL=/home/wesm/arrow-thirdparty/orc-1.5.5.tar.gz
> export ARROW_PROTOBUF_URL=/home/wesm/arrow-thirdparty/protobuf-v3.7.1.tar.gz
> export 
> ARROW_RAPIDJSON_URL=/home/wesm/arrow-thirdparty/rapidjson-2bbd33b33217ff4a73434ebf10cdac41e2ef5e34.tar.gz
> export ARROW_RE2_URL=/home/wesm/arrow-thirdparty/re2-2019-04-01.tar.gz
> {code}
> The 5 dependencies listed after RE2 are not downloaded



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826968#comment-16826968
 ] 

Francois Saint-Jacques commented on ARROW-5130:
---

You'll have to replicate 
https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/travis.linux.yml

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x in ?? ()
> (gdb) bt
> #0 0x in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 , 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
> #12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at 
> dlopen.c:66
> #13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
> errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 
> , args=0x7ffc6d8bedd0) at dl-error.c:187
> #14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
> , args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
> #15 0x7f529ebf40c1 in __dlopen (file=, mode= out>) at dlopen.c:87
> #16 0x00540859 in _PyImport_GetDynLoadFunc ()
> #17 0x0054024c in _PyImport_LoadDynamicModule ()
> #18 0x005f2bcb in ?? ()
> #19 0x004ca235 in PyEval_EvalFrameEx ()
> #20 0x004ca9c2 in PyEval_EvalFrameEx ()
> #21 0x004c8c39 in PyEval_EvalCodeEx ()
> #22 0x004c84e6 in PyEval_EvalCode ()
> #23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #24 0x004c3272 in ?? ()
> #25 0x004b19e2 in ?? ()
> #26 0x004b13d7 in ?? ()
> #27 0x004b42f6 in ?? ()
> #28 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #29 0x004ccdb3 in PyEval_EvalFrameEx ()
> #30 0x004c8c39 in PyEval_EvalCodeEx ()
> #31 0x004c84e6 in PyEval_EvalCode ()
> #32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #33 0x004c3272 in ?? ()
> #34 0x004b1d3f in ?? ()
> #35 0x004b6b2b in ?? ()
> #36 0x004b0d82 in ?? ()
> #37 0x004b42f6 in ?? ()
> #38 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
> It looks like the code changes that fixed the previous issue were recently 
> removed in 
> [https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-26 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826967#comment-16826967
 ] 

Francois Saint-Jacques commented on ARROW-5130:
---

It's a component called crossbow; the gist of what you need is 
[here|https://github.com/apache/arrow/tree/master/dev/tasks/python-wheels].

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x in ?? ()
> (gdb) bt
> #0 0x in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 , 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
> #12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at 
> dlopen.c:66
> #13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
> errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 
> , args=0x7ffc6d8bedd0) at dl-error.c:187
> #14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
> , args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
> #15 0x7f529ebf40c1 in __dlopen (file=, mode= out>) at dlopen.c:87
> #16 0x00540859 in _PyImport_GetDynLoadFunc ()
> #17 0x0054024c in _PyImport_LoadDynamicModule ()
> #18 0x005f2bcb in ?? ()
> #19 0x004ca235 in PyEval_EvalFrameEx ()
> #20 0x004ca9c2 in PyEval_EvalFrameEx ()
> #21 0x004c8c39 in PyEval_EvalCodeEx ()
> #22 0x004c84e6 in PyEval_EvalCode ()
> #23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #24 0x004c3272 in ?? ()
> #25 0x004b19e2 in ?? ()
> #26 0x004b13d7 in ?? ()
> #27 0x004b42f6 in ?? ()
> #28 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #29 0x004ccdb3 in PyEval_EvalFrameEx ()
> #30 0x004c8c39 in PyEval_EvalCodeEx ()
> #31 0x004c84e6 in PyEval_EvalCode ()
> #32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #33 0x004c3272 in ?? ()
> #34 0x004b1d3f in ?? ()
> #35 0x004b6b2b in ?? ()
> #36 0x004b0d82 in ?? ()
> #37 0x004b42f6 in ?? ()
> #38 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
> It looks like the code changes that fixed the previous issue were recently 
> removed in 
> [https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-26 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826904#comment-16826904
 ] 

Joris Van den Bossche commented on ARROW-5208:
--

To get started, I think the developer docs are the place to look. Specifically, 
the Python docs have a good section on how to set up and build Arrow and 
pyarrow: 
https://arrow.apache.org/docs/developers/python.html#building-on-linux-and-macos

> [Python] Inconsistent resulting type during casting in pa.array() when mask 
> is present
> --
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.14.0
>
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> {code}
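> (A hedged side note added for illustration, not from the original report: 
> passing an explicit {{type}} forces the expected result, at least for the 
> plain integer case; the mixed-value cases are not covered by this sketch.)
> {code:python}
> >>> pa.array([4, None, 4, None], type=pa.int64(),
> ...          mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> {code}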



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-26 Thread Artem KOZHEVNIKOV (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826896#comment-16826896
 ] 

Artem KOZHEVNIKOV commented on ARROW-5208:
--

Yes, absolutely, it would be nice to get involved! Is there any doc that would 
be useful to start with? CI best practices?

> [Python] Inconsistent resulting type during casting in pa.array() when mask 
> is present
> --
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.14.0
>
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5117) [Go] Panic when appending zero slices after initializing a builder

2019-04-26 Thread Sebastien Binet (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastien Binet resolved ARROW-5117.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4131
[https://github.com/apache/arrow/pull/4131]

> [Go] Panic when appending zero slices after initializing a builder
> --
>
> Key: ARROW-5117
> URL: https://issues.apache.org/jira/browse/ARROW-5117
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Alfonso Subiotto
>Assignee: Sebastien Binet
>Priority: Critical
>  Labels: easyfix, newbie, pull-request-available
> Fix For: 0.14.0
>
>   Original Estimate: 1h
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> {code:java}
> array.NewInt8Builder(memory.DefaultAllocator).AppendValues([]int8{}, 
> []bool{}){code}
> results in a panic
> {code:java}
> === RUN TestArrowPanic
> --- FAIL: TestArrowPanic (0.00s)
> panic: runtime error: invalid memory address or nil pointer dereference 
> [recovered]
>  panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 
> pc=0x414f6fd]goroutine 5 [running]:
> testing.tRunner.func1(0xc000492a00)
>  /usr/local/Cellar/go/1.11.5/libexec/src/testing/testing.go:792 +0x387
> panic(0x4cd1fe0, 0x5bb3fb0)
>  /usr/local/Cellar/go/1.11.5/libexec/src/runtime/panic.go:513 +0x1b9
> github.com/cockroachdb/cockroach/vendor/github.com/apache/arrow/go/arrow/memory.(*Buffer).Bytes(...)
>  
> /Users/asubiotto/go/src/github.com/cockroachdb/cockroach/vendor/github.com/apache/arrow/go/arrow/memory/buffer.go:67
> github.com/cockroachdb/cockroach/vendor/github.com/apache/arrow/go/arrow/array.(*builder).unsafeSetValid(0xc000382a80,
>  0x0)
>  
> /Users/asubiotto/go/src/github.com/cockroachdb/cockroach/vendor/github.com/apache/arrow/go/arrow/array/builder.go:184
>  +0x6d
> github.com/cockroachdb/cockroach/vendor/github.com/apache/arrow/go/arrow/array.(*builder).unsafeAppendBoolsToBitmap(0xc000382a80,
>  0xc00040df88, 0x0, 0x0, 0x0)
>  
> /Users/asubiotto/go/src/github.com/cockroachdb/cockroach/vendor/github.com/apache/arrow/go/arrow/array/builder.go:146
>  +0x17a
> github.com/cockroachdb/cockroach/vendor/github.com/apache/arrow/go/arrow/array.(*Int8Builder).AppendValues(0xc000382a80,
>  0xc00040df88, 0x0, 0x0, 0xc00040df88, 0x0, 0x0)
>  
> /Users/asubiotto/go/src/github.com/cockroachdb/cockroach/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:1168
>  +0xcb
> github.com/cockroachdb/cockroach/pkg/util/arrow_test.TestArrowPanic(0xc000492a00)
>  
> /Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/util/arrow/record_batch_test.go:273
>  +0x9a
> testing.tRunner(0xc000492a00, 0x4ec5370)
>  /usr/local/Cellar/go/1.11.5/libexec/src/testing/testing.go:827 +0xbf
> created by testing.(*T).Run
>  /usr/local/Cellar/go/1.11.5/libexec/src/testing/testing.go:878 +0x35c
> Process finished with exit code 1{code}
> due to the underlying null bitmap never being initialized. I believe the 
> expectation is for `Resize` to initialize this bitmap. This never happens 
> because a length of 0 (the `elements` argument in the block below) fails this 
> check:
> {code:java}
> func (b *builder) reserve(elements int, resize func(int)) {
>     if b.length+elements > b.capacity {
>         newCap := bitutil.NextPowerOf2(b.length + elements)
>         resize(newCap)
>     }
> }{code}
> As far as I can tell the arguments to AppendValues are valid. I'd be happy to 
> submit a patch but I can see several ways of fixing this so would prefer 
> someone familiar with the code to take a look and define expectations in this 
> case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-5221) Improve the performance of class SegmentsUtil

2019-04-26 Thread Liya Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan closed ARROW-5221.
---
Resolution: Invalid

> Improve the performance of class SegmentsUtil
> -
>
> Key: ARROW-5221
> URL: https://issues.apache.org/jira/browse/ARROW-5221
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>
> Improve the performance of class SegmentsUtil in two ways:
>  # In method allocateReuseBytes, the generated byte array should be cached 
> for reuse if its size does not exceed MAX_BYTES_LENGTH. However, the array 
> is not cached when bytes.length < length, and this leads to performance 
> overhead:
> {code:java}
> if (bytes == null) {
>   if (length <= MAX_BYTES_LENGTH) {
>     bytes = new byte[MAX_BYTES_LENGTH];
>     BYTES_LOCAL.set(bytes);
>   } else {
>     bytes = new byte[length];
>   }
> } else if (bytes.length < length) {
>   bytes = new byte[length];
> }
> {code}
>  # To evaluate the offset, an integer is ANDed with a mask to clear the low 
> bits and then shifted right. The AND is unnecessary:
> {code:java}
> ((index & BIT_BYTE_POSITION_MASK) >>> 3)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5221) Improve the performance of class SegmentsUtil

2019-04-26 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5221:
---

 Summary: Improve the performance of class SegmentsUtil
 Key: ARROW-5221
 URL: https://issues.apache.org/jira/browse/ARROW-5221
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Liya Fan
Assignee: Liya Fan


Improve the performance of class SegmentsUtil in two ways:
 # In method allocateReuseBytes, the generated byte array should be cached for 
reuse if its size does not exceed MAX_BYTES_LENGTH. However, the array is not 
cached when bytes.length < length, and this leads to performance overhead:
{code:java}
if (bytes == null) {
  if (length <= MAX_BYTES_LENGTH) {
    bytes = new byte[MAX_BYTES_LENGTH];
    BYTES_LOCAL.set(bytes);
  } else {
    bytes = new byte[length];
  }
} else if (bytes.length < length) {
  bytes = new byte[length];
}
{code}
 # To evaluate the offset, an integer is ANDed with a mask to clear the low 
bits and then shifted right. The AND is unnecessary:
{code:java}
((index & BIT_BYTE_POSITION_MASK) >>> 3)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5200) [Java] Provide light-weight arrow APIs

2019-04-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5200:
--
Labels: pull-request-available  (was: )

> [Java] Provide light-weight arrow APIs
> --
>
> Key: ARROW-5200
> URL: https://issues.apache.org/jira/browse/ARROW-5200
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2019-04-23-15-19-34-187.png
>
>
> We are trying to incorporate Apache Arrow into the Apache Flink runtime. We 
> find Arrow an amazing library, which greatly simplifies the support of 
> columnar data formats.
> However, for many scenarios we find the performance unacceptable. Our 
> investigation shows the reason is that there are too many redundant checks 
> and computations in the Arrow APIs.
> For example, the following figure shows that a single call to the 
> Float8Vector.get(int) method (one of the most frequently used APIs in Flink 
> computation) involves 20+ method invocations.
> !image-2019-04-23-15-19-34-187.png!
>  
> There are many other APIs with similar problems. We believe that these checks 
> ensure the integrity of the program; however, they also impact performance 
> severely. In our evaluation, performance may be two or three orders of 
> magnitude slower compared to accessing data on heap memory. We think that, at 
> least for some scenarios, we can give the responsibility for integrity checks 
> to application owners. If they can be sure all the checks would pass, we can 
> provide light-weight APIs, and the inherent high performance, to them.
> In the light-weight APIs, we only provide minimal checks, or avoid checks 
> altogether. The application owner can still develop and debug their code 
> using the original heavy-weight APIs. Once all bugs have been fixed, they can 
> switch to the light-weight APIs in their products and enjoy the resulting 
> high performance.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-04-26 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826772#comment-16826772
 ] 

Joris Van den Bossche commented on ARROW-3861:
--

[~cthi] note that the way you create and pass the schema (with "new" columns 
and the index column specified) now raises an error. I opened ARROW-5220 for 
that.  
What was your intent in adding "new_column" to the schema? That it would be 
created in the actual table?

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: parquet, python
> Fix For: 0.14.0
>
>
> I just noticed that no matter which columns are specified on load of a 
> dataset, the partition column is always returned. This might lead to strange 
> behaviour, as the resulting dataframe has more than the expected columns:
> {code}
> import dask as da
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import os
> import numpy as np
> import shutil
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
> shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name='DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>('partition_column', pa.int32()),
>('arrays', pa.list_(pa.int32())),
>('strings', pa.string()),
>('new_column', pa.string())])
> table = pa.Table.from_pandas(df, schema=my_schema)
> pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, 
> partition_cols=['partition_column'])
> df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 
> 'strings']).to_pandas()
> # pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], 
> engine='pyarrow')
> df_pq
> {code}
> df_pq has column `partition_column`
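> (A workaround sketch added for illustration, not from the original report: 
> until the read behaviour changes, the unexpected column can simply be dropped 
> afterwards.)
> {code:python}
> # Drop the partition column that was not requested.
> df_pq = df_pq.drop(columns=['partition_column'])
> {code}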



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-04-26 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5220:
-
Description: 
The {{Table.from_pandas}} method allows specifying a schema ("This can be used 
to indicate the type of columns if we cannot infer it automatically.").

But if you also want to specify the type of the index, you get an error:

{code:python}
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')

my_schema = pa.schema([('index', pa.string()),
   ('a', pa.int64()),
   ('b', pa.float64()),
  ])

table = pa.Table.from_pandas(df, schema=my_schema)
{code}

gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
from the schema in the dataframe, and thus does not find column 'index').

This also has the consequence that re-using the schema does not work: {{table1 
= pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
schema=table1.schema)}}

Extra note: unknown columns in general also give this error (columns specified 
in the schema that are not in the dataframe).

At least in pyarrow 0.11, this did not give an error (e.g. I noticed this from 
the example code in ARROW-3861). So before, unknown columns in the specified 
schema were ignored, while now they raise an error. Was this a conscious 
change?  
(So before, specifying the index in the schema also "worked" in the sense that 
it didn't raise an error, but it was ignored, so it didn't actually do what you 
would expect.)

Questions:

- I think we should support specifying the index in the passed {{schema}}, so 
that the example above works (although this might be complicated with a 
RangeIndex, which is not serialized any more).
- But what to do in general with additional columns in the schema that are not 
in the DataFrame? Are we fine with keeping the error as it is now (the error 
message could be improved then)? Or do we want to ignore them again? (Or it 
could actually add them as all-null columns to the table.)

  was:
The {{Table.from_pandas}} method allows you to specify a schema ("This can be 
used to indicate the type of columns if we cannot infer it automatically.").

But, if you also want to specify the type of the index, you get an error:

{code:python}
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')

my_schema = pa.schema([('index', pa.string()),
   ('a', pa.int64()),
   ('b', pa.float64()),
  ])

table = pa.Table.from_pandas(df, schema=my_schema)
{code}

gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
from the schema in the dataframe, and thus does not find column 'index').

This also has the consequence that re-using the schema does not work: {{table1 
= pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
schema=table1.schema)}}

Extra note: unknown columns in general also give this error (columns specified 
in the schema that are not in the dataframe).

At least in pyarrow 0.11, this did not give an error (e.g. noticed this from the 
code example in ARROW-3861). So before, unknown columns in the specified schema 
were ignored, while now they raise an error. Was this a conscious change?
(So before, specifying the index in the schema also "worked" in the sense that 
it didn't raise an error, but it was ignored as well, so it didn't actually do 
what you would expect.)

Questions:

- I think we should support specifying the index in the passed {{schema}}, so 
that the example above works (although this might be complicated with a 
RangeIndex, which is no longer serialized).
- But what should we do in general with additional columns in the schema that 
are not in the DataFrame? Are we fine with keeping the current behaviour of 
raising an error (the error message could then be improved)? Or do we want to 
ignore them again? (Or they could even be added to the table as all-null 
columns.)


> [Python] index / unknown columns in specified schema in Table.from_pandas
> -
>
> Key: ARROW-5220
> URL: https://issues.apache.org/jira/browse/ARROW-5220
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> The {{Table.from_pandas}} method allows you to specify a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But, if you also want to specify the type of the index, you get an error:
> {code:python}
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>    

[jira] [Created] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-04-26 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5220:


 Summary: [Python] index / unknown columns in specified schema in 
Table.from_pandas
 Key: ARROW-5220
 URL: https://issues.apache.org/jira/browse/ARROW-5220
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


The {{Table.from_pandas}} method allows you to specify a schema ("This can be 
used to indicate the type of columns if we cannot infer it automatically.").

But, if you also want to specify the type of the index, you get an error:

{code:python}
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')

my_schema = pa.schema([('index', pa.string()),
   ('a', pa.int64()),
   ('b', pa.float64()),
  ])

table = pa.Table.from_pandas(df, schema=my_schema)
{code}

gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
from the schema in the dataframe, and thus does not find column 'index').

This also has the consequence that re-using the schema does not work: {{table1 
= pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
schema=table1.schema)}}

Extra note: unknown columns in general also give this error (columns specified 
in the schema that are not in the dataframe).

At least in pyarrow 0.11, this did not give an error (e.g. noticed this from the 
code example in ARROW-3861). So before, unknown columns in the specified schema 
were ignored, while now they raise an error. Was this a conscious change?
(So before, specifying the index in the schema also "worked" in the sense that 
it didn't raise an error, but it was ignored as well, so it didn't actually do 
what you would expect.)

Questions:

- I think we should support specifying the index in the passed {{schema}}, so 
that the example above works (although this might be complicated with a 
RangeIndex, which is no longer serialized).
- But what should we do in general with additional columns in the schema that 
are not in the DataFrame? Are we fine with keeping the current behaviour of 
raising an error (the error message could then be improved)? Or do we want to 
ignore them again? (Or they could even be added to the table as all-null 
columns.)
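For what it's worth, a possible workaround under the current behaviour (a hedged 
sketch, not a proposed fix) is to turn the index into a regular column before 
conversion, so that every name in the schema resolves to an actual dataframe 
column:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')

my_schema = pa.schema([('index', pa.string()),
                       ('a', pa.int64()),
                       ('b', pa.float64())])

# reset_index() makes 'index' an ordinary column, so the schema lookup succeeds;
# preserve_index=False keeps from_pandas from appending the new trivial index.
table = pa.Table.from_pandas(df.reset_index(), schema=my_schema,
                             preserve_index=False)
{code}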



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-04-26 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-3861:
-
Labels: parquet python  (was: parquet pyarrow python)

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: parquet, python
> Fix For: 0.14.0
>
>
> I just noticed that no matter which columns are specified on load of a 
> dataset, the partition column is always returned. This might lead to strange 
> behaviour, as the resulting dataframe has more than the expected columns:
> {code}
> import dask as da
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import os
> import numpy as np
> import shutil
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name='DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>('partition_column', pa.int32()),
>('arrays', pa.list_(pa.int32())),
>('strings', pa.string()),
>('new_column', pa.string())])
> table = pa.Table.from_pandas(df, schema=my_schema)
> pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, 
> partition_cols=['partition_column'])
> df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 
> 'strings']).to_pandas()
> # pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], 
> engine='pyarrow')
> df_pq
> {code}
> df_pq has column `partition_column`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-04-26 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-3861:
-
Labels: parquet pyarrow python  (was: pyarrow python)

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: parquet, pyarrow, python
> Fix For: 0.14.0
>
>
> I just noticed that no matter which columns are specified on load of a 
> dataset, the partition column is always returned. This might lead to strange 
> behaviour, as the resulting dataframe has more than the expected columns:
> {code}
> import dask as da
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import os
> import numpy as np
> import shutil
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name='DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>('partition_column', pa.int32()),
>('arrays', pa.list_(pa.int32())),
>('strings', pa.string()),
>('new_column', pa.string())])
> table = pa.Table.from_pandas(df, schema=my_schema)
> pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, 
> partition_cols=['partition_column'])
> df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 
> 'strings']).to_pandas()
> # pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], 
> engine='pyarrow')
> df_pq
> {code}
> df_pq has column `partition_column`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-26 Thread Alexander Sergeev (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826698#comment-16826698
 ] 

Alexander Sergeev commented on ARROW-5130:
--

[~fsaintjacques] is the build process for the wheels that end up on PyPI 
documented somewhere, so that I could reproduce the issue locally with 
containers and propagate the fix from 
[https://github.com/apache/arrow/pull/2096]?

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x in ?? ()
> (gdb) bt
> #0 0x in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=<optimized out>, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=<optimized out>) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 , 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=<optimized out>,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
> #12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at 
> dlopen.c:66
> #13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
> errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 
> , args=0x7ffc6d8bedd0) at dl-error.c:187
> #14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
> , args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
> #15 0x7f529ebf40c1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
> #16 0x00540859 in _PyImport_GetDynLoadFunc ()
> #17 0x0054024c in _PyImport_LoadDynamicModule ()
> #18 0x005f2bcb in ?? ()
> #19 0x004ca235 in PyEval_EvalFrameEx ()
> #20 0x004ca9c2 in PyEval_EvalFrameEx ()
> #21 0x004c8c39 in PyEval_EvalCodeEx ()
> #22 0x004c84e6 in PyEval_EvalCode ()
> #23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #24 0x004c3272 in ?? ()
> #25 0x004b19e2 in ?? ()
> #26 0x004b13d7 in ?? ()
> #27 0x004b42f6 in ?? ()
> #28 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #29 0x004ccdb3 in PyEval_EvalFrameEx ()
> #30 0x004c8c39 in PyEval_EvalCodeEx ()
> #31 0x004c84e6 in PyEval_EvalCode ()
> #32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #33 0x004c3272 in ?? ()
> #34 0x004b1d3f in ?? ()
> #35 0x004b6b2b in ?? ()
> #36 0x004b0d82 in ?? ()
> #37 0x004b42f6 in ?? ()
> #38 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
> It looks like the code changes that fixed the previous issue were recently 
> removed in 
> [https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].
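A commonly suggested interim mitigation for this class of crashes (hedged: 
whether it helps depends on how the affected wheels were linked, and it is not 
a substitute for restoring the removed workaround) is to import TensorFlow 
before pyarrow:

{code:python}
# Mitigation sketch, not a fix: reverse the import order from the report so
# TensorFlow's shared libraries are loaded first.
import tensorflow as tf
import pyarrow as pa

print(tf.__version__, pa.__version__)
{code}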



--
This message was sent by