[jira] [Commented] (ARROW-4717) [C#] Consider exposing ValueTask instead of Task

2019-04-24 Thread Mani Gandham (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825532#comment-16825532
 ] 

Mani Gandham commented on ARROW-4717:
-

[~jthelin]

 Since this is a performance-focused project, using ValueTask seems like the 
right call.

.NET Core 2.0 is already [end of life as of October 
2018|https://dotnet.microsoft.com/platform/support/policy/dotnet-core] and .NET 
Core 2.1 is the current LTS release.

> [C#] Consider exposing ValueTask instead of Task
> 
>
> Key: ARROW-4717
> URL: https://issues.apache.org/jira/browse/ARROW-4717
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Eric Erhardt
>Assignee: Eric Erhardt
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/arrow/pull/3736#pullrequestreview-207169204] 
> for the discussion and 
> [https://devblogs.microsoft.com/dotnet/understanding-the-whys-whats-and-whens-of-valuetask/]
>  for the reasoning.
> Using `Task` in a public API requires that a new Task instance be allocated 
> on every call. When returning synchronously, using ValueTask allows the 
> method to avoid that allocation.
> In order to do this, we will need to take a new dependency on the 
> {{System.Threading.Tasks.Extensions}} NuGet package.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5210) [Python] editable install (pip install -e .) is failing

2019-04-24 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825170#comment-16825170
 ] 

Joris Van den Bossche commented on ARROW-5210:
--

The reason it is currently failing is that we don't list numpy as a build 
requirement (neither in {{setup_requires}} nor in {{pyproject.toml}}). 

This also seems to indicate that the current {{pyproject.toml}} is actually not 
tested (because building a wheel in an isolated environment based on the build 
dependencies specified in that file should fail with a missing numpy).

Patch by [~pitrou]:

 
{code:none}
diff --git a/python/pyproject.toml b/python/pyproject.toml
index 712647e4f..a6c51ec20 100644
--- a/python/pyproject.toml
+++ b/python/pyproject.toml
@@ -16,4 +16,4 @@
 # under the License.
 
 [build-system]
-requires = ["setuptools", "wheel", "setuptools_scm", "cython >= 0.29"]
+requires = ["setuptools", "wheel", "setuptools_scm", "cython >= 0.29", "numpy >= 1.14"]
diff --git a/python/setup.py b/python/setup.py
index 907524a60..63014a80a 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -542,19 +542,20 @@ class BinaryDistribution(Distribution):
         return True
 
 
+numpy_requires = 'numpy >= 1.14'
+
 install_requires = (
-    'numpy >= 1.14',
+    numpy_requires,
     'six >= 1.0.0',
     'futures; python_version < "3.2"',
     'enum34 >= 1.1.6; python_version < "3.4"',
 )
 
+setup_requires = ['setuptools_scm', 'cython >= 0.29', numpy_requires]
 
 # Only include pytest-runner in setup_requires if we're invoking tests
 if {'pytest', 'test', 'ptr'}.intersection(sys.argv):
-    setup_requires = ['pytest-runner']
-else:
-    setup_requires = []
+    setup_requires.append('pytest-runner')
 
 
 setup(
@@ -581,7 +582,7 @@ setup(
         'write_to': os.path.join(scm_version_write_to_prefix,
                                  'pyarrow/_generated_version.py')
     },
-    setup_requires=['setuptools_scm', 'cython >= 0.29'] + setup_requires,
+    setup_requires=setup_requires,
     install_requires=install_requires,
     tests_require=['pytest', 'pandas', 'hypothesis',
                    'pathlib2; python_version < "3.4"'],{code}
 

With that patch, one still needs {{pip install -e . --no-use-pep517}} (for the 
latest pip 19.1 release) to tell pip that we _do_ want to do an editable 
install. 

But I would actually argue that even if the above is fixed, doing {{pip install 
-e . --no-use-pep517 --no-build-isolation}} is better: when doing an editable 
install, you don't need pip's build isolation feature, you just want to build 
pyarrow against your existing development environment.
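For context, the cmake failure in the report below boils down to FindNumPy.cmake 
running a probe roughly like the following in the build interpreter (a 
paraphrase, not the module's exact code), which raises ModuleNotFoundError 
inside pip's isolated build environment:

{code:python}
# FindNumPy.cmake asks the Python used for the build where numpy's headers
# live; without numpy in the (isolated) build environment this import fails,
# producing the "NumPy import failure" CMake error shown in the report.
import numpy
print(numpy.get_include())
{code}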

> [Python] editable install (pip install -e .) is failing 
> 
>
> Key: ARROW-5210
> URL: https://issues.apache.org/jira/browse/ARROW-5210
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> Following the python development documentation on building arrow and pyarrow 
> ([https://arrow.apache.org/docs/developers/python.html#build-and-test]), 
> building pyarrow inplace with {{python setup.py build_ext --inplace}} works 
> fine.
>  
> But if you want to also install this inplace version in the current python 
> environment (editable install / development install) using pip ({{pip install 
> -e .}}), this fails during the {{build_ext}} / cmake phase:
> {code:none}
>  
> -- Looking for python3.7m
>     -- Found Python lib 
> /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so
>     CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
>     File "<string>", line 1, in <module>
>   ModuleNotFoundError: No module named 'numpy'
>     Call Stack (most recent call first):
>   CMakeLists.txt:186 (find_package)
>     -- Configuring incomplete, errors occurred!
>     See also 
> "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
>     See also 
> "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeError.log".
>     error: command 'cmake' failed with exit status 1
> Cleaning up...
> {code}
>  
> Alternatively, doing {{python setup.py develop}} to achieve the same still 
> works.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4935) [C++] Errors from jemalloc when building pyarrow from source on OSX and Debian

2019-04-24 Thread Ian Mateus Vieira Manor (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825517#comment-16825517
 ] 

Ian Mateus Vieira Manor commented on ARROW-4935:


Solved the jemalloc problem on my machine by installing macOS SDK headers.
{code:java}
cd /Library/Developer/CommandLineTools/Packages/
open macOS_SDK_headers_for_macOS_10.14.pkg{code}

> [C++] Errors from jemalloc when building pyarrow from source on OSX and Debian
> --
>
> Key: ARROW-4935
> URL: https://issues.apache.org/jira/browse/ARROW-4935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1
> Environment: OSX, Debian, Python==3.6.7
>Reporter: Gregory Hayes
>Priority: Critical
>  Labels: build, newbie
>
> My attempts to build pyarrow from source are failing. I've set up the conda 
> environment using the instructions provided in the Develop instructions, and 
> have tried this on both Debian and OSX. When I run CMake in debug mode on 
> OSX, the output is:
> {code:java}
> -- Building using CMake version: 3.14.0
> -- Arrow version: 0.13.0 (full: '0.13.0-SNAPSHOT')
> -- clang-tidy not found
> -- clang-format not found
> -- infer found at /usr/local/bin/infer
> -- Using ccache: /usr/local/bin/ccache
> -- Found cpplint executable at 
> /Users/Greg/documents/repos/arrow/cpp/build-support/cpplint.py
> -- Compiler command: env LANG=C 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>  -v
> -- Compiler version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
> Target: x86_64-apple-darwin18.2.0
> Thread model: posix
> InstalledDir: 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
> -- Compiler id: AppleClang
> Selected compiler clang 4.1.0svn
> -- Arrow build warning level: CHECKIN
> Configured for DEBUG build (set with cmake 
> -DCMAKE_BUILD_TYPE={release,debug,...})
> -- Build Type: DEBUG
> -- BOOST_VERSION: 1.67.0
> -- BROTLI_VERSION: v0.6.0
> -- CARES_VERSION: 1.15.0
> -- DOUBLE_CONVERSION_VERSION: v3.1.1
> -- FLATBUFFERS_VERSION: v1.10.0
> -- GBENCHMARK_VERSION: v1.4.1
> -- GFLAGS_VERSION: v2.2.0
> -- GLOG_VERSION: v0.3.5
> -- GRPC_VERSION: v1.18.0
> -- GTEST_VERSION: 1.8.1
> -- JEMALLOC_VERSION: 17c897976c60b0e6e4f4a365c751027244dada7a
> -- LZ4_VERSION: v1.8.3
> -- ORC_VERSION: 1.5.4
> -- PROTOBUF_VERSION: v3.6.1
> -- RAPIDJSON_VERSION: v1.1.0
> -- RE2_VERSION: 2018-10-01
> -- SNAPPY_VERSION: 1.1.3
> -- THRIFT_VERSION: 0.11.0
> -- ZLIB_VERSION: 1.2.8
> -- ZSTD_VERSION: v1.3.7
> -- Boost version: 1.68.0
> -- Found the following Boost libraries:
> --   regex
> --   system
> --   filesystem
> -- Boost include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Boost libraries: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_system_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib
> Added shared library dependency boost_filesystem_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_regex_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib
> Added static library dependency double-conversion_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- double-conversion include dir: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- double-conversion static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- GFLAGS_HOME: /Users/Greg/anaconda3/envs/pyarrow-dev
> -- GFlags include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- GFlags static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> Added static library dependency gflags_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> -- RapidJSON include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Found the Flatbuffers library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libflatbuffers.a
> -- Flatbuffers include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Flatbuffers compiler: /Users/Greg/anaconda3/envs/pyarrow-dev/bin/flatc
> Added static library dependency jemalloc_static: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a
> Added shared library dependency jemalloc_shared: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc.dylib
> -- Found hdfs.h at: 
> /Users/Greg/documents/repos/arrow/cpp/thirdparty/hadoop/include/hdfs.h
> -- Found the ZLIB shared library: 
> 

[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-24 Thread Alexander Sergeev (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825554#comment-16825554
 ] 

Alexander Sergeev commented on ARROW-5130:
--

One workaround we found is to LD_PRELOAD 
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.
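A minimal sketch of applying that workaround from a launcher script (the 
wrapper and its use of {{subprocess}} are illustrative, not part of pyarrow):

{code:python}
import os
import subprocess

# Start the real workload with the system libstdc++ preloaded so that its
# symbols take precedence over any copies bundled with the wheels.
env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libstdc++.so.6"
subprocess.check_call(["python", "-c", "import pyarrow, tensorflow"], env=env)
{code}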

 

Wes, is there a reason PyArrow re-exports a bunch of C++ std library symbols?

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x0000000000000000 in ?? ()
> (gdb) bt
> #0 0x0000000000000000 in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once<void (&)()>(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=<optimized out>, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=<optimized out>) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 <dl_open_worker>, 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=<optimized out>,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
> #12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at 
> dlopen.c:66
> #13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
> errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 
> <dlopen_doit>, args=0x7ffc6d8bedd0) at dl-error.c:187
> #14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
> <dlopen_doit>, args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
> #15 0x7f529ebf40c1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
> #16 0x00540859 in _PyImport_GetDynLoadFunc ()
> #17 0x0054024c in _PyImport_LoadDynamicModule ()
> #18 0x005f2bcb in ?? ()
> #19 0x004ca235 in PyEval_EvalFrameEx ()
> #20 0x004ca9c2 in PyEval_EvalFrameEx ()
> #21 0x004c8c39 in PyEval_EvalCodeEx ()
> #22 0x004c84e6 in PyEval_EvalCode ()
> #23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #24 0x004c3272 in ?? ()
> #25 0x004b19e2 in ?? ()
> #26 0x004b13d7 in ?? ()
> #27 0x004b42f6 in ?? ()
> #28 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #29 0x004ccdb3 in PyEval_EvalFrameEx ()
> #30 0x004c8c39 in PyEval_EvalCodeEx ()
> #31 0x004c84e6 in PyEval_EvalCode ()
> #32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #33 0x004c3272 in ?? ()
> #34 0x004b1d3f in ?? ()
> #35 0x004b6b2b in ?? ()
> #36 0x004b0d82 in ?? ()
> #37 0x004b42f6 in ?? ()
> #38 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
> It looks like the code changes that fixed the previous issue were recently 
> removed in 
> [https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4139) [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is set

2019-04-24 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4139:
--
Labels: parquet pull-request-available python  (was: parquet python)

> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is 
> set
> ---
>
> Key: ARROW-4139
> URL: https://issues.apache.org/jira/browse/ARROW-4139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Matthew Rocklin
>Priority: Minor
>  Labels: parquet, pull-request-available, python
> Fix For: 0.14.0
>
>
> When writing Pandas data to Parquet format and reading it back again I find 
> that the statistics of text columns are stored as byte arrays rather than as 
> unicode text. 
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding 
> of how best to manage statistics.  (I'd be quite happy to learn that it was 
> the latter).
> Here is a minimal example
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
> import pyarrow.parquet as pq
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> rg = piece.row_group(0)
> md = piece.get_metadata(pq.ParquetFile)
> rg = md.row_group(0)
> c = rg.column(0)
> >>> c
> 
>   file_offset: 63
>   file_path: 
>   physical_type: BYTE_ARRAY
>   num_values: 1
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
> 
>   has_min_max: True
>   min: b'a'
>   max: b'a'
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 4
>   data_page_offset: 25
>   total_compressed_size: 59
>   total_uncompressed_size: 55
> >>> type(c.statistics.min)
> bytes
> {code}
> My guess is that we would want to store a logical type in the statistics like 
> UNICODE, though I don't have enough experience with Parquet data types to 
> know if this is a good idea or possible.
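As a stopgap, the cast can be done by hand; a small sketch using the 0.13-era 
metadata API from the session above (the manual {{decode}} is the workaround, 
not current pyarrow behaviour):

{code:python}
import pandas as pd
import pyarrow.parquet as pq

pd.DataFrame({'x': ['a']}).to_parquet('df.parquet')
stats = pq.ParquetFile('df.parquet').metadata.row_group(0).column(0).statistics
# min/max come back as raw bytes; when the column's ConvertedType is UTF8
# they can safely be decoded to unicode, which is what this issue proposes
# pyarrow do automatically.
print(stats.min.decode('utf-8'), stats.max.decode('utf-8'))  # a a
{code}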



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-24 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825562#comment-16825562
 ] 

Wes McKinney commented on ARROW-5130:
-

We aren't doing so on purpose

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x0000000000000000 in ?? ()
> (gdb) bt
> #0 0x0000000000000000 in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once<void (&)()>(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=<optimized out>, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=<optimized out>) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 <dl_open_worker>, 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=<optimized out>,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
> #12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at 
> dlopen.c:66
> #13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
> errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 
> <dlopen_doit>, args=0x7ffc6d8bedd0) at dl-error.c:187
> #14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
> <dlopen_doit>, args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
> #15 0x7f529ebf40c1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
> #16 0x00540859 in _PyImport_GetDynLoadFunc ()
> #17 0x0054024c in _PyImport_LoadDynamicModule ()
> #18 0x005f2bcb in ?? ()
> #19 0x004ca235 in PyEval_EvalFrameEx ()
> #20 0x004ca9c2 in PyEval_EvalFrameEx ()
> #21 0x004c8c39 in PyEval_EvalCodeEx ()
> #22 0x004c84e6 in PyEval_EvalCode ()
> #23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #24 0x004c3272 in ?? ()
> #25 0x004b19e2 in ?? ()
> #26 0x004b13d7 in ?? ()
> #27 0x004b42f6 in ?? ()
> #28 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #29 0x004ccdb3 in PyEval_EvalFrameEx ()
> #30 0x004c8c39 in PyEval_EvalCodeEx ()
> #31 0x004c84e6 in PyEval_EvalCode ()
> #32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
> #33 0x004c3272 in ?? ()
> #34 0x004b1d3f in ?? ()
> #35 0x004b6b2b in ?? ()
> #36 0x004b0d82 in ?? ()
> #37 0x004b42f6 in ?? ()
> #38 0x004d1aab in PyEval_CallObjectWithKeywords ()
> #39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
> It looks like the code changes that fixed the previous issue were recently 
> removed in 
> [https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5186) [Plasma] Crash on deleting CUDA memory

2019-04-24 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5186:
---

Assignee: shengjun.li

> [Plasma] Crash on deleting CUDA memory
> --
>
> Key: ARROW-5186
> URL: https://issues.apache.org/jira/browse/ARROW-5186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: shengjun.li
>Assignee: shengjun.li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> cpp/CMakeLists.txt
>   option(ARROW_CUDA "Build the Arrow CUDA extensions (requires CUDA toolkit)" 
> ON)
>   option(ARROW_PLASMA "Build the plasma object store along with Arrow" ON)
> [sample sequence]
> (1) call PlasmaClient::Create(id_object, data_size, 0, 0, &data, 1) // where 
> device_num != 0
> (2) call PlasmaClient::Seal(id_object)
> (3) call PlasmaClient::Release(id_object)
> (4) call PlasmaClient::Delete(id_object) // server crash!
> *** Aborted at 1555645923 (unix time) try "date -d @1555645923" if you are 
> using GNU date ***
> PC: @ 0x7f65bcfa1428 gsignal
> *** SIGABRT (@0x3e86d67) received by PID 28007 (TID 0x7f65bf225740) from 
> PID 28007; stack trace: ***
>     @ 0x7f65bd347390 (unknown)
>     @ 0x7f65bcfa1428 gsignal
>     @ 0x7f65bcfa302a abort
>     @   0x4a56cd dlfree
>     @   0x4b4bc2 plasma::PlasmaAllocator::Free()
>     @   0x4b7da3 plasma::PlasmaStore::EraseFromObjectTable()
>     @   0x4b87d2 plasma::PlasmaStore::DeleteObject()
>     @   0x4bb3d2 plasma::PlasmaStore::ProcessMessage()
>     @   0x4b9195 _ZZN6plasma11PlasmaStore13ConnectClientEiENKUliE_clEi
>     @   0x4bd752 
> _ZNSt17_Function_handlerIFviEZN6plasma11PlasmaStore13ConnectClientEiEUliE_E9_M_invokeERKSt9_Any_dataOi
>     @   0x4ab998 std::function<>::operator()()
>     @   0x4aaea7 plasma::EventLoop::FileEventCallback()
>     @   0x4dbd8f aeProcessEvents
>     @   0x4dbf50 aeMain
>     @   0x4ab19b plasma::EventLoop::Start()
>     @   0x4bfc93 plasma::PlasmaStoreRunner::Start()
>     @   0x4bc34d plasma::StartServer()
>     @   0x4bcfbd main
>     @ 0x7f65bcf8c830 __libc_start_main
>     @   0x49e939 _start
>     @    0x0 (unknown)
> Aborted (core dumped)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5186) [Plasma] Crash on deleting CUDA memory

2019-04-24 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5186:

Summary: [Plasma] Crash on deleting CUDA memory  (was: [plasma] carsh on 
delete gpu memory)

> [Plasma] Crash on deleting CUDA memory
> --
>
> Key: ARROW-5186
> URL: https://issues.apache.org/jira/browse/ARROW-5186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: shengjun.li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> cpp/CMakeLists.txt
>   option(ARROW_CUDA "Build the Arrow CUDA extensions (requires CUDA toolkit)" 
> ON)
>   option(ARROW_PLASMA "Build the plasma object store along with Arrow" ON)
> [sample sequence]
> (1) call PlasmaClient::Create(id_object, data_size, 0, 0, &data, 1) // where 
> device_num != 0
> (2) call PlasmaClient::Seal(id_object)
> (3) call PlasmaClient::Release(id_object)
> (4) call PlasmaClient::Delete(id_object) // server crash!
> *** Aborted at 1555645923 (unix time) try "date -d @1555645923" if you are 
> using GNU date ***
> PC: @ 0x7f65bcfa1428 gsignal
> *** SIGABRT (@0x3e86d67) received by PID 28007 (TID 0x7f65bf225740) from 
> PID 28007; stack trace: ***
>     @ 0x7f65bd347390 (unknown)
>     @ 0x7f65bcfa1428 gsignal
>     @ 0x7f65bcfa302a abort
>     @   0x4a56cd dlfree
>     @   0x4b4bc2 plasma::PlasmaAllocator::Free()
>     @   0x4b7da3 plasma::PlasmaStore::EraseFromObjectTable()
>     @   0x4b87d2 plasma::PlasmaStore::DeleteObject()
>     @   0x4bb3d2 plasma::PlasmaStore::ProcessMessage()
>     @   0x4b9195 _ZZN6plasma11PlasmaStore13ConnectClientEiENKUliE_clEi
>     @   0x4bd752 
> _ZNSt17_Function_handlerIFviEZN6plasma11PlasmaStore13ConnectClientEiEUliE_E9_M_invokeERKSt9_Any_dataOi
>     @   0x4ab998 std::function<>::operator()()
>     @   0x4aaea7 plasma::EventLoop::FileEventCallback()
>     @   0x4dbd8f aeProcessEvents
>     @   0x4dbf50 aeMain
>     @   0x4ab19b plasma::EventLoop::Start()
>     @   0x4bfc93 plasma::PlasmaStoreRunner::Start()
>     @   0x4bc34d plasma::StartServer()
>     @   0x4bcfbd main
>     @ 0x7f65bcf8c830 __libc_start_main
>     @   0x49e939 _start
>     @    0x0 (unknown)
> Aborted (core dumped)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3873) [C++] Build shared libraries consistently with -fvisibility=hidden

2019-04-24 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825571#comment-16825571
 ] 

Wes McKinney commented on ARROW-3873:
-

I just closed https://github.com/apache/arrow/pull/2437 and will plan to return 
to this once the Parquet symbol visibility issue is dealt with.

> [C++] Build shared libraries consistently with -fvisibility=hidden
> --
>
> Key: ARROW-3873
> URL: https://issues.apache.org/jira/browse/ARROW-3873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/2437



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5130) [Python] Segfault when importing TensorFlow after Pyarrow

2019-04-24 Thread Alexander Sergeev (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825601#comment-16825601
 ] 

Alexander Sergeev commented on ARROW-5130:
--

Wes, would you take a PR that cleans these things up?

 
{code:java}
# for f in $(ls -1 /usr/local/lib/python2.7/dist-packages/pyarrow/*.so*); do 
echo $f; nm -D $f | c++filt | grep std::_Hash_bytes; done
/usr/local/lib/python2.7/dist-packages/pyarrow/_csv.so
U std::_Hash_bytes(void const*, unsigned long, unsigned long)
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow_boost_filesystem.so
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow_boost_filesystem.so.1.66.0
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow_boost_regex.so
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow_boost_regex.so.1.66.0
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow_boost_system.so
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow_boost_system.so.1.66.0
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow_python.so
000e2250 T std::_Hash_bytes(void const*, unsigned long, unsigned long)
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow_python.so.13
000e2250 T std::_Hash_bytes(void const*, unsigned long, unsigned long)
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow.so
/usr/local/lib/python2.7/dist-packages/pyarrow/libarrow.so.13
/usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so
001ce380 T std::_Hash_bytes(void const*, unsigned long, unsigned long)
/usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.13
001ce380 T std::_Hash_bytes(void const*, unsigned long, unsigned long)
/usr/local/lib/python2.7/dist-packages/pyarrow/libplasma.so
/usr/local/lib/python2.7/dist-packages/pyarrow/libplasma.so.13
/usr/local/lib/python2.7/dist-packages/pyarrow/lib.so
U std::_Hash_bytes(void const*, unsigned long, unsigned long)
/usr/local/lib/python2.7/dist-packages/pyarrow/libz-7f57503f.so.1.2.11
/usr/local/lib/python2.7/dist-packages/pyarrow/_orc.so
/usr/local/lib/python2.7/dist-packages/pyarrow/_parquet.so
U std::_Hash_bytes(void const*, unsigned long, unsigned long)
/usr/local/lib/python2.7/dist-packages/pyarrow/_plasma.so
{code}
 

> [Python] Segfault when importing TensorFlow after Pyarrow
> -
>
> Key: ARROW-5130
> URL: https://issues.apache.org/jira/browse/ARROW-5130
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Travis Addair
>Priority: Major
>
> This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
> was fixed in v0.10.0.
> When we import TensorFlow after Pyarrow in Linux Debian Jessie, we get a 
> segfault.  To reproduce:
> {code:java}
> import pyarrow 
> import tensorflow{code}
> Here's the backtrace from gdb:
> {code:java}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x0000000000000000 in ?? ()
> (gdb) bt
> #0 0x0000000000000000 in ?? ()
> #1 0x7f529ee04410 in pthread_once () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
> #2 0x7f5229a74efa in void std::call_once<void (&)()>(std::once_flag&, 
> void (&)()) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7f5229a74f3e in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7f522978b561 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
> from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
> #6 0x7f529f224bea in call_init (l=<optimized out>, argc=argc@entry=9, 
> argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
> #7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, 
> argc=9, l=<optimized out>) at dl-init.c:36
> #8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
> env=0x294c0c0) at dl-init.c:126
> #9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
> dl-open.c:577
> #10 0x7f529f224aa4 in _dl_catch_error 
> (objname=objname@entry=0x7ffc6d8beba8, 
> errstring=errstring@entry=0x7ffc6d8bebb0, 
> mallocedp=mallocedp@entry=0x7ffc6d8beba7,
> operate=operate@entry=0x7f529f228b60 <dl_open_worker>, 
> args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
> #11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
> "/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, caller_dlopen=<optimized out>,
> nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661

[jira] [Comment Edited] (ARROW-5210) [Python] editable install (pip install -e .) is failing

2019-04-24 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825155#comment-16825155
 ] 

Joris Van den Bossche edited comment on ARROW-5210 at 4/24/19 1:42 PM:
---

With pip 19.1 (released yesterday), one needs to do {{pip install -e . 
--no-use-pep517 --no-build-isolation}} to get it running with our current 
set-up.


was (Author: jorisvandenbossche):
With pip 19.1 (released yesterday), one needs to do pip install -e . 
--no-use-pep517 --no-build-isolation to get it running with our current 
set-up.

> [Python] editable install (pip install -e .) is failing 
> 
>
> Key: ARROW-5210
> URL: https://issues.apache.org/jira/browse/ARROW-5210
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> Following the python development documentation on building arrow and pyarrow 
> ([https://arrow.apache.org/docs/developers/python.html#build-and-test]), 
> building pyarrow inplace with {{python setup.py build_ext --inplace}} works 
> fine.
>  
> But if you want to also install this inplace version in the current python 
> environment (editable install / development install) using pip ({{pip install 
> -e .}}), this fails during the {{build_ext}} / cmake phase:
> {code:none}
>  
> -- Looking for python3.7m
>     -- Found Python lib 
> /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so
>     CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
>     File "<string>", line 1, in <module>
>   ModuleNotFoundError: No module named 'numpy'
>     Call Stack (most recent call first):
>   CMakeLists.txt:186 (find_package)
>     -- Configuring incomplete, errors occurred!
>     See also 
> "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
>     See also 
> "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeError.log".
>     error: command 'cmake' failed with exit status 1
> Cleaning up...
> {code}
>  
> Alternatively, doing {{python setup.py develop}} to achieve the same still 
> works.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5210) [Python] editable install (pip install -e .) is failing

2019-04-24 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825155#comment-16825155
 ] 

Joris Van den Bossche commented on ARROW-5210:
--

With pip 19.1 (released yesterday), one needs to do pip install -e . 
--no-use-pep517 --no-build-isolation to get it running with our current 
set-up.

> [Python] editable install (pip install -e .) is failing 
> 
>
> Key: ARROW-5210
> URL: https://issues.apache.org/jira/browse/ARROW-5210
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> Following the python development documentation on building arrow and pyarrow 
> ([https://arrow.apache.org/docs/developers/python.html#build-and-test]), 
> building pyarrow inplace with {{python setup.py build_ext --inplace}} works 
> fine.
>  
> But if you want to also install this inplace version in the current python 
> environment (editable install / development install) using pip ({{pip install 
> -e .}}), this fails during the {{build_ext}} / cmake phase:
> {code:none}
>  
> -- Looking for python3.7m
>     -- Found Python lib 
> /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so
>     CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
>     File "<string>", line 1, in <module>
>   ModuleNotFoundError: No module named 'numpy'
>     Call Stack (most recent call first):
>   CMakeLists.txt:186 (find_package)
>     -- Configuring incomplete, errors occurred!
>     See also 
> "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
>     See also 
> "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeError.log".
>     error: command 'cmake' failed with exit status 1
> Cleaning up...
> {code}
>  
> Alternatively, doing {{python setup.py develop}} to achieve the same still 
> works.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3176) [Python] Overflow in Date32 column conversion to pandas

2019-04-24 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824911#comment-16824911
 ] 

Joris Van den Bossche edited comment on ARROW-3176 at 4/24/19 2:02 PM:
---

Note that the default type changed: it now gives back datetime.date objects 
instead of datetime64[D] (https://issues.apache.org/jira/browse/ARROW-3910), so 
by default you no longer have this problem. But when setting 
{{date_as_object=False}} (to get the old behaviour back), you still have the 
same overflow issue. 

Updated the original bug report to add this keyword, to keep it a reproducible 
example.


was (Author: jorisvandenbossche):
Note that the default type changed: it now gives back datetime.date objects, 
instead of datetime64[D]. Do by default you no longer have this problem. But, 
setting {{date_as_object=False}} (to have back the old behaviour), you still 
have the same overflow issue. 

Updated the original bug report to add this keyword, to keep it a reproducible 
example.

> [Python] Overflow in Date32 column conversion to pandas
> ---
>
> Key: ARROW-3176
> URL: https://issues.apache.org/jira/browse/ARROW-3176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.14.0
>
>
> When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
> seems to be an overflow at the date {{2262-04-12}} such that the type and 
> value are wrong. The issue only occurs for columns, not for arrays.
> Running on debian 9.5 w/ python2 gives
>   
> {code}
> In [1]: import numpy as np
> In [2]: import datetime
> In [3]: import pyarrow as pa
> In [4]: pa.__version__
> Out[4]: '0.10.0'
> In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
> dtype='datetime64[D]'))
> In [6]: arr.to_pandas(date_as_object=False)
> Out[6]: array(['2262-04-12'], dtype='datetime64[D]')
> In [7]: pa.column('name', arr).to_pandas(date_as_object=False)
> Out[7]:
> 0 1677-09-21 00:25:26.290448384
> Name: name, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the file's position

2019-04-24 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825258#comment-16825258
 ] 

Antoine Pitrou commented on ARROW-2835:
---

I see two other ways around this:

1) As soon as ReadAt or WriteAt is called, change the internal file state so 
that any implicitly-positioning operation (such as Read, Write or Tell) fails 
until Seek is called first.

or 2) Have an internal "positioning" lock that ensures that we can have several 
ReadAt or WriteAt calls in flight simultaneously, but that implicitly-positioning 
operations wait for the last *At call to end and restore the file pointer.

I'm not sure how easy #2 is, but it should be doable.
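A rough sketch of what #2 could look like (Python for brevity; the names are 
made up, and a real implementation would use pread/pwrite underneath, since 
concurrent seeks on one descriptor still race):

{code:python}
import threading

class PositionGuardedFile:
    def __init__(self, raw):
        self._raw = raw                    # underlying file object
        self._cond = threading.Condition()
        self._inflight = 0                 # outstanding *At calls
        self._saved_pos = None

    def read_at(self, offset, nbytes):
        with self._cond:
            if self._inflight == 0:
                self._saved_pos = self._raw.tell()
            self._inflight += 1
        try:
            self._raw.seek(offset)
            return self._raw.read(nbytes)
        finally:
            with self._cond:
                self._inflight -= 1
                if self._inflight == 0:
                    # the last *At call restores the position, then wakes
                    # any waiting implicitly-positioning operation
                    self._raw.seek(self._saved_pos)
                    self._cond.notify_all()

    def read(self, nbytes):
        with self._cond:
            # block until the last *At call has restored the file pointer
            self._cond.wait_for(lambda: self._inflight == 0)
            return self._raw.read(nbytes)
{code}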

> [C++] ReadAt/WriteAt are inconsistent with moving the file's position
> 
>
> Key: ARROW-2835
> URL: https://issues.apache.org/jira/browse/ARROW-2835
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Dimitri Vorona
>Priority: Major
> Fix For: 0.14.0
>
>
> Right now, there is inconsistent behaviour regarding moving the file's 
> position pointer after calling ReadAt or WriteAt. For example, the default 
> implementation of ReadAt seeks to the desired offset and calls Read, which 
> moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change 
> the position. WriteableFile::WriteAt seems to move the position in the current 
> implementation, but there is no docstring which prescribes this behaviour.
> Antoine suggested that *At methods shouldn't touch the position, and it makes 
> more sense, IMHO. The change isn't huge and doesn't seem to break anything 
> internally, but it might break existing user code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-24 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825402#comment-16825402
 ] 

Wes McKinney commented on ARROW-5208:
-

Seems reasonable. Would you like to submit a pull request?

> [Python] Inconsistent resulting type during casting in pa.array() when mask 
> is present
> --
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.14.0
>
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-24 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5208:

Fix Version/s: 0.14.0

> [Python] Inconsistent resulting type during casting in pa.array() when mask 
> is present
> --
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.14.0
>
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3176) [Python] Overflow in Date32 column conversion to pandas

2019-04-24 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825407#comment-16825407
 ] 

Joris Van den Bossche commented on ARROW-3176:
--

Yes, I think, ideally, arrow should be responsible for checking that the values 
fit in the range supported by pandas. Of the two remaining options, I agree 
raising is probably the best one.

 

 

> [Python] Overflow in Date32 column conversion to pandas
> ---
>
> Key: ARROW-3176
> URL: https://issues.apache.org/jira/browse/ARROW-3176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.14.0
>
>
> When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
> seems to be an overflow at the date {{2262-04-12}} such that the type and 
> value are wrong. The issue only occurs for columns, not for arrays.
> Running on debian 9.5 w/ python2 gives
>   
> {code}
> In [1]: import numpy as np
> In [2]: import datetime
> In [3]: import pyarrow as pa
> In [4]: pa.__version__
> Out[4]: '0.10.0'
> In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
> dtype='datetime64[D]'))
> In [6]: arr.to_pandas(date_as_object=False)
> Out[6]: array(['2262-04-12'], dtype='datetime64[D]')
> In [7]: pa.column('name', arr).to_pandas(date_as_object=False)
> Out[7]:
> 0 1677-09-21 00:25:26.290448384
> Name: name, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3176) [Python] Overflow in Date32 column conversion to pandas

2019-04-24 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825426#comment-16825426
 ] 

Joris Van den Bossche commented on ARROW-3176:
--

This seems to be a pandas regression: 
https://github.com/pandas-dev/pandas/issues/26206

> [Python] Overflow in Date32 column conversion to pandas
> ---
>
> Key: ARROW-3176
> URL: https://issues.apache.org/jira/browse/ARROW-3176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.14.0
>
>
> When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
> seems to be an overflow at the date {{2262-04-12}} such that the type and 
> value are wrong. The issue only occurs for columns, not for arrays.
> Running on debian 9.5 w/ python2 gives
>   
> {code}
> In [1]: import numpy as np
> In [2]: import datetime
> In [3]: import pyarrow as pa
> In [4]: pa.__version__
> Out[4]: '0.10.0'
> In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
> dtype='datetime64[D]'))
> In [6]: arr.to_pandas(date_as_object=False)
> Out[6]: array(['2262-04-12'], dtype='datetime64[D]')
> In [7]: pa.column('name', arr).to_pandas(date_as_object=False)
> Out[7]:
> 0 1677-09-21 00:25:26.290448384
> Name: name, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5211) Missing documentation under `Dictionary encoding` section on MetaData page

2019-04-24 Thread Lennox Stevenson (JIRA)
Lennox Stevenson created ARROW-5211:
---

 Summary: Missing documentation under `Dictionary encoding` section 
on MetaData page
 Key: ARROW-5211
 URL: https://issues.apache.org/jira/browse/ARROW-5211
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Lennox Stevenson


First time opening an issue here, so let me know if there's anything I 
missed / more details I can provide.

I was going through the arrow documentation at 
[https://arrow.apache.org/docs/python/] and noticed that there's a section 
that is currently blank. From what I can tell, the section 
[https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding] 
currently contains nothing. Is that intended? It was confusing to see a 
blank section, but that is just my opinion so it may not be worth changing.

If this is something worth fixing / improving, then it's probably worth either 
filling out that section or simply removing the header to avoid future confusion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5212) Array BinaryBuilder in Go library has no access to resize the values buffer

2019-04-24 Thread Jonathan A Sternberg (JIRA)
Jonathan A Sternberg created ARROW-5212:
---

 Summary: Array BinaryBuilder in Go library has no access to resize 
the values buffer
 Key: ARROW-5212
 URL: https://issues.apache.org/jira/browse/ARROW-5212
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jonathan A Sternberg


When you are dealing with a binary builder, there are three buffers: the null 
bitmap, the offset indexes, and the values buffer which contains the actual 
data.

When {{Reserve}} or {{Resize}} are used, the null bitmap and the offsets are 
modified to allow additional appends to function. This seems correct to me: 
from the number of values alone, there's no way to know how much the values 
buffer should be resized until the values are actually appended.

But when you are then appending a bunch of string values, there's no 
additional API to preallocate the size of that last buffer. That means that 
batch-appending a large number of strings will constantly reallocate even if 
you know the total size ahead of time.

There should be some additional API to modify this last buffer, such as 
{{ReserveBytes}} and {{ResizeBytes}}, that would correspond to the {{Reserve}} 
and {{Resize}} methods but relate to the values buffer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3176) [Python] Overflow in Date32 column conversion to pandas

2019-04-24 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825405#comment-16825405
 ] 

Wes McKinney commented on ARROW-3176:
-

This is a limitation of pandas's {{datetime64[ns]}}. One could argue for 
overflow checking on the to_pandas code path. There are three options:

* Current behavior (not that big of a deal since we return 
{{datetime.date}} by default now)
* Raise on overflow
* Return NULL on overflow

None of these options is great, but maybe option 2 is the best?
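A sketch of what option 2 could look like on the conversion path (assuming 
date32 values are days since the UNIX epoch; the bound follows from int64 
nanoseconds):

{code:python}
import numpy as np

# datetime64[ns] tops out at 2**63 - 1 nanoseconds past 1970-01-01, i.e.
# partway through 2262-04-11, which is why 2262-04-12 wraps around.
MAX_DAYS = np.iinfo(np.int64).max // (24 * 60 * 60 * 10**9)  # 106751

def date32_to_datetime64ns(days):
    days = np.asarray(days, dtype=np.int64)
    if (days > MAX_DAYS).any():
        raise OverflowError("date out of range for datetime64[ns]")
    return days.astype('datetime64[D]').astype('datetime64[ns]')
{code}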

> [Python] Overflow in Date32 column conversion to pandas
> ---
>
> Key: ARROW-3176
> URL: https://issues.apache.org/jira/browse/ARROW-3176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.14.0
>
>
> When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
> seems to be an overflow at the date {{2262-04-12}} such that the type and 
> value are wrong. The issue only occurs for columns, not for arrays.
> Running on debian 9.5 w/ python2 gives
>   
> {code}
> In [1]: import numpy as np
> In [2]: import datetime
> In [3]: import pyarrow as pa
> In [4]: pa.__version__
> Out[4]: '0.10.0'
> In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
> dtype='datetime64[D]'))
> In [6]: arr.to_pandas(date_as_object=False)
> Out[6]: array(['2262-04-12'], dtype='datetime64[D]')
> In [7]: pa.column('name', arr).to_pandas(date_as_object=False)
> Out[7]:
> 0 1677-09-21 00:25:26.290448384
> Name: name, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3176) [Python] Overflow in Date32 column conversion to pandas

2019-04-24 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825413#comment-16825413
 ] 

Joris Van den Bossche commented on ARROW-3176:
--

Actually, I take that back. It seems that it is pandas that is not doing a 
proper check (assuming that arrow passes datetime64[D] data, similar to what 
Array.to_pandas returns), and it is pandas that converts the datetime64[D] 
to an incorrect datetime64[ns]:

 
{code}
In [22]: pd.Series(np.array(['2262-04-12'], dtype='datetime64[D]'))
Out[22]: 
0   1677-09-21 00:25:26.290448384
dtype: datetime64[ns]{code}

Of course, you still get the "wrong" behaviour when using arrow's 
{{to_pandas}}, but I might consider this a bug on the pandas side.

> [Python] Overflow in Date32 column conversion to pandas
> ---
>
> Key: ARROW-3176
> URL: https://issues.apache.org/jira/browse/ARROW-3176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.14.0
>
>
> When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
> seems to be an overflow at the date {{2262-04-12}} such that the type and 
> value are wrong. The issue only occurs for columns, not for arrays.
> Running on debian 9.5 w/ python2 gives
>   
> {code}
> In [1]: import numpy as np
> In [2]: import datetime
> In [3]: import pyarrow as pa
> In [4]: pa.__version__
> Out[4]: '0.10.0'
> In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
> dtype='datetime64[D]'))
> In [6]: arr.to_pandas(date_as_object=False)
> Out[6]: array(['2262-04-12'], dtype='datetime64[D]')
> In [7]: pa.column('name', arr).to_pandas(date_as_object=False)
> Out[7]:
> 0 1677-09-21 00:25:26.290448384
> Name: name, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3978) [C++] Implement hashing, dictionary-encoding for StructArray

2019-04-24 Thread Jacques Nadeau (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825451#comment-16825451
 ] 

Jacques Nadeau commented on ARROW-3978:
---

Here is some info about what we found worked well. Note that it doesn't go into 
a lot of detail about the pivot algorithm beyond the basic concepts of fixed 
and variable vectors.

[https://docs.google.com/document/d/1Yk6IvDL28IzEjqcqSkFdevRyMrC8_kwzEatHvcOnawM/edit]

 

Main idea around pivot (a toy sketch follows this list):
 * separate fixed and variable-width data and keep each contiguous
 * coalesce the nullability bits and the values together at the start of the 
data structure (saves space, increases the likelihood of an early mismatch)
 * include the length of the variable part in the fixed container to reduce the 
likelihood of jumping to the variable container
 * have specialized cases that look at the actual existence of nulls for each 
word and fork behavior based on that, to improve the performance of the common 
case where things are mostly null or mostly not null
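To make that concrete, a toy sketch of the pivot (Python, invented names; the 
real implementation works on Arrow buffers and words, not Python objects):

{code:python}
import struct

def pivot(rows):
    """Pivot (int, string) rows into one fixed-width block plus one
    contiguous variable-width block."""
    fixed = bytearray()     # per row: null bits | int value | var len | var offset
    variable = bytearray()  # all string bytes, back to back
    for int_val, str_val in rows:
        null_bits = (int_val is not None) | ((str_val is not None) << 1)
        data = (str_val or "").encode("utf-8")
        # null bits sit at the front so key comparisons can mismatch early,
        # and the variable length is stored here so equal-length checks can
        # often avoid touching the variable block at all
        fixed += struct.pack("<Bqii", null_bits, int_val or 0,
                             len(data), len(variable))
        variable += data
    return bytes(fixed), bytes(variable)

# two rows pivot into two 17-byte fixed entries plus b"abc" of variable data
fixed, variable = pivot([(1, "a"), (None, "bc")])
{code}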

The latest code for the Arrow pivot algorithms specifically that we use can be 
found here:

Pivots: 
[https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java]

Unpivots: 
[https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java]

Hash Table: 
[https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/LBlockHashTable.java]

We'd be happy to donate this code/algo to the community as it would probably 
serve as a good foundation.

 

Note the doc is probably somewhat out of date with the actual implementation as 
it was written early on in development.

 

> [C++] Implement hashing, dictionary-encoding for StructArray
> 
>
> Key: ARROW-3978
> URL: https://issues.apache.org/jira/browse/ARROW-3978
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> This is a central requirement for hash-aggregations such as
> {code}
> SELECT AGG_FUNCTION(expr)
> FROM table
> GROUP BY expr1, expr2, ...
> {code}
> The materialized keys in the GROUP BY section form a struct, which can be 
> incrementally hashed to produce dictionary codes suitable for computing 
> aggregates or any other purpose. 
> There are a few subtasks related to this, such as efficiently constructing a 
> record (that can be hashed quickly) to identify each "row" in the struct. 
> Maybe we should start with that first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-4717) [C#] Consider exposing ValueTask instead of Task

2019-04-24 Thread Eric Erhardt (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Erhardt reassigned ARROW-4717:
---

Assignee: Eric Erhardt

> [C#] Consider exposing ValueTask instead of Task
> 
>
> Key: ARROW-4717
> URL: https://issues.apache.org/jira/browse/ARROW-4717
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Eric Erhardt
>Assignee: Eric Erhardt
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/arrow/pull/3736#pullrequestreview-207169204] 
> for the discussion and 
> [https://devblogs.microsoft.com/dotnet/understanding-the-whys-whats-and-whens-of-valuetask/]
>  for the reasoning.
> Using `Task` in public API requires that a new Task instance be allocated 
> on every call. When returning synchronously, using ValueTask will allow the 
> method to not allocate.
> In order to do this, we will need to take a new dependency on  
> {{System.Threading.Tasks.Extensions}} NuGet package.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4935) [C++] Errors from jemalloc when building pyarrow from source on OSX and Debian

2019-04-24 Thread Ian Mateus Vieira Manor (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825464#comment-16825464
 ] 

Ian Mateus Vieira Manor commented on ARROW-4935:


I'm having the same problem running
{code:java}
cmake install{code}
on OSX, but I have no
{code:java}
/build/jemalloc_ep-prefix/src/jemalloc_ep/dist{code}
 directory to delete.

> [C++] Errors from jemalloc when building pyarrow from source on OSX and Debian
> --
>
> Key: ARROW-4935
> URL: https://issues.apache.org/jira/browse/ARROW-4935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1
> Environment: OSX, Debian, Python==3.6.7
>Reporter: Gregory Hayes
>Priority: Critical
>  Labels: build, newbie
>
> My attempts to build pyarrow from source are failing. I've set up the conda 
> environment using the instructions provided in the Develop instructions, and 
> have tried this on both Debian and OSX. When I run CMAKE in debug mode on 
> OSX, the output is:
> {code:java}
> -- Building using CMake version: 3.14.0
> -- Arrow version: 0.13.0 (full: '0.13.0-SNAPSHOT')
> -- clang-tidy not found
> -- clang-format not found
> -- infer found at /usr/local/bin/infer
> -- Using ccache: /usr/local/bin/ccache
> -- Found cpplint executable at 
> /Users/Greg/documents/repos/arrow/cpp/build-support/cpplint.py
> -- Compiler command: env LANG=C 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>  -v
> -- Compiler version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
> Target: x86_64-apple-darwin18.2.0
> Thread model: posix
> InstalledDir: 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
> -- Compiler id: AppleClang
> Selected compiler clang 4.1.0svn
> -- Arrow build warning level: CHECKIN
> Configured for DEBUG build (set with cmake 
> -DCMAKE_BUILD_TYPE={release,debug,...})
> -- Build Type: DEBUG
> -- BOOST_VERSION: 1.67.0
> -- BROTLI_VERSION: v0.6.0
> -- CARES_VERSION: 1.15.0
> -- DOUBLE_CONVERSION_VERSION: v3.1.1
> -- FLATBUFFERS_VERSION: v1.10.0
> -- GBENCHMARK_VERSION: v1.4.1
> -- GFLAGS_VERSION: v2.2.0
> -- GLOG_VERSION: v0.3.5
> -- GRPC_VERSION: v1.18.0
> -- GTEST_VERSION: 1.8.1
> -- JEMALLOC_VERSION: 17c897976c60b0e6e4f4a365c751027244dada7a
> -- LZ4_VERSION: v1.8.3
> -- ORC_VERSION: 1.5.4
> -- PROTOBUF_VERSION: v3.6.1
> -- RAPIDJSON_VERSION: v1.1.0
> -- RE2_VERSION: 2018-10-01
> -- SNAPPY_VERSION: 1.1.3
> -- THRIFT_VERSION: 0.11.0
> -- ZLIB_VERSION: 1.2.8
> -- ZSTD_VERSION: v1.3.7
> -- Boost version: 1.68.0
> -- Found the following Boost libraries:
> --   regex
> --   system
> --   filesystem
> -- Boost include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Boost libraries: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_system_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib
> Added shared library dependency boost_filesystem_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_regex_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib
> Added static library dependency double-conversion_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- double-conversion include dir: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- double-conversion static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- GFLAGS_HOME: /Users/Greg/anaconda3/envs/pyarrow-dev
> -- GFlags include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- GFlags static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> Added static library dependency gflags_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> -- RapidJSON include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Found the Flatbuffers library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libflatbuffers.a
> -- Flatbuffers include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Flatbuffers compiler: /Users/Greg/anaconda3/envs/pyarrow-dev/bin/flatc
> Added static library dependency jemalloc_static: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a
> Added shared library dependency jemalloc_shared: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc.dylib
> -- Found hdfs.h at: 
> /Users/Greg/documents/repos/arrow/cpp/thirdparty/hadoop/include/hdfs.h
> -- Found the ZLIB shared library: 
> 

[jira] [Assigned] (ARROW-5207) [Java] add APIs to support vector reuse

2019-04-24 Thread Ji Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu reassigned ARROW-5207:
-

Assignee: Ji Liu

> [Java] add APIs to support vector reuse
> ---
>
> Key: ARROW-5207
> URL: https://issues.apache.org/jira/browse/ARROW-5207
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>
> In some scenarios we hope that ValueVector can be reused to reduce creation 
> overhead. This is very common in the shuffle stage: there is no need to create a 
> ValueVector or realloc its buffers every time. Suppose the recordCount of the 
> ValueVector and the capacity of its buffers are written in the stream; when we 
> deserialize it, we can simply judge from the dataLength whether a realloc is 
> needed.
> My proposal is to add APIs in ValueVector to handle this logic; otherwise users 
> have to implement it themselves if they want reuse, which is not user-friendly. 
> If you agree with this, I would like to take this ticket. Thanks
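
As a rough illustration of the reuse check the proposal describes (a hedged 
Python sketch; the proposal itself targets the Java ValueVector, and the names 
below are invented for this example):
{code:python}
class ReusableBuffer:
    def __init__(self, capacity=0):
        self.buf = bytearray(capacity)
        self.length = 0

    def load(self, data):
        # Realloc only when the incoming dataLength exceeds the current
        # capacity; otherwise reuse the existing allocation as-is.
        if len(data) > len(self.buf):
            self.buf = bytearray(len(data))
        self.buf[:len(data)] = data
        self.length = len(data)
{code}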



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5206) [Java] Add APIs in MessageSerializer to directly serialize/deserialize ArrowBuf

2019-04-24 Thread Ji Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu reassigned ARROW-5206:
-

Assignee: Ji Liu

> [Java] Add APIs in MessageSerializer to directly serialize/deserialize 
> ArrowBuf
> ---
>
> Key: ARROW-5206
> URL: https://issues.apache.org/jira/browse/ARROW-5206
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>
> It seems there are no APIs to directly write an ArrowBuf to an OutputStream or 
> read an ArrowBuf from an InputStream. These APIs may be helpful when users use 
> Vectors directly instead of RecordBatch; in that case, APIs to 
> serialize/deserialize the dataBuffer/validityBuffer/offsetBuffer are necessary.
> I would like to work on this and make it my first contribution to Arrow. What 
> do you think?
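
For context, a minimal sketch of what direct buffer serialization could look 
like, written in Python with a simple length prefix (illustrative only; this is 
not Arrow's actual IPC framing):
{code:python}
import io
import struct

def write_buf(stream, buf):
    stream.write(struct.pack('<q', len(buf)))  # 8-byte little-endian length prefix
    stream.write(buf)

def read_buf(stream):
    (length,) = struct.unpack('<q', stream.read(8))
    return stream.read(length)

stream = io.BytesIO()
write_buf(stream, b'\x01\x02\x03')
stream.seek(0)
assert read_buf(stream) == b'\x01\x02\x03'
{code}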



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5071) [Benchmarking] Performs a benchmark run with archery

2019-04-24 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-5071:
--
Description: 
Run all regression benchmarks, consume output and re-format according to the 
format required by dev/benchmarking specification and/or push to upstream 
database.

This would be implemented as `archery benchmark run`. Provide facility to 
save/load results as a StaticRunner (such that it can be re-used in comparison 
without running the benchmark again).

  was:
Run all regression benchmarks, consume output and re-format according to the 
format required by dev/benchmarking specification.

This would be implemented as `archery benchmark run`. Provide facility to 
save/load results as a StaticRunner (such that it can be re-used in comparison 
without running the benchmark again).


> [Benchmarking] Performs a benchmark run with archery
> 
>
> Key: ARROW-5071
> URL: https://issues.apache.org/jira/browse/ARROW-5071
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Run all regression benchmarks, consume output and re-format according to the 
> format required by dev/benchmarking specification and/or push to upstream 
> database.
> This would be implemented as `archery benchmark run`. Provide facility to 
> save/load results as a StaticRunner (such that it can be re-used in 
> comparison without running the benchmark again).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5125) [Python] Cannot roundtrip extreme dates through pyarrow

2019-04-24 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5125:
-
Labels: parquet windows  (was: parquet)

> [Python] Cannot roundtrip extreme dates through pyarrow
> ---
>
> Key: ARROW-5125
> URL: https://issues.apache.org/jira/browse/ARROW-5125
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 
> 2019, 22:22:05)
>Reporter: Max Bolingbroke
>Priority: Major
>  Labels: parquet, windows
> Fix For: 0.14.0
>
>
> You can roundtrip many dates through a pyarrow array:
>  
> {noformat}
> >>> pa.array([datetime.date(1980, 1, 1)], type=pa.date32())[0]
> datetime.date(1980, 1, 1){noformat}
>  
> But (on Windows at least), not extreme ones:
>  
> {noformat}
> >>> pa.array([datetime.date(1960, 1, 1)], type=pa.date32())[0]
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
>  File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
> OSError: [Errno 22] Invalid argument
> >>> pa.array([datetime.date(3200, 1, 1)], type=pa.date32())[0]
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
>  File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
> {noformat}
> This is because datetime.utcfromtimestamp and datetime.timestamp fail on 
> these dates, but it seems we should be able to avoid invoking these functions 
> entirely when deserializing dates. Ideally we would be able to roundtrip these 
> as datetimes too, of course, but it's less clear that this will be easy. For 
> some context on this see [https://bugs.python.org/issue29097].
> This may be related to ARROW-3176 and ARROW-4746
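
A hedged sketch of the avoidance suggested above: build datetime.date from the 
raw day offset with pure date arithmetic, which never calls utcfromtimestamp 
(the helper name is made up for this example):
{code:python}
import datetime

EPOCH = datetime.date(1970, 1, 1)

def date32_to_py(days_since_epoch):
    # pure date arithmetic, valid for the full range datetime.date supports
    return EPOCH + datetime.timedelta(days=days_since_epoch)

print(date32_to_py(-3653))                                     # 1960-01-01
print(date32_to_py((datetime.date(3200, 1, 1) - EPOCH).days))  # 3200-01-01
{code}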



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5209) [Java] Add performance benchmarks from SQL workloads

2019-04-24 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5209:
---

 Summary: [Java] Add performance benchmarks from SQL workloads
 Key: ARROW-5209
 URL: https://issues.apache.org/jira/browse/ARROW-5209
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


To improve the performance of Arrow implementations, some performance 
benchmarks must be set up first.

In this issue, we want to provide some performance benchmarks extracted from 
our SQL engine, which is going to be made open source soon. The workloads are 
obtained by running the open SQL benchmark TPC-H.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-24 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5208:
--
Summary: [Python] Inconsistent resulting type during casting in pa.array() 
when mask is present  (was: Inconsistent resulting type during casting in 
pa.array() when mask is present)

> [Python] Inconsistent resulting type during casting in pa.array() when mask 
> is present
> --
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3176) [Python] Overflow in Date32 column conversion to pandas

2019-04-24 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-3176:
-
Description: 
When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
seems to be an overflow at the date {{2262-04-12}} such that the type and value 
are wrong. The issue only occurs for columns, not for arrays.

Running on debian 9.5 w/ python2 gives
  
{code}
In [1]: import numpy as np

In [2]: import datetime

In [3]: import pyarrow as pa

In [4]: pa.__version__
Out[4]: '0.10.0'

In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
dtype='datetime64[D]'))

In [6]: arr.to_pandas(date_as_object=False)
Out[6]: array(['2262-04-12'], dtype='datetime64[D]')

In [7]: pa.column('name', arr).to_pandas(date_as_object=False)
Out[7]:
0 1677-09-21 00:25:26.290448384
Name: name, dtype: datetime64[ns]
{code}

  was:
When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
seems to be an overflow at the date {{2262-04-12}} such that the type and value 
are wrong. The issue only occurs for columns, not for arrays.

Running on debian 9.5 w/ python2 gives
  
{code}
In [1]: import numpy as np

In [2]: import datetime

In [3]: import pyarrow as pa

In [4]: pa.__version__
Out[4]: '0.10.0'

In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
dtype='datetime64[D]'))

In [6]: arr.to_pandas()
Out[6]: array(['2262-04-12'], dtype='datetime64[D]')

In [7]: pa.column('name', arr).to_pandas()
Out[7]:
0 1677-09-21 00:25:26.290448384
Name: name, dtype: datetime64[ns]
{code}


> [Python] Overflow in Date32 column conversion to pandas
> ---
>
> Key: ARROW-3176
> URL: https://issues.apache.org/jira/browse/ARROW-3176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.14.0
>
>
> When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
> seems to be an overflow at the date {{2262-04-12}} such that the type and 
> value are wrong. The issue only occurs for columns, not for arrays.
> Running on debian 9.5 w/ python2 gives
>   
> {code}
> In [1]: import numpy as np
> In [2]: import datetime
> In [3]: import pyarrow as pa
> In [4]: pa.__version__
> Out[4]: '0.10.0'
> In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
> dtype='datetime64[D]'))
> In [6]: arr.to_pandas(date_as_object=False)
> Out[6]: array(['2262-04-12'], dtype='datetime64[D]')
> In [7]: pa.column('name', arr).to_pandas(date_as_object=False)
> Out[7]:
> 0 1677-09-21 00:25:26.290448384
> Name: name, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5165) [Python][Documentation] Build docs don't suggest assigning $ARROW_BUILD_TYPE

2019-04-24 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-5165:
-

Assignee: Joris Van den Bossche  (was: Rok Mihevc)

> [Python][Documentation] Build docs don't suggest assigning $ARROW_BUILD_TYPE
> 
>
> Key: ARROW-5165
> URL: https://issues.apache.org/jira/browse/ARROW-5165
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Documentation, Python
>Affects Versions: 0.14.0
>Reporter: Rok Mihevc
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [Build documentation|https://arrow.apache.org/docs/developers/python.html] is 
> great. However it does not explicitly suggest assigning a value to 
> `ARROW_BUILD_TYPE` and the error thrown is not obvious:
> {code:bash}
> ...
>  [100%] Built target _parquet
>  -- Finished cmake --build for pyarrow
>  Bundling includes: include
>  error: [Errno 2] No such file or directory: 'include'
> {code}
> This cost me a couple of hours to debug.
> Could we include a note in [build 
> documentation|https://arrow.apache.org/docs/developers/python.html] 
> suggesting devs to run:
> {code:bash}
> export ARROW_BUILD_TYPE=release
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5165) [Python][Documentation] Build docs don't suggest assigning $ARROW_BUILD_TYPE

2019-04-24 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5165.
---
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4192
[https://github.com/apache/arrow/pull/4192]

> [Python][Documentation] Build docs don't suggest assigning $ARROW_BUILD_TYPE
> 
>
> Key: ARROW-5165
> URL: https://issues.apache.org/jira/browse/ARROW-5165
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Documentation, Python
>Affects Versions: 0.14.0
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [Build documentation|https://arrow.apache.org/docs/developers/python.html] is 
> great. However it does not explicitly suggest assigning a value to 
> `ARROW_BUILD_TYPE` and the error thrown is not obvious:
> {code:bash}
> ...
>  [100%] Built target _parquet
>  -- Finished cmake --build for pyarrow
>  Bundling includes: include
>  error: [Errno 2] No such file or directory: 'include'
> {code}
> This cost me a couple of hours to debug.
> Could we include a note in [build 
> documentation|https://arrow.apache.org/docs/developers/python.html] 
> suggesting devs to run:
> {code:bash}
> export ARROW_BUILD_TYPE=release
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5210) [Python] editable install (pip install -e .) is failing

2019-04-24 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5210:


 Summary: [Python] editable install (pip install -e .) is failing 
 Key: ARROW-5210
 URL: https://issues.apache.org/jira/browse/ARROW-5210
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Following the python development documentation on building arrow and pyarrow 
([https://arrow.apache.org/docs/developers/python.html#build-and-test]), 
building pyarrow inplace with {{python setup.py build_ext --inplace}} works 
fine.

 

But if you want to also install this inplace version in the current python 
environment (editable install / development install) using pip ({{pip install 
-e .}}), this fails during the {{build_ext}} / cmake phase:
{code:none}
 
-- Looking for python3.7m
    -- Found Python lib 
/home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so
    CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
  NumPy import failure:

  Traceback (most recent call last):

    File "<stdin>", line 1, in <module>

  ModuleNotFoundError: No module named 'numpy'

    Call Stack (most recent call first):
  CMakeLists.txt:186 (find_package)


    -- Configuring incomplete, errors occurred!
    See also 
"/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
    See also 
"/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeError.log".
    error: command 'cmake' failed with exit status 1
Cleaning up...
{code}
 

Alternatively, doing `python setup.py develop` to achieve the same does work.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5200) [Java] Provide light-weight arrow APIs

2019-04-24 Thread Liya Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824900#comment-16824900
 ] 

Liya Fan commented on ARROW-5200:
-

Sounds reasonable. Thanks a lot for your comments. 

We have opened a new Jira (ARROW-5209) to set up some performance benchmarks 
from our SQL engine, which is going to be made open source. The benchmarks are 
extracted by running the open SQL benchmark TPC-H.

> [Java] Provide light-weight arrow APIs
> --
>
> Key: ARROW-5200
> URL: https://issues.apache.org/jira/browse/ARROW-5200
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
> Attachments: image-2019-04-23-15-19-34-187.png
>
>
> We are trying to incorporate Apache Arrow into the Apache Flink runtime. We find 
> Arrow an amazing library, which greatly simplifies the support of columnar 
> data formats.
> However, for many scenarios, we find the performance unacceptable. Our 
> investigation shows the reason is that there are too many redundant checks 
> and computations in the Arrow APIs.
> For example, the following figure shows that in a single call to the 
> Float8Vector.get(int) method (one of the most frequently used APIs in 
> Flink computation), there are 20+ method invocations.
> !image-2019-04-23-15-19-34-187.png!
>  
> There are many other APIs with similar problems. We believe these checks help 
> ensure the integrity of the program. However, they also impact performance 
> severely. In our evaluation, performance may degrade by two or three orders of 
> magnitude compared to accessing data on heap memory. We think, at least for 
> some scenarios, we can give the responsibility for integrity checks to 
> application owners. If they can be sure all the checks have been passed, we can 
> provide them light-weight APIs with the inherent high performance.
> In the light-weight APIs, we only perform minimal checks, or avoid checks 
> altogether. Application owners can still develop and debug their code using the 
> original heavy-weight APIs. Once all bugs have been fixed, they can switch to 
> the light-weight APIs in their products and enjoy the consequent high 
> performance.
>  
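
To make the trade-off concrete, a toy sketch in Python (the issue concerns the 
Java Float8Vector; the class and method names below are invented for 
illustration):
{code:python}
import struct

class Float8Buffer:
    def __init__(self, data, validity):
        self.data, self.validity = data, validity

    def get(self, index):
        # heavy-weight path: bounds check plus null check on every call
        if not 0 <= index < len(self.data) // 8:
            raise IndexError(index)
        if not (self.validity[index >> 3] >> (index & 7)) & 1:
            raise ValueError('null at index %d' % index)
        return struct.unpack_from('<d', self.data, index * 8)[0]

    def get_unsafe(self, index):
        # light-weight path: the caller guarantees bounds and non-null
        return struct.unpack_from('<d', self.data, index * 8)[0]

buf = Float8Buffer(struct.pack('<2d', 1.5, 2.5), bytes([0b11]))
assert buf.get(1) == buf.get_unsafe(1) == 2.5
{code}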



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3176) [Python] Overflow in Date32 column conversion to pandas

2019-04-24 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824911#comment-16824911
 ] 

Joris Van den Bossche commented on ARROW-3176:
--

Note that the default type changed: it now gives back datetime.date objects 
instead of datetime64[D]. So by default you no longer have this problem. But 
when setting {{date_as_object=False}} (to get back the old behaviour), you still 
hit the same overflow issue. 

Updated the original bug report to add this keyword, to keep it a reproducible 
example.
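
For the record, the reported value is consistent with plain int64 wraparound: 
day 106752 (2262-04-12) exceeds the roughly 106751.99 days representable in 
nanoseconds since the epoch. A small sketch of the arithmetic:
{code:python}
import datetime
import numpy as np

NS_PER_DAY = 86400 * 10**9
days = (datetime.date(2262, 4, 12) - datetime.date(1970, 1, 1)).days  # 106752
ns = days * NS_PER_DAY                  # 9223372800000000000 > 2**63 - 1
wrapped = (ns + 2**63) % 2**64 - 2**63  # two's-complement int64 wraparound
print(np.datetime64(wrapped, 'ns'))     # 1677-09-21T00:25:26.290448384
{code}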

> [Python] Overflow in Date32 column conversion to pandas
> ---
>
> Key: ARROW-3176
> URL: https://issues.apache.org/jira/browse/ARROW-3176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.14.0
>
>
> When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
> seems to be an overflow at the date {{2262-04-12}} such that the type and 
> value are wrong. The issue only occurs for columns, not for arrays.
> Running on debian 9.5 w/ python2 gives
>   
> {code}
> In [1]: import numpy as np
> In [2]: import datetime
> In [3]: import pyarrow as pa
> In [4]: pa.__version__
> Out[4]: '0.10.0'
> In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
> dtype='datetime64[D]'))
> In [6]: arr.to_pandas(date_as_object=False)
> Out[6]: array(['2262-04-12'], dtype='datetime64[D]')
> In [7]: pa.column('name', arr).to_pandas(date_as_object=False)
> Out[7]:
> 0 1677-09-21 00:25:26.290448384
> Name: name, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5201) [Python] Import ABCs from collections is deprecated in Python 3.7

2019-04-24 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5201.
---
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4187
[https://github.com/apache/arrow/pull/4187]

> [Python] Import ABCs from collections is deprecated in Python 3.7
> -
>
> Key: ARROW-5201
> URL: https://issues.apache.org/jira/browse/ARROW-5201
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> From running the tests, I see a few deprecation warnings related to that on 
> Python 3, abstract base classes should be imported from `collections.abc` 
> instead of `collections`:
> {code:none}
> pyarrow/tests/test_array.py:808
>   /home/joris/scipy/repos/arrow/python/pyarrow/tests/test_array.py:808: 
> DeprecationWarning: Using or importing the ABCs from 'collections' instead of 
> from 'collections.abc' is deprecated, and in 3.8 it will stop working
>     pa.struct([pa.field('a', pa.int64()), pa.field('b', pa.string())]))
> pyarrow/tests/test_table.py:18
>   /home/joris/scipy/repos/arrow/python/pyarrow/tests/test_table.py:18: 
> DeprecationWarning: Using or importing the ABCs from 'collections' instead of 
> from 'collections.abc' is deprecated, and in 3.8 it will stop working
>     from collections import OrderedDict, Iterable
> pyarrow/tests/test_feather.py::TestFeatherReader::test_non_string_columns
>   /home/joris/scipy/repos/arrow/python/pyarrow/pandas_compat.py:294: 
> DeprecationWarning: Using or importing the ABCs from 'collections' instead of 
> from 'collections.abc' is deprecated, and in 3.8 it will stop working
>     elif isinstance(name, collections.Sequence):{code}
> Those could be imported depending on python 2/3 in the ``pyarrow.compat`` 
> module.
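
A minimal sketch of the version-dependent import the last paragraph suggests 
(illustrative; {{pyarrow.compat}} is the real module, but this exact snippet is 
not its code):
{code:python}
import sys

if sys.version_info[0] >= 3:
    from collections.abc import Iterable, Sequence
else:
    from collections import Iterable, Sequence
{code}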



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4934) [Python] Address deprecation notice that will be a bug in Python 3.8

2019-04-24 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-4934.
--
Resolution: Fixed

Apparently https://issues.apache.org/jira/browse/ARROW-5201 (which was just 
fixed) was a duplicate of this.

> [Python] Address deprecation notice that will be a bug in Python 3.8 
> -
>
> Key: ARROW-4934
> URL: https://issues.apache.org/jira/browse/ARROW-4934
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> originally reported as https://github.com/apache/arrow/issues/3839



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5204) [C++] Improve BufferBuilder performance

2019-04-24 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5204.
---
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4193
[https://github.com/apache/arrow/pull/4193]

> [C++] Improve BufferBuilder performance
> ---
>
> Key: ARROW-5204
> URL: https://issues.apache.org/jira/browse/ARROW-5204
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> BufferBuilder makes a spurious memset() when extending the buffer size.
> We could also tweak the overallocation strategy in Reserve().
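
For intuition, a hedged sketch of the kind of geometric overallocation 
Reserve() could use (Python, illustrative only; not the actual C++ code):
{code:python}
def grow_capacity(capacity, requested):
    # Grow geometrically rather than to the exact request, so repeated appends
    # amortize to O(1) resizes; newly reserved bytes need no memset until they
    # are actually handed out.
    new_capacity = max(capacity, 64)
    while new_capacity < requested:
        new_capacity *= 2
    return new_capacity

assert grow_capacity(64, 65) == 128
assert grow_capacity(256, 100) == 256
{code}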



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5208) Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-24 Thread Artem KOZHEVNIKOV (JIRA)
Artem KOZHEVNIKOV created ARROW-5208:


 Summary: Inconsistent resulting type during casting in pa.array() 
when mask is present
 Key: ARROW-5208
 URL: https://issues.apache.org/jira/browse/ARROW-5208
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
Reporter: Artem KOZHEVNIKOV


I would expect Int64Array type in all cases below:
{code:java}
pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
[4, null, 4, null]

pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
[4, null, 4, null]

pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
[4, null, 4, null]{code}
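
A hedged observation: passing an explicit {{type}} appears to sidestep the 
inference inconsistency, since masked-out entries then never influence the 
result type (sketch only):
{code:python}
import numpy as np
import pyarrow as pa

mask = np.array([False, True, False, True])
arr = pa.array([4, None, 4, None], type=pa.int64(), mask=mask)
print(arr.type)  # int64, independent of the masked entries
{code}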



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5208) Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-24 Thread Artem KOZHEVNIKOV (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem KOZHEVNIKOV updated ARROW-5208:
-
Description: 
I would expect Int64Array type in all cases below:
{code:java}
>>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
[4, null, 4, null]

>>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
[4, null, 4, null]

>>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
[4, null, 4, null]{code}

  was:
I would expect Int64Array type in all cases below:
{code:java}
pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
[4, null, 4, null]

pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
[4, null, 4, null]

pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
[4, null, 4, null]{code}


> Inconsistent resulting type during casting in pa.array() when mask is present
> -
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5210) [Python] editable install (pip install -e .) is failing

2019-04-24 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5210:
-
Description: 
Following the python development documentation on building arrow and pyarrow 
([https://arrow.apache.org/docs/developers/python.html#build-and-test]), 
building pyarrow inplace with {{python setup.py build_ext --inplace}} works 
fine.

 

But if you want to also install this inplace version in the current python 
environment (editable install / development install) using pip ({{pip install 
-e .}}), this fails during the {{build_ext}} / cmake phase:
{code:none}
 
-- Looking for python3.7m
    -- Found Python lib 
/home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so
    CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
  NumPy import failure:

  Traceback (most recent call last):

    File "<stdin>", line 1, in <module>

  ModuleNotFoundError: No module named 'numpy'

    Call Stack (most recent call first):
  CMakeLists.txt:186 (find_package)


    -- Configuring incomplete, errors occurred!
    See also 
"/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
    See also 
"/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeError.log".
    error: command 'cmake' failed with exit status 1
Cleaning up...
{code}
 

Alternatively, doing {{python setup.py develop}} to achieve the same still 
works.

 

  was:
Following the python development documentation on building arrow and pyarrow 
([https://arrow.apache.org/docs/developers/python.html#build-and-test]), 
building pyarrow inplace with {{python setup.py build_ext --inplace}} works 
fine.

 

But if you want to also install this inplace version in the current python 
environment (editable install / development install) using pip ({{pip install 
-e .}}), this fails during the {{build_ext}} / cmake phase:
{code:none}
 
-- Looking for python3.7m
    -- Found Python lib 
/home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so
    CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
  NumPy import failure:

  Traceback (most recent call last):

    File "<stdin>", line 1, in <module>

  ModuleNotFoundError: No module named 'numpy'

    Call Stack (most recent call first):
  CMakeLists.txt:186 (find_package)


    -- Configuring incomplete, errors occurred!
    See also 
"/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
    See also 
"/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeError.log".
    error: command 'cmake' failed with exit status 1
Cleaning up...
{code}
 

Alternatively, doing `python setup.py develop` to achieve the same does work.

 


> [Python] editable install (pip install -e .) is failing 
> 
>
> Key: ARROW-5210
> URL: https://issues.apache.org/jira/browse/ARROW-5210
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> Following the python development documentation on building arrow and pyarrow 
> ([https://arrow.apache.org/docs/developers/python.html#build-and-test]), 
> building pyarrow inplace with {{python setup.py build_ext --inplace}} works 
> fine.
>  
> But if you want to also install this inplace version in the current python 
> environment (editable install / development install) using pip ({{pip install 
> -e .}}), this fails during the {{build_ext}} / cmake phase:
> {code:none}
>  
> -- Looking for python3.7m
>     -- Found Python lib 
> /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so
>     CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in <module>
>   ModuleNotFoundError: No module named 'numpy'
>     Call Stack (most recent call first):
>   CMakeLists.txt:186 (find_package)
>     -- Configuring incomplete, errors occurred!
>     See also 
> "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
>     See also 
> "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeError.log".
>     error: command 'cmake' failed with exit status 1
> Cleaning up...
> {code}
>  
> Alternatively, doing {{python setup.py develop}} to achieve the same still 
> works.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3767) [C++] Add cast for Null to any type

2019-04-24 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-3767:
-

Assignee: Antoine Pitrou

> [C++] Add cast for Null to any type
> ---
>
> Key: ARROW-3767
> URL: https://issues.apache.org/jira/browse/ARROW-3767
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> Casting a column from NullType to any other type is possible as the resulting 
> array will also be all-null but simply with a different type annotation.
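
Once implemented, the expected behaviour might look like the following (a 
hedged sketch of the cast this issue proposes, shown through pyarrow for 
brevity):
{code:python}
import pyarrow as pa

nulls = pa.array([None, None])   # NullArray, type null
print(nulls.cast(pa.int64()))    # expected: an all-null array of type int64
{code}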



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3767) [C++] Add cast for Null to any type

2019-04-24 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3767:
--
Labels: pull-request-available  (was: )

> [C++] Add cast for Null to any type
> ---
>
> Key: ARROW-3767
> URL: https://issues.apache.org/jira/browse/ARROW-3767
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Casting a column from NullType to any other type is possible as the resulting 
> array will also be all-null but simply with a different type annotation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)