[jira] [Comment Edited] (ARROW-6015) [Python] pyarrow wheel: `DLL load failed` when importing on windows

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920036#comment-16920036
 ] 

Kazuaki Ishizaki edited comment on ARROW-6015 at 8/31/19 5:57 AM:
--

I believe I have identified how to fix this issue.

Installing the {{Microsoft Visual C++ Redistributable for Visual Studio 2015, 2017 
and 2019}} from 
[https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads]
 avoids this error.
I think this problem does not occur with conda; it occurs only with pip.

The following are my validation steps. I would appreciate it if someone could 
double-check them.
{code:java}
// Install Windows10 enterprise (no additional application is installed)
> mkdir c:\pyarrow
> cd c:\pyarrow
> bitsadmin /TRANSFER htmlget 
> https://www.python.org/ftp/python/3.7.4/python-3.7.4-embed-amd64.zip 
> c:\pyarrow\python-3.7.4-embed-amd64.zip
extract all python-3.7.4-embed-amd64.zip to c:\pyarrow\python-3.7.4-embed-amd64 
from Explorer
> cd python-3.7.4-embed-amd64
notepad python37._pth
...
#import site <=== remove # in this line
> type python37._pth
python37.zip
.
 # Uncomment to run site.main() automatically
import site

> python get-pip.py
...
Successfully installed pip-19.2.3 setuptools-41.2.0 wheel-0.33.6
> python -m pip install pyarrow
C:\pyarrow\python-3.7.4-embed-amd64>python -m pip install pyarrow
Collecting pyarrow
Downloading 
https://files.pythonhosted.org/packages/97/7c/0ea4554d64c6ed3d6d4f8da492df287d2496adbab2b35c01433cf1344521/pyarrow-0.14.0-cp37-cp37m-win_amd64.whl
 (17.4MB)
...
Collecting numpy>=1.14 (from pyarrow)
Downloading 
https://files.pythonhosted.org/packages/cb/41/05fbf6944b098eb9d53e8a29a9dbfa20a7448f3254fb71499746a29a1b2d/numpy-1.17.1-cp37-cp37m-win_amd64.whl
 (12.8MB)
...
Collecting six>=1.0.0 (from pyarrow)
Downloading 
https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Installing collected packages: numpy, six, pyarrow
WARNING: The script f2py.exe is installed in 
'C:\pyarrow\python-3.7.4-embed-amd64\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this 
warning, use --no-warn-script-location.
WARNING: The script plasma_store.exe is installed in 
'C:\pyarrow\python-3.7.4-embed-amd64\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this 
warning, use --no-warn-script-location.
Successfully installed numpy-1.17.1 pyarrow-0.14.0 six-1.12.0
> python -c "import pyarrow"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File 
"C:\pyarrow\python-3.7.4-embed-amd64\lib\site-packages\pyarrow\__init__.py", 
line 49, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: DLL load failed: The specified module could not be found.
> python -m pip freeze
numpy==1.17.1
pyarrow==0.14.0
six==1.12.0
> dir Lib\site-packages\pyarrow
Volume in drive C is OS
Volume Serial Number is 1234-5678

Directory of C:\pyarrow\python-3.7.4-embed-amd64\Lib\site-packages\pyarrow

08/31/2019 05:42 AM <DIR> .
08/31/2019 05:42 AM <DIR> ..
08/31/2019 05:42 AM 47,658 array.pxi
08/31/2019 05:42 AM 5,748,736 arrow.dll
08/31/2019 05:42 AM 1,653,120 arrow.lib
08/31/2019 05:42 AM 1,795,072 arrow_flight.dll
08/31/2019 05:42 AM 121,062 arrow_flight.lib
08/31/2019 05:42 AM 910,848 arrow_python.dll
08/31/2019 05:42 AM 119,994 arrow_python.lib
08/31/2019 05:42 AM 869 benchmark.pxi
08/31/2019 05:42 AM 895 benchmark.py
08/31/2019 05:42 AM 2,774 builder.pxi
08/31/2019 05:42 AM 81,920 cares.dll
08/31/2019 05:42 AM 3,691 compat.py
08/31/2019 05:42 AM 911 csv.py
08/31/2019 05:42 AM 1,126 cuda.py
08/31/2019 05:42 AM 3,161 error.pxi
08/31/2019 05:42 AM 4,026 feather.pxi
08/31/2019 05:42 AM 7,291 feather.py
08/31/2019 05:42 AM 12,472 filesystem.py
08/31/2019 05:42 AM 1,286 flight.py
08/31/2019 05:42 AM 186,880 gandiva.cp37-win_amd64.pyd
08/31/2019 05:42 AM 791,664 gandiva.cpp
08/31/2019 05:42 AM 22,094,848 gandiva.dll
08/31/2019 05:42 AM 305,626 gandiva.lib
08/31/2019 05:42 AM 16,553 gandiva.pyx
08/31/2019 05:42 AM 7,032 hdfs.py
08/31/2019 05:42 AM <DIR> include
08/31/2019 05:42 AM <DIR> includes
08/31/2019 05:42 AM 13,995 io-hdfs.pxi
08/31/2019 05:42 AM 48,879 io.pxi
08/31/2019 05:42 AM 15,981 ipc.pxi
08/31/2019 05:42 AM 6,178 ipc.py
08/31/2019 05:42 AM 897 json.py
08/31/2019 05:42 AM 8,623 jvm.py
08/31/2019 05:42 AM 1,553,408 lib.cp37-win_amd64.pyd
08/31/2019 05:42 AM 6,756,155 lib.cpp
08/31/2019 05:42 AM 10,652 lib.pxd
08/31/2019 05:42 AM 3,570 lib.pyx
08/31/2019 05:42 AM 3,243,008 libcrypto-1_1-x64.dll
08/31/2019 05:42 AM 2,613,248 libprotobuf.dll
08/31/2019 05:42 AM 650,240 libssl-1_1-x64.dll
08/31/2019 05:42 AM 13,435 lib_api.h
08/31/2019 05:42 AM 4,724 memory.pxi
08/31/2019 05:42 AM 4,912 orc.py
08/31/2019 05:42 AM 5,789 pandas-shim.pxi
08/31/2019 05:42 AM 33,456 pandas_compat.py
...
{code}
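As a rough cross-check of the diagnosis above, one can probe whether the MSVC runtime libraries are resolvable before importing pyarrow. This is a small illustrative sketch, not part of pyarrow; the DLL names are the usual VC++ 2015-2019 runtime names, and on non-Windows machines every entry will simply report missing.

```python
import ctypes.util

# DLLs typically supplied by the Microsoft Visual C++ 2015-2019
# Redistributable; if these cannot be located on Windows, "import pyarrow"
# fails with "DLL load failed".
RUNTIME_DLLS = ["msvcp140", "vcruntime140", "concrt140"]

def check_msvc_runtime():
    """Map each runtime DLL name to whether ctypes can locate it."""
    return {name: ctypes.util.find_library(name) is not None
            for name in RUNTIME_DLLS}

if __name__ == "__main__":
    for name, found in check_msvc_runtime().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If any of these report missing on a fresh Windows install, installing the redistributable linked above should resolve the import error.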


[jira] [Resolved] (ARROW-6099) [JAVA] Has the ability to not using slf4j logging framework

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6099.

Resolution: Won't Fix

Closing for now, more discussion on the mailing list might be warranted and we 
can reopen.

> [JAVA] Has the ability to not using slf4j logging framework
> ---
>
> Key: ARROW-6099
> URL: https://issues.apache.org/jira/browse/ARROW-6099
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Haowei Yu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently, the Java library calls the slf4j API directly, with no 
> abstraction layer. This means users must install slf4j as a requirement 
> even if they don't use slf4j at all. 
>  
> It would be best to change the slf4j dependency scope to "provided" and log 
> only if an slf4j jar file is present at runtime.
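For illustration, the proposal boils down to a one-line Maven scope adjustment. This is a hypothetical pom.xml fragment (the version shown is an assumption), not the actual Arrow pom:

```xml
<!-- Sketch of the proposal: with scope "provided", slf4j-api is available at
     compile time but is no longer pulled in as a hard runtime dependency;
     users who want logging supply an slf4j binding themselves. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>1.7.25</version> <!-- hypothetical version -->
  <scope>provided</scope>
</dependency>
```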



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6099) [JAVA] Has the ability to not using slf4j logging framework

2019-08-30 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920036#comment-16920036
 ] 

Micah Kornfield commented on ARROW-6099:


See discussion on the PR; [~jacq...@dremio.com] vetoed the patch.

> [JAVA] Has the ability to not using slf4j logging framework
> ---
>
> Key: ARROW-6099
> URL: https://issues.apache.org/jira/browse/ARROW-6099
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Haowei Yu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently, the Java library calls the slf4j API directly, with no 
> abstraction layer. This means users must install slf4j as a requirement 
> even if they don't use slf4j at all. 
>  
> It would be best to change the slf4j dependency scope to "provided" and log 
> only if an slf4j jar file is present at runtime.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6247) [Java] Provide a common interface for float4 and float8 vectors

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6247.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5132
[https://github.com/apache/arrow/pull/5132]

> [Java] Provide a common interface for float4 and float8 vectors
> ---
>
> Key: ARROW-6247
> URL: https://issues.apache.org/jira/browse/ARROW-6247
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We want to provide an interface for floating point vectors (float4 & float8). 
> This interface will make many operations on a vector convenient. With 
> this interface, the client code will be greatly simplified, with many 
> branches/switches removed.
>  
> The design is similar to BaseIntVector (the interface for all integer 
> vectors). We provide 3 methods for setting & getting floating point values:
>  setWithPossibleTruncate
>  setSafeWithPossibleTruncate
>  getValueAsDouble
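The idea behind the three methods can be illustrated with a small Python sketch (the real interface is Java, analogous to BaseIntVector; the class name here is illustrative, and the safe variant that grows the vector is folded into the one setter):

```python
import struct

# Illustrative sketch only: the actual interface lives in Arrow's Java
# vector package. A float4 vector stores 32-bit values, so setting a
# double may truncate precision; reads always widen back to double.
class FloatingPointVector:
    def __init__(self, width):
        self.width = width          # 4 (float4) or 8 (float8) bytes
        self.values = []

    def set_with_possible_truncate(self, index, value):
        if self.width == 4:
            # Emulate float4 storage by round-tripping through 32 bits.
            value = struct.unpack("f", struct.pack("f", value))[0]
        while len(self.values) <= index:
            self.values.append(0.0)
        self.values[index] = value

    def get_value_as_double(self, index):
        return self.values[index]
```

A caller can now treat float4 and float8 vectors uniformly, e.g. summing `get_value_as_double` over either width without a branch or switch.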



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6031) [Java] Support iterating a vector by ArrowBufPointer

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6031.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 4950
[https://github.com/apache/arrow/pull/4950]

> [Java] Support iterating a vector by ArrowBufPointer
> 
>
> Key: ARROW-6031
> URL: https://issues.apache.org/jira/browse/ARROW-6031
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Provide the functionality to traverse a vector (fixed-width vector & 
> variable-width vector) by an iterator. This is convenient for scenarios where 
> vector elements are accessed in sequence.
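The idea can be sketched in Python (illustrative only; the real feature is a Java iterator yielding ArrowBufPointer views): a variable-width vector is a data buffer plus an offsets buffer, and iterating "by pointer" yields zero-copy views rather than materialized elements.

```python
def iter_pointers(data, offsets):
    """Yield zero-copy memoryview slices of `data`, one per element,
    as described by the offsets buffer."""
    view = memoryview(data)
    for start, end in zip(offsets, offsets[1:]):
        yield view[start:end]

# Three variable-width elements packed into one buffer.
elements = [bytes(p) for p in iter_pointers(b"foobarbaz", [0, 3, 6, 9])]
# elements == [b"foo", b"bar", b"baz"]
```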



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6397) [C++][CI] Fix S3 minio failure

2019-08-30 Thread Sutou Kouhei (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-6397.
-
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5238
[https://github.com/apache/arrow/pull/5238]

> [C++][CI] Fix S3 minio failure
> --
>
> Key: ARROW-6397
> URL: https://issues.apache.org/jira/browse/ARROW-6397
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Continuous Integration
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See 
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-4095) [C++] Implement optimizations for dictionary unification where dictionaries are prefixes of the unified dictionary

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-4095.

Resolution: Fixed

Issue resolved by pull request 5230
[https://github.com/apache/arrow/pull/5230]

> [C++] Implement optimizations for dictionary unification where dictionaries 
> are prefixes of the unified dictionary
> --
>
> Key: ARROW-4095
> URL: https://issues.apache.org/jira/browse/ARROW-4095
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the event that the unified dictionary contains other dictionaries as 
> prefixes (e.g. as the result of delta dictionaries), we can avoid memory 
> allocation and index transposition.
> See discussion at 
> https://github.com/apache/arrow/pull/3165#discussion_r243020982
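The optimization can be sketched as follows (Python used only for illustration; the actual implementation is C++ in Arrow's dictionary unification code): when an input dictionary turns out to be a prefix of the unified dictionary, its indices already agree with the unified ones, so the transpose buffer can be skipped entirely.

```python
def unify(dictionaries):
    """Unify dictionaries; return (unified, transpose maps).
    A transpose map of None means "indices can be reused as-is"."""
    unified = []
    seen = {}
    transposes = []
    for d in dictionaries:
        transpose = []
        for value in d:
            if value not in seen:
                seen[value] = len(unified)
                unified.append(value)
            transpose.append(seen[value])
        # Prefix case (e.g. the result of delta dictionaries): the mapping
        # is the identity, so skip allocating/applying a transpose buffer.
        if transpose == list(range(len(transpose))):
            transpose = None
        transposes.append(transpose)
    return unified, transposes
```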



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6220) [Java] Add API to avro adapter to limit number of rows returned at a time.

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6220:
--
Labels: avro pull-request-available  (was: avro)

> [Java] Add API to avro adapter to limit number of rows returned at a time.
> --
>
> Key: ARROW-6220
> URL: https://issues.apache.org/jira/browse/ARROW-6220
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: avro, pull-request-available
>
> We can either let clients iterate or ideally provide an iterator interface.  
> This is important for large avro data and was also discussed as something 
> readers/adapters should have.
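The iterator idea can be sketched generically (Python for illustration; the actual adapter is Java): the reader hands back at most N rows per call instead of materializing the whole Avro input at once.

```python
from itertools import islice

def iter_batches(records, rows_per_batch):
    """Yield lists of at most `rows_per_batch` records at a time, so a
    caller never holds the full (possibly huge) input in memory."""
    it = iter(records)
    while True:
        batch = list(islice(it, rows_per_batch))
        if not batch:
            return
        yield batch

# Five records consumed two at a time:
# list(iter_batches(range(5), 2)) == [[0, 1], [2, 3], [4]]
```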



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6119) [Python] PyArrow wheel import fails on Windows Python 3.7

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920012#comment-16920012
 ] 

Kazuaki Ishizaki commented on ARROW-6119:
-

In my environment, I can reproduce this error using 0.13.0 with embeddable 
Python on a freshly installed Windows 10 (i.e. with no additional applications 
installed). Does anyone see the failure in 0.13.0?

On the other hand, I can successfully import pyarrow 0.14.1 in Miniconda via 
{{conda install}}.


> [Python] PyArrow wheel import fails on Windows Python 3.7
> -
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>  Labels: wheel
> Fix For: 0.15.0
>
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> <module>
>     from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6402) [C++] Suppress sign-compare warning with g++ 9.2.1

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6402:
--
Labels: pull-request-available  (was: )

> [C++] Suppress sign-compare warning with g++ 9.2.1
> --
>
> Key: ARROW-6402
> URL: https://issues.apache.org/jira/browse/ARROW-6402
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
>
> {noformat}
> ../src/arrow/array/builder_union.cc: In constructor 
> 'arrow::BasicUnionBuilder::BasicUnionBuilder(arrow::MemoryPool*, 
> arrow::UnionMode::type, const 
> std::vector<std::shared_ptr<arrow::ArrayBuilder> >&, const 
> std::shared_ptr<arrow::DataType>&)':
> ../src/arrow/util/logging.h:86:55: error: comparison of integer 
> expressions of different signedness: 
> 'std::vector<std::shared_ptr<arrow::ArrayBuilder> >::size_type' {aka 'long 
> unsigned int'} and 'signed char' [-Werror=sign-compare]
>86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
>   |~~~^~~~
> ../src/arrow/util/macros.h:43:52: note: in definition of macro 
> 'ARROW_PREDICT_TRUE'
>43 | #define ARROW_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
>   |^
> ../src/arrow/util/logging.h:86:36: note: in expansion of macro 
> 'ARROW_CHECK'
>86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
>   |^~~
> ../src/arrow/util/logging.h:135:19: note: in expansion of macro 
> 'ARROW_CHECK_LT'
>   135 | #define DCHECK_LT ARROW_CHECK_LT
>   |   ^~
> ../src/arrow/array/builder_union.cc:79:3: note: in expansion of macro 
> 'DCHECK_LT'
>79 |   DCHECK_LT(type_id_to_children_.size(), 
> std::numeric_limits<int8_t>::max());
>   |   ^
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-6119) [Python] PyArrow wheel import fails on Windows Python 3.7

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919906#comment-16919906
 ] 

Kazuaki Ishizaki edited comment on ARROW-6119 at 8/31/19 4:08 AM:
--

I can reproduce this error using 0.14.0 and 0.14.1 through pip with embeddable 
Python on a freshly installed Windows 10 (i.e. with no additional applications 
installed).

I will try it with conda tomorrow.


was (Author: kiszk):
I can reproduce this error using 0.14.0 and 0.14.1 with embeddable Python on 
Windows 10 that have been just installed (i.e. install no application).

I will try it with conda tomorrow.

> [Python] PyArrow wheel import fails on Windows Python 3.7
> -
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>  Labels: wheel
> Fix For: 0.15.0
>
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> <module>
>     from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6402) [C++] Suppress sign-compare warning with g++ 9.2.1

2019-08-30 Thread Sutou Kouhei (Jira)
Sutou Kouhei created ARROW-6402:
---

 Summary: [C++] Suppress sign-compare warning with g++ 9.2.1
 Key: ARROW-6402
 URL: https://issues.apache.org/jira/browse/ARROW-6402
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei


{noformat}
../src/arrow/array/builder_union.cc: In constructor 
'arrow::BasicUnionBuilder::BasicUnionBuilder(arrow::MemoryPool*, 
arrow::UnionMode::type, const std::vector<std::shared_ptr<arrow::ArrayBuilder> 
>&, const std::shared_ptr<arrow::DataType>&)':
../src/arrow/util/logging.h:86:55: error: comparison of integer expressions 
of different signedness: 'std::vector<std::shared_ptr<arrow::ArrayBuilder> 
>::size_type' {aka 'long unsigned int'} and 'signed char' [-Werror=sign-compare]
   86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
  |~~~^~~~
../src/arrow/util/macros.h:43:52: note: in definition of macro 
'ARROW_PREDICT_TRUE'
   43 | #define ARROW_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
  |^
../src/arrow/util/logging.h:86:36: note: in expansion of macro 'ARROW_CHECK'
   86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
  |^~~
../src/arrow/util/logging.h:135:19: note: in expansion of macro 
'ARROW_CHECK_LT'
  135 | #define DCHECK_LT ARROW_CHECK_LT
  |   ^~
../src/arrow/array/builder_union.cc:79:3: note: in expansion of macro 
'DCHECK_LT'
   79 |   DCHECK_LT(type_id_to_children_.size(), 
std::numeric_limits<int8_t>::max());
  |   ^
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6265) [Java] Avro adapter implement Array/Map/Fixed type

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6265.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5099
[https://github.com/apache/arrow/pull/5099]

> [Java] Avro adapter implement Array/Map/Fixed type
> --
>
> Key: ARROW-6265
> URL: https://issues.apache.org/jira/browse/ARROW-6265
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> Support Array/Map/Fixed type in avro adapter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-2769) [C++][Python] Deprecate and rename add_metadata methods

2019-08-30 Thread Sutou Kouhei (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-2769.
-
Resolution: Fixed

Issue resolved by pull request 5232
[https://github.com/apache/arrow/pull/5232]

> [C++][Python] Deprecate and rename add_metadata methods
> ---
>
> Key: ARROW-2769
> URL: https://issues.apache.org/jira/browse/ARROW-2769
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Deprecate and replace `pyarrow.Field.add_metadata` (and other likely named 
> methods) with replace_metadata, set_metadata or with_metadata. Knowing 
> Spark's immutable API, I would have chosen with_metadata but I guess this is 
> probably not what the average Python user would expect as naming.
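The deprecate-and-rename pattern under discussion can be sketched like this (illustrative only; pyarrow's real Field is implemented in Cython and the final method names were decided on the PR):

```python
import warnings

class Field:
    """Minimal stand-in for an immutable metadata-carrying object."""

    def __init__(self, name, metadata=None):
        self.name = name
        self.metadata = dict(metadata or {})

    def with_metadata(self, metadata):
        """New name: return a copy with `metadata` replacing the old."""
        return Field(self.name, metadata)

    def add_metadata(self, metadata):
        """Old name, kept as a deprecated alias for one release cycle."""
        warnings.warn("add_metadata is deprecated; use with_metadata",
                      DeprecationWarning, stacklevel=2)
        return self.with_metadata(metadata)
```

The old name keeps working but warns, giving users time to migrate before removal.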



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-2769) [C++][Python] Deprecate and rename add_metadata methods

2019-08-30 Thread Sutou Kouhei (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei updated ARROW-2769:

Component/s: C++

> [C++][Python] Deprecate and rename add_metadata methods
> ---
>
> Key: ARROW-2769
> URL: https://issues.apache.org/jira/browse/ARROW-2769
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Deprecate and replace `pyarrow.Field.add_metadata` (and other likely named 
> methods) with replace_metadata, set_metadata or with_metadata. Knowing 
> Spark's immutable API, I would have chosen with_metadata but I guess this is 
> probably not what the average Python user would expect as naming.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-2769) [C++][Python] Deprecate and rename add_metadata methods

2019-08-30 Thread Sutou Kouhei (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei updated ARROW-2769:

Summary: [C++][Python] Deprecate and rename add_metadata methods  (was: 
[Python] Deprecate and rename add_metadata methods)

> [C++][Python] Deprecate and rename add_metadata methods
> ---
>
> Key: ARROW-2769
> URL: https://issues.apache.org/jira/browse/ARROW-2769
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Deprecate and replace `pyarrow.Field.add_metadata` (and other likely named 
> methods) with replace_metadata, set_metadata or with_metadata. Knowing 
> Spark's immutable API, I would have chosen with_metadata but I guess this is 
> probably not what the average Python user would expect as naming.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6094) [Format][Flight] Add GetFlightSchema to Flight RPC

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6094.

Resolution: Fixed

Issue resolved by pull request 4980
[https://github.com/apache/arrow/pull/4980]

> [Format][Flight] Add GetFlightSchema to Flight RPC
> --
>
> Key: ARROW-6094
> URL: https://issues.apache.org/jira/browse/ARROW-6094
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, FlightRPC, Java, Python
>Reporter: Ryan Murray
>Assignee: Ryan Murray
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Implement GetFlightSchema as per 
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
> and 
> https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-4668) [C++] Support GCP BigQuery Storage API

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-4668:
--

Assignee: (was: Micah Kornfield)

> [C++] Support GCP BigQuery Storage API
> --
>
> Key: ARROW-4668
> URL: https://issues.apache.org/jira/browse/ARROW-4668
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: filesystem
> Fix For: 1.0.0
>
>
> Docs: [https://cloud.google.com/bigquery/docs/reference/storage/] 
> Need to investigate the best way to do this; maybe just see if we can build 
> our client on GCP (once a protobuf definition is published to 
> https://github.com/googleapis/googleapis/tree/master/google)?
>  
> This will serve as a parent issue, and sub-issues will be added for subtasks 
> if necessary.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4668) [C++] Support GCP BigQuery Storage API

2019-08-30 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1691#comment-1691
 ] 

Micah Kornfield commented on ARROW-4668:


Wes is correct.  I'll also add that either this (or even a higher level wrapper 
around BQ) or flight would make a good test case for DataSet APIs to make sure 
they are generic enough.  I won't be getting to this anytime soon, so I'm going 
to unassign it from myself.  I have some sample code on my work computer that I 
will also try to share to show how the API can be accessed in a simple scenario.

> [C++] Support GCP BigQuery Storage API
> --
>
> Key: ARROW-4668
> URL: https://issues.apache.org/jira/browse/ARROW-4668
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: filesystem
> Fix For: 1.0.0
>
>
> Docs: [https://cloud.google.com/bigquery/docs/reference/storage/] 
> Need to investigate the best way to do this; maybe just see if we can build 
> our client on GCP (once a protobuf definition is published to 
> https://github.com/googleapis/googleapis/tree/master/google)?
>  
> This will serve as a parent issue, and sub-issues will be added for subtasks 
> if necessary.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6401) [Java] Implement dictionary-encoded subfields for Struct type

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6401:
--
Labels: pull-request-available  (was: )

> [Java] Implement dictionary-encoded subfields for Struct type
> -
>
> Key: ARROW-6401
> URL: https://issues.apache.org/jira/browse/ARROW-6401
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
>
> Implement dictionary-encoded subfields for Struct type.
> Each child vector will have a dictionary; the dictionary vector is of struct 
> type and holds all the dictionaries.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6401) [Java] Implement dictionary-encoded subfields for Struct type

2019-08-30 Thread Ji Liu (Jira)
Ji Liu created ARROW-6401:
-

 Summary: [Java] Implement dictionary-encoded subfields for Struct 
type
 Key: ARROW-6401
 URL: https://issues.apache.org/jira/browse/ARROW-6401
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Implement dictionary-encoded subfields for Struct type.

Each child vector will have a dictionary; the dictionary vector is of struct 
type and holds all the dictionaries.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6078) [Java] Implement dictionary-encoded subfields for List type

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6078.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 4972
[https://github.com/apache/arrow/pull/4972]

> [Java] Implement dictionary-encoded subfields for List type
> ---
>
> Key: ARROW-6078
> URL: https://issues.apache.org/jira/browse/ARROW-6078
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> For example, int type List (valueCount = 5) has data like below:
> 10, 20
> 10, 20
> 30, 40, 50
> 30, 40, 50
> 10, 20
> could be encoded to:
> 0, 1
> 0, 1
> 2, 3, 4
> 2, 3, 4
> 0, 1
> with list type dictionary
> 10, 20, 30, 40, 50
> or
> 10,
> 20,
> 30,
> 40,
> 50
>  
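As a plain-Python sketch (not the Java implementation), the encoding described above can be expressed as building one shared dictionary of distinct flat values and replacing each list element with its dictionary index:

```python
# Minimal sketch of dictionary-encoding the flat values of a list vector.
values = [[10, 20], [10, 20], [30, 40, 50], [30, 40, 50], [10, 20]]

dictionary = []   # distinct values, in first-seen order
index_of = {}     # value -> dictionary index
encoded = []      # lists of indices instead of lists of values

for lst in values:
    row = []
    for v in lst:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        row.append(index_of[v])
    encoded.append(row)

# encoded    == [[0, 1], [0, 1], [2, 3, 4], [2, 3, 4], [0, 1]]
# dictionary == [10, 20, 30, 40, 50]
```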



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API

2019-08-30 Thread Saurabh Bajaj (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919936#comment-16919936
 ] 

Saurabh Bajaj commented on ARROW-5922:
--

Try setting the environment variable ARROW_LIBHDFS_DIR to the explicit location 
of libhdfs.so at the worker nodes. That's what worked for me. 
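For reference, a minimal sketch of that workaround; the library path below is an assumption, so adjust it to wherever libhdfs.so actually lives on your worker nodes:

```python
import os

# Hypothetical libhdfs.so location; adjust for your Hadoop distribution.
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/current/hadoop-hdfs-client/lib"

# With the variable set before connecting, pyarrow can locate libhdfs:
# import pyarrow as pa
# fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4",
#                      extra_conf={"hadoop.security.authentication": "kerberos"})
```

The key point is that the variable must be set in the environment of the worker process before `pa.hdfs.connect` is called.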

> [Python] Unable to connect to HDFS from a worker/data node on a Kerberized 
> cluster using pyarrow' hdfs API
> --
>
> Key: ARROW-5922
> URL: https://issues.apache.org/jira/browse/ARROW-5922
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Unix
>Reporter: Saurabh Bajaj
>Priority: Major
> Fix For: 0.14.0
>
>
> Here's what I'm trying:
> {{```}}
> {{import pyarrow as pa }}
> {{conf = \{"hadoop.security.authentication": "kerberos"} }}
> {{fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)}}
> {{```}}
> However, when I submit this job to the cluster using {{Dask-YARN}}, I get the 
> following error:
> ```
> {{File "test/run.py", line 3 fs = 
> pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf) File 
> "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py",
>  line 211, in connect File 
> "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py",
>  line 38, in __init__ File "pyarrow/io-hdfs.pxi", line 105, in 
> pyarrow.lib.HadoopFileSystem._connect File "pyarrow/error.pxi", line 83, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS connection failed}}
> {{```}}
> I also tried setting {{host (to a name node)}} and {{port (=8020)}}, however 
> I run into the same error. Since the error is not descriptive, I'm not sure 
> which setting needs to be altered. Any clues anyone?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6119) [Python] PyArrow wheel import fails on Windows Python 3.7

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919906#comment-16919906
 ] 

Kazuaki Ishizaki commented on ARROW-6119:
-

I can reproduce this error using 0.14.0 and 0.14.1 with embeddable Python on a 
freshly installed Windows 10 (i.e. with no additional applications installed).

I will try it with conda tomorrow.

> [Python] PyArrow wheel import fails on Windows Python 3.7
> -
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>  Labels: wheel
> Fix For: 0.15.0
>
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6400) Arrow Java Library Build Error

2019-08-30 Thread Tanveer (Jira)
Tanveer created ARROW-6400:
--

 Summary: Arrow Java Library Build Error
 Key: ARROW-6400
 URL: https://issues.apache.org/jira/browse/ARROW-6400
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 0.14.1
Reporter: Tanveer
 Attachments: Screenshot from 2019-08-30 23-16-25.png, Screenshot from 
2019-08-30 23-44-34.png

The Arrow Java library fails to build on both the 'master' and 'maint-0.14.x' 
branches.

Please see the attachments.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6310) [C++] Write 64-bit integers as strings in JSON integration test files

2019-08-30 Thread Benjamin Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman reassigned ARROW-6310:


Assignee: Benjamin Kietzman

> [C++] Write 64-bit integers as strings in JSON integration test files
> -
>
> Key: ARROW-6310
> URL: https://issues.apache.org/jira/browse/ARROW-6310
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
> Fix For: 0.15.0
>
>
> C++ side of ARROW-1875
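For background, a small sketch of why 64-bit integers need to be written as strings in the JSON integration files: many JSON consumers (notably JavaScript) parse numbers as IEEE doubles, which cannot represent every 64-bit integer exactly, while strings round-trip losslessly.

```python
import json

big = 2**63 - 1  # INT64_MAX

# Round-tripping through a double (as a JavaScript JSON parser would do)
# silently loses precision:
assert int(float(big)) != big

# Serialized as a string, the value survives the round trip exactly:
assert int(json.loads(json.dumps(str(big)))) == big
```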



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6392) [Python][Flight] list_actions Server RPC is not tested in test_flight.py, nor is return value validated

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6392:
--
Labels: pull-request-available  (was: )

> [Python][Flight] list_actions Server RPC is not tested in test_flight.py, nor 
> is return value validated
> ---
>
> Key: ARROW-6392
> URL: https://issues.apache.org/jira/browse/ARROW-6392
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> This server method is implemented and part of the Python server vtable, but 
> it is not tested. If you mistakenly return a "string" action type, it will 
> pass silently. We might want to constrain the output to be ActionType or a 
> tuple.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6015) [Python] pyarrow wheel: `DLL load failed` when importing on windows

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919876#comment-16919876
 ] 

Kazuaki Ishizaki commented on ARROW-6015:
-

I see. Thank you for your quick response. It looks more complex than I thought.

Have we already identified which libraries are missing when this failure 
occurs, or haven't we identified them yet?

> [Python] pyarrow wheel:  `DLL load failed` when importing on windows
> 
>
> Key: ARROW-6015
> URL: https://issues.apache.org/jira/browse/ARROW-6015
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Affects Versions: 0.14.1
>Reporter: Ruslan Kuprieiev
>Priority: Major
>  Labels: wheel
> Fix For: 0.15.0
>
>
> When installing pyarrow 0.14.1 on windows 10 x64 with python 3.7, you get:
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
>     from pyarrow.lib import cpu_count, set_cpu_count
>   ImportError: DLL load failed: The specified module could not be found.
>  On 0.14.0 everything works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6015) [Python] pyarrow wheel: `DLL load failed` when importing on windows

2019-08-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919847#comment-16919847
 ] 

Antoine Pitrou commented on ARROW-6015:
---

This kind of issue depends on which DLLs are already installed on your system. 
So if the wheel is missing e.g. some compression libraries (such as zstd or 
brotli) but you already have them on your system, the wheel will work fine for 
you. This is also what makes it difficult to ensure that Windows wheels are 
generated correctly...

> [Python] pyarrow wheel:  `DLL load failed` when importing on windows
> 
>
> Key: ARROW-6015
> URL: https://issues.apache.org/jira/browse/ARROW-6015
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Affects Versions: 0.14.1
>Reporter: Ruslan Kuprieiev
>Priority: Major
>  Labels: wheel
> Fix For: 0.15.0
>
>
> When installing pyarrow 0.14.1 on windows 10 x64 with python 3.7, you get:
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
>     from pyarrow.lib import cpu_count, set_cpu_count
>   ImportError: DLL load failed: The specified module could not be found.
>  On 0.14.0 everything works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6015) [Python] pyarrow wheel: `DLL load failed` when importing on windows

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919843#comment-16919843
 ] 

Kazuaki Ishizaki commented on ARROW-6015:
-

I cannot reproduce this issue on my Windows 10 environment by using two pythons 
(conda and python) with [this 
whl|https://github.com/ursa-labs/crossbow/releases/download/build-669-appveyor-wheel-win-cp37m/pyarrow-0.14.1-cp37-cp37m-win_amd64.whl]
Do I miss something to reproduce this failure?

{code:java}
$ wget https://www.python.org/ftp/python/3.7.4/python-3.7.4-embed-amd64.zip
$ unzip python-3.7.4-embed-amd64.zip
$ cd python-3.7.4-embed-amd64
$ wget https://bootstrap.pypa.io/get-pip.py
$ python get-pip.py
$ wget pyarrow-0.14.1-cp37-cp37m-win_amd64.whl
$ python -m pip install pyarrow-0.14.1-cp37-cp37m-win_amd64.whl
...
Successfully installed numpy-1.17.1 pyarrow-0.14.1 six-1.12.0
$ python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit 
(AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> print (pyarrow.cpu_count())
4
>>>
{code}
{code:java}
$ activate arrow-dev
$ wget pyarrow-0.14.1-cp37-cp37m-win_amd64.whl
$ pip install pyarrow-0.14.1-cp37-cp37m-win_amd64.whl
...
Installing collected packages: pyarrow
Successfully installed pyarrow-0.14.1
>python
Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 22:01:29) [MSC 
v.1900 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> print (pyarrow.cpu_count())
4
>>>
{code}

> [Python] pyarrow wheel:  `DLL load failed` when importing on windows
> 
>
> Key: ARROW-6015
> URL: https://issues.apache.org/jira/browse/ARROW-6015
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Affects Versions: 0.14.1
>Reporter: Ruslan Kuprieiev
>Priority: Major
>  Labels: wheel
> Fix For: 0.15.0
>
>
> When installing pyarrow 0.14.1 on windows 10 x64 with python 3.7, you get:
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
>     from pyarrow.lib import cpu_count, set_cpu_count
>   ImportError: DLL load failed: The specified module could not be found.
>  On 0.14.0 everything works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6399) [C++] More extensive attributes usage could improve debugging

2019-08-30 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6399:


 Summary: [C++] More extensive attributes usage could improve 
debugging
 Key: ARROW-6399
 URL: https://issues.apache.org/jira/browse/ARROW-6399
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Benjamin Kietzman


Wrapping raw or smart pointer parameters and other declarations in 
{{gsl::not_null}} will assert that they are not null. The check is dropped in 
release builds.

Status is tagged with ARROW_MUST_USE_RESULT, which emits warnings when a 
Status might be ignored (when compiling with clang); Result<> should probably 
be tagged with this too.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-5300) [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL

2019-08-30 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-5300:
-

Assignee: Francois Saint-Jacques

> [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL
> -
>
> Key: ARROW-5300
> URL: https://issues.apache.org/jira/browse/ARROW-5300
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Weihua Jiang
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.15.0
>
>
> I tried to upgrade Apache Arrow to 0.13. But, when building Apache Arrow 0.13 
> with option {{-DARROW_NO_DEFAULT_MEMORY_POOL}}, I got a lot of failures.
> It seems 0.13 assumes the default memory pool is always available.
>  
> My cmake command is:
> {{cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=off 
> -DARROW_USE_GLOG=off -DARROW_WITH_LZ4=off -DARROW_WITH_ZSTD=off 
> -DARROW_WITH_SNAPPY=off -DARROW_WITH_BROTLI=off -DARROW_WITH_ZLIB=off 
> -DARROW_JEMALLOC=off -DARROW_CXXFLAGS=-DARROW_NO_DEFAULT_MEMORY_POOL}}
> I tried to fix the compilation by adding some missing constructors. However, 
> it seems this issue is bigger than I expected. It seems all the builders and 
> appenders have this issue, as many classes don't even have an associated 
> memory pool. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5300) [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5300:
--
Labels: pull-request-available  (was: )

> [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL
> -
>
> Key: ARROW-5300
> URL: https://issues.apache.org/jira/browse/ARROW-5300
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Weihua Jiang
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> I tried to upgrade Apache Arrow to 0.13. But, when building Apache Arrow 0.13 
> with option {{-DARROW_NO_DEFAULT_MEMORY_POOL}}, I got a lot of failures.
> It seems 0.13 assumes the default memory pool is always available.
>  
> My cmake command is:
> {{cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=off 
> -DARROW_USE_GLOG=off -DARROW_WITH_LZ4=off -DARROW_WITH_ZSTD=off 
> -DARROW_WITH_SNAPPY=off -DARROW_WITH_BROTLI=off -DARROW_WITH_ZLIB=off 
> -DARROW_JEMALLOC=off -DARROW_CXXFLAGS=-DARROW_NO_DEFAULT_MEMORY_POOL}}
> I tried to fix the compilation by adding some missing constructors. However, 
> it seems this issue is bigger than I expected. It seems all the builders and 
> appenders have this issue, as many classes don't even have an associated 
> memory pool. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray

2019-08-30 Thread Benjamin Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman updated ARROW-3762:
-
Description: 
# When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
due to it not creating chunked arrays. Reading each row group individually and 
then concatenating the tables works, however.

 
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
t = pa.Table.from_arrays([x], ['x'])
writer = pq.ParquetWriter(demo, t.schema)
for i in range(2):
writer.write_table(t)
writer.close()

pf = pq.ParquetFile(demo)

# pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
contain more than 2147483646 bytes, have 2147483647
t2 = pf.read()

# Works, but note, there are 32 row groups, not 2 as suggested by:
# 
https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
t3 = pa.concat_tables(tables)

scenario()
{code}

  was:
When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
due to it not creating chunked arrays. Reading each row group individually and 
then concatenating the tables works, however.

 
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
t = pa.Table.from_arrays([x], ['x'])
writer = pq.ParquetWriter(demo, t.schema)
for i in range(2):
writer.write_table(t)
writer.close()

pf = pq.ParquetFile(demo)

# pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
contain more than 2147483646 bytes, have 2147483647
t2 = pf.read()

# Works, but note, there are 32 row groups, not 2 as suggested by:
# 
https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
t3 = pa.concat_tables(tables)

scenario()
{code}


> [C++] Parquet arrow::Table reads error when overflowing capacity of 
> BinaryArray
> ---
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Chris Ellison
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0, 0.15.0
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> # When reading a parquet file with binary data > 2 GiB, we get an 
> ArrowIOError due to it not creating chunked arrays. Reading each row group 
> individually and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6396) [C++] Add CompareOptions to Compare kernels

2019-08-30 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919757#comment-16919757
 ] 

Wes McKinney commented on ARROW-6396:
-

FWIW I wasn't familiar with the "Kleene" terminology

> [C++] Add CompareOptions to Compare kernels
> ---
>
> Key: ARROW-6396
> URL: https://issues.apache.org/jira/browse/ARROW-6396
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE } to 
> define the behavior of merging with AND/OR operators on boolean.
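For readers unfamiliar with the terminology, here is a minimal plain-Python sketch (using None for null; this is an illustration, not Arrow's implementation) of the two null-handling behaviors the proposed enum would select between for boolean AND:

```python
def and_null_propagate(a, b):
    # NULL_PROPAGATE: any null input makes the result null.
    if a is None or b is None:
        return None
    return a and b

def and_kleene(a, b):
    # KLEENE_LOGIC: False AND anything is False, even if the other side
    # is null; the result is null only when it genuinely cannot be decided.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return a and b
```

The difference shows up only when one operand is null and the other is False: Kleene logic returns False, while null propagation returns null.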



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3571) [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions

2019-08-30 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919756#comment-16919756
 ] 

Wes McKinney commented on ARROW-3571:
-

I'm looking at the release management guide and it says

"Setup crossbow as described in its README"

So I think we can merge the Sphinx port of the README and then update the wiki 
page

> [Wiki] Release management guide does not explain how to set up Crossbow or 
> where to find instructions
> -
>
> Key: ARROW-3571
> URL: https://issues.apache.org/jira/browse/ARROW-3571
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>
> If you follow the guide, at one point it says "Launch a Crossbow build" but 
> provides no link to the setup instructions for this



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6312) [C++] Declare required Libs.private in arrow.pc package config

2019-08-30 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919753#comment-16919753
 ] 

Wes McKinney commented on ARROW-6312:
-

Would you like to update your PR to make a change for 0.15.0?

> [C++] Declare required Libs.private in arrow.pc package config
> --
>
> Key: ARROW-6312
> URL: https://issues.apache.org/jira/browse/ARROW-6312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Michael Maguire
>Assignee: Michael Maguire
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The current arrow.pc package config file produced is deficient: it doesn't 
> properly declare the static library prerequisites that must be linked in 
> order to *statically* link libarrow.a.
> Currently it just has:
> ```
>  Libs: -L${libdir} -larrow
> ```
> But in cases, e.g. where you enabled snappy, brotli or zlib support in arrow, 
> our toolchains need to see an arrow.pc file something more like:
> ```
>  Libs: -L${libdir} -larrow
>  Libs.private: -lsnappy -lboost_system -lz -llz4 -lbrotlidec -lbrotlienc 
> -lbrotlicommon -lzstd
> ```
> If not, we get linkage errors.  I'm told the convention is that if the .a has 
> an UNDEF, the Requires.private plus the Libs.private should resolve all the 
> undefs. See the Libs.private info in [https://linux.die.net/man/1/pkg-config]
>  
> Note, however, as Sutou Kouhei pointed out in 
> [https://github.com/apache/arrow/pull/5123#issuecomment-522771452,] the 
> additional Libs.private need to be dynamically generated based on whether 
> functionality like snappy, brotli or zlib is enabled.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight

2019-08-30 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919745#comment-16919745
 ] 

Wes McKinney commented on ARROW-6390:
-

I'll try to put together a documentation skeleton for using Flight from Python

> [Python][Flight] Add Python documentation / tutorial for Flight
> ---
>
> Key: ARROW-6390
> URL: https://issues.apache.org/jira/browse/ARROW-6390
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> There is no Sphinx documentation for using Flight from Python. I have found 
> that writing documentation is an effective way to uncover usability problems 
> -- I would suggest we write comprehensive documentation for using Flight from 
> Python as a way to refine the public Python API



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight

2019-08-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6390:
---

Assignee: Wes McKinney

> [Python][Flight] Add Python documentation / tutorial for Flight
> ---
>
> Key: ARROW-6390
> URL: https://issues.apache.org/jira/browse/ARROW-6390
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> There is no Sphinx documentation for using Flight from Python. I have found 
> that writing documentation is an effective way to uncover usability problems 
> -- I would suggest we write comprehensive documentation for using Flight from 
> Python as a way to refine the public Python API



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5995) [Python] pyarrow: hdfs: support file checksum

2019-08-30 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919742#comment-16919742
 ] 

Wes McKinney commented on ARROW-5995:
-

Can you invoke {{hdfs dfs -checksum}} using a system call to obtain the value? 
It would only work if the {{hdfs}} CLI tool is configured correctly to access 
your cluster
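A minimal sketch of that system-call approach (it assumes the `hdfs` command is on PATH and configured for the cluster, and that the output is the usual tab-separated path/algorithm/digest line):

```python
import subprocess

def parse_checksum_line(line):
    # `hdfs dfs -checksum` typically prints: <path>\t<algorithm>\t<hex digest>
    path, algorithm, digest = line.strip().split("\t")
    return {"path": path, "algorithm": algorithm, "checksum": digest}

def hdfs_checksum(path):
    """Fetch the checksum for `path` by shelling out to the hdfs CLI."""
    result = subprocess.run(
        ["hdfs", "dfs", "-checksum", path],
        capture_output=True, text=True, check=True,
    )
    return parse_checksum_line(result.stdout)
```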

> [Python] pyarrow: hdfs: support file checksum
> -
>
> Key: ARROW-5995
> URL: https://issues.apache.org/jira/browse/ARROW-5995
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ruslan Kuprieiev
>Priority: Minor
>
> I was not able to find how to retrieve the checksum (`getFileChecksum` or 
> `hadoop fs/dfs -checksum`) for a file on HDFS. Judging by how it is 
> implemented in the hadoop CLI [1], it looks like we will also need to 
> implement it manually in pyarrow. Please correct me if I'm missing 
> something. Is this feature desirable? Or was there a good reason why it 
> wasn't implemented already?
>  [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6398) [C++] consolidate ScanOptions and ScanContext

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6398:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++] consolidate ScanOptions and ScanContext
> -
>
> Key: ARROW-6398
> URL: https://issues.apache.org/jira/browse/ARROW-6398
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>  Labels: dataset, pull-request-available
>
> Currently ScanOptions has two distinct responsibilities: it contains the data 
> selector (and eventually projection schema) for the current scan and it 
> serves as the base class for format specific scan options.
> In addition, we have ScanContext which holds the memory pool for the current 
> scan.
> I think these classes should be rearranged as follows: ScanOptions will be 
> removed and FileScanOptions will be the abstract base class for format 
> specific scan options. ScanContext will be a concrete struct and contain the 
> data selector, projection schema, a vector of FileScanOptions, and any other 
> shared scan state.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API

2019-08-30 Thread Ben Schreck (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919722#comment-16919722
 ] 

Ben Schreck commented on ARROW-5922:


I am getting the same error as you. I noticed that, for me, the error is in 
Java: under the hood, pyarrow tries to load the HDFS Java class and can't find 
it. I can't figure out how to fix it, though...

> [Python] Unable to connect to HDFS from a worker/data node on a Kerberized 
> cluster using pyarrow' hdfs API
> --
>
> Key: ARROW-5922
> URL: https://issues.apache.org/jira/browse/ARROW-5922
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Unix
>Reporter: Saurabh Bajaj
>Priority: Major
> Fix For: 0.14.0
>
>
> Here's what I'm trying:
> {{```}}
> {{import pyarrow as pa }}
> {{conf = \{"hadoop.security.authentication": "kerberos"} }}
> {{fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)}}
> {{```}}
> However, when I submit this job to the cluster using {{Dask-YARN}}, I get the 
> following error:
> ```
> {{File "test/run.py", line 3 fs = 
> pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf) File 
> "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py",
>  line 211, in connect File 
> "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py",
>  line 38, in __init__ File "pyarrow/io-hdfs.pxi", line 105, in 
> pyarrow.lib.HadoopFileSystem._connect File "pyarrow/error.pxi", line 83, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS connection failed}}
> {{```}}
> I also tried setting {{host (to a name node)}} and {{port (=8020)}}, however 
> I run into the same error. Since the error is not descriptive, I'm not sure 
> which setting needs to be altered. Any clues anyone?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6231) [Python] Consider assigning default column names when reading CSV file and header_rows=0

2019-08-30 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6231.
---
Resolution: Fixed

Issue resolved by pull request 5206
[https://github.com/apache/arrow/pull/5206]

> [Python] Consider assigning default column names when reading CSV file and 
> header_rows=0
> 
>
> Key: ARROW-6231
> URL: https://issues.apache.org/jira/browse/ARROW-6231
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> This is a slight usability rough edge. Assigning default names (like "f0, f1, 
> ...") would probably be better since then at least you can see how many 
> columns there are and what is in them. 
> {code}
> In [10]: parse_options = csv.ParseOptions(delimiter='|', header_rows=0)   
>   
> 
> In [11]: %time table = csv.read_csv('Performance_2016Q4.txt', 
> parse_options=parse_options)  
> 
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/_csv.pyx in 
> pyarrow._csv.read_csv()
> ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/error.pxi 
> in pyarrow.lib.check_status()
> ArrowInvalid: header_rows == 0 needs explicit column names
> {code}
> In pandas integers are used, so some kind of default string would have to be 
> defined
> {code}
> In [18]: df = pd.read_csv('Performance_2016Q4.txt', sep='|', header=None, 
> low_memory=False) 
> 
> In [19]: df.columns   
>   
> 
> Out[19]: 
> Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 
> 16,
> 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
>dtype='int64')
> {code}
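A sketch of the naming scheme suggested above (pure Python; `default_column_names` is a hypothetical helper, not a pyarrow API):

```python
# Generate Arrow-style default names "f0", "f1", ... for a headerless CSV.
# Unlike pandas' integer labels, these stay strings, which fits Arrow's
# schema model where field names are always strings.
def default_column_names(num_columns):
    return [f"f{i}" for i in range(num_columns)]

print(default_column_names(4))  # -> ['f0', 'f1', 'f2', 'f3']
```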





[jira] [Closed] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14

2019-08-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6380.
---

> Method pyarrow.parquet.read_table has memory spikes from version 0.14
> -
>
> Key: ARROW-6380
> URL: https://issues.apache.org/jira/browse/ARROW-6380
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0, 0.14.1
> Environment: ubuntu 18, 16GB ram, 4 cpus
>Reporter: Renan Alves Fonseca
>Priority: Major
>
> Method pyarrow.parquet.read_table is very slow and causes RAM spikes from
> version 0.14.0.
> Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12
> and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.
> This performance impact is easily measured. However, there is another
> problem that I could only detect on the htop screen. While opening a 40MB
> parquet file, the process occupies almost 16GB for some milliseconds. The pyarrow
> table will result in around 300MB in the Python process (measured using
> memory-profiler). This does not happen in versions 0.13 and previous ones.





[jira] [Created] (ARROW-6398) [C++] consolidate ScanOptions and ScanContext

2019-08-30 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6398:


 Summary: [C++] consolidate ScanOptions and ScanContext
 Key: ARROW-6398
 URL: https://issues.apache.org/jira/browse/ARROW-6398
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


Currently ScanOptions has two distinct responsibilities: it contains the data
selector (and eventually the projection schema) for the current scan, and it serves
as the base class for format-specific scan options.

In addition, we have ScanContext, which holds the memory pool for the current
scan.

I think these classes should be rearranged as follows: ScanOptions will be
removed and FileScanOptions will become the abstract base class for format-specific
scan options. ScanContext will be a concrete struct and contain the data
selector, projection schema, a vector of FileScanOptions, and any other shared
scan state.
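A rough Python sketch of the proposed arrangement (class shapes and field names are illustrative; the real implementation is C++):

```python
from abc import ABC
from dataclasses import dataclass, field

class FileScanOptions(ABC):
    """Abstract base for format-specific scan options (Parquet, CSV, ...)."""

class ParquetScanOptions(FileScanOptions):  # hypothetical subtype
    pass

@dataclass
class ScanContext:
    # Shared per-scan state, consolidated into one concrete struct.
    data_selector: object = None          # filter applied during the scan
    projection_schema: object = None      # columns to materialize
    file_options: list = field(default_factory=list)  # FileScanOptions items

ctx = ScanContext(file_options=[ParquetScanOptions()])
print(isinstance(ctx.file_options[0], FileScanOptions))  # -> True
```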





[jira] [Updated] (ARROW-6397) [C++][CI] Fix S3 minio failure

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6397:
--
Labels: pull-request-available  (was: )

> [C++][CI] Fix S3 minio failure
> --
>
> Key: ARROW-6397
> URL: https://issues.apache.org/jira/browse/ARROW-6397
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Continuous Integration
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>
> See 
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef]





[jira] [Commented] (ARROW-6397) [C++][CI] Fix S3 minio failure

2019-08-30 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919623#comment-16919623
 ] 

Francois Saint-Jacques commented on ARROW-6397:
---

I think the non-empty directory is not affecting the test. The bind error is 
the real issue.

> [C++][CI] Fix S3 minio failure
> --
>
> Key: ARROW-6397
> URL: https://issues.apache.org/jira/browse/ARROW-6397
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Continuous Integration
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> See 
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef]





[jira] [Commented] (ARROW-5618) [C++] [Parquet] Using deprecated Int96 storage for timestamps triggers integer overflow in some cases

2019-08-30 Thread TP Boudreau (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919611#comment-16919611
 ] 

TP Boudreau commented on ARROW-5618:


(Sorry if this is a duplicate comment -- first attempt doesn't seem to have 
posted.)

This issue fell through the cracks on my end -- I'll look into it this
weekend.

> [C++] [Parquet] Using deprecated Int96 storage for timestamps triggers 
> integer overflow in some cases
> -
>
> Key: ARROW-5618
> URL: https://issues.apache.org/jira/browse/ARROW-5618
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: TP Boudreau
>Assignee: TP Boudreau
>Priority: Minor
>  Labels: parquet, pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When storing Arrow timestamps in Parquet files using the Int96 storage 
> format, certain combinations of array lengths and validity bitmasks cause an 
> integer overflow error on read.  It's not immediately clear whether the 
> Arrow/Parquet writer is storing zeroes when it should be storing positive 
> values or the reader is attempting to calculate a nanoseconds value 
> inappropriately from zeroed inputs (perhaps missing the null bit flag). It is also
> not immediately clear why only certain column lengths seem to be affected.
> Probably the quickest way to reproduce this undefined behavior is to alter 
> the existing unit test UseDeprecatedInt96 (in file 
> .../arrow/cpp/src/parquet/arrow/arrow-reader-writer-test.cc) by quadrupling 
> its column lengths (repeating the same values), followed by 'make unittest' 
> using clang-7 with sanitizers enabled.  (Here's a patch applicable to current 
> master that changes the test as described: [1]; I used the following cmake 
> command to build my environment: [2].)  You should get a log something like 
> [3].  If requested, I'll see if I can put together a stand-alone minimal test 
> case that induces the behavior.
> The quick-hack at [4] will prevent integer overflows, but this is only 
> included to confirm the proximate cause of the bug: the Julian days field of 
> the Int96 appears to be zero, when a strictly positive number is expected.
> I've assigned the issue to myself and I'll start looking into the root cause 
> of this.
> [1] https://gist.github.com/tpboudreau/b6610c13cbfede4d6b171da681d1f94e
> [2] https://gist.github.com/tpboudreau/59178ca8cb50a935aab7477805aa32b9
> [3] https://gist.github.com/tpboudreau/0c2d0a18960c1aa04c838fa5c2ac7d2d
> [4] https://gist.github.com/tpboudreau/0993beb5c8c1488028e76fb2ca179b7f
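For context, the deprecated Int96 layout packs 8 little-endian bytes of nanoseconds-within-day followed by a 4-byte Julian day; a zero Julian-day field drives the subtraction below far negative, which is where a signed 64-bit overflow can occur in the C++ reader. A hedged pure-Python model of the decode (the helper name is illustrative, and the layout is the common Impala/Parquet convention, not taken from the Arrow source):

```python
import struct

JULIAN_UNIX_EPOCH = 2440588          # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400_000_000_000

def int96_to_unix_nanos(raw12: bytes) -> int:
    # 8 bytes nanos-within-day (signed), then 4 bytes Julian day (unsigned).
    nanos_in_day, julian_day = struct.unpack("<qI", raw12)
    # With julian_day == 0, this term is about -2.1e20, far outside int64
    # range in C++ (Python integers are unbounded, so no overflow here).
    return (julian_day - JULIAN_UNIX_EPOCH) * NANOS_PER_DAY + nanos_in_day

raw = struct.pack("<qI", 1, JULIAN_UNIX_EPOCH)   # 1 ns past the Unix epoch
print(int96_to_unix_nanos(raw))  # -> 1
```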





[jira] [Assigned] (ARROW-6397) [C++][CI] Fix S3 minio failure

2019-08-30 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6397:
-

Assignee: Francois Saint-Jacques

> [C++][CI] Fix S3 minio failure
> --
>
> Key: ARROW-6397
> URL: https://issues.apache.org/jira/browse/ARROW-6397
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Continuous Integration
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> See 
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef]





[jira] [Updated] (ARROW-6368) [C++] Add RecordBatch projection functionality

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6368:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++] Add RecordBatch projection functionality
> --
>
> Key: ARROW-6368
> URL: https://issues.apache.org/jira/browse/ARROW-6368
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>  Labels: dataset, pull-request-available
>
> define classes RecordBatchProjector (which projects from one schema to 
> another, augmenting with null/constant columns where necessary) and a subtype 
> of RecordBatchIterator which projects each batch yielded by a wrapped 
> iterator.
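The described projection can be modeled in a few lines of Python (a dict of lists standing in for a record batch; the helper is illustrative, not the C++ API):

```python
# Project a batch onto a target column list, padding columns that are
# missing from the source with nulls (the "augmenting" case above).
def project(batch, target_columns, num_rows):
    return {name: batch.get(name, [None] * num_rows) for name in target_columns}

src = {"a": [1, 2], "b": [3, 4]}
print(project(src, ["a", "c"], 2))  # -> {'a': [1, 2], 'c': [None, None]}
```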





[jira] [Updated] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14

2019-08-30 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6380:
---
Fix Version/s: (was: 0.13.0)

> Method pyarrow.parquet.read_table has memory spikes from version 0.14
> -
>
> Key: ARROW-6380
> URL: https://issues.apache.org/jira/browse/ARROW-6380
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0, 0.14.1
> Environment: ubuntu 18, 16GB ram, 4 cpus
>Reporter: Renan Alves Fonseca
>Priority: Major
>
> Method pyarrow.parquet.read_table is very slow and causes RAM spikes from
> version 0.14.0.
> Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12
> and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.
> This performance impact is easily measured. However, there is another
> problem that I could only detect on the htop screen. While opening a 40MB
> parquet file, the process occupies almost 16GB for some milliseconds. The pyarrow
> table will result in around 300MB in the Python process (measured using
> memory-profiler). This does not happen in versions 0.13 and previous ones.





[jira] [Resolved] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14

2019-08-30 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6380.

Resolution: Duplicate

Thanks. This was fixed in ARROW-6060. Please reopen if you find this is still a 
problem in the code on master in the apache/arrow repository.

> Method pyarrow.parquet.read_table has memory spikes from version 0.14
> -
>
> Key: ARROW-6380
> URL: https://issues.apache.org/jira/browse/ARROW-6380
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0, 0.14.1
> Environment: ubuntu 18, 16GB ram, 4 cpus
>Reporter: Renan Alves Fonseca
>Priority: Major
> Fix For: 0.13.0
>
>
> Method pyarrow.parquet.read_table is very slow and causes RAM spikes from
> version 0.14.0.
> Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12
> and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.
> This performance impact is easily measured. However, there is another
> problem that I could only detect on the htop screen. While opening a 40MB
> parquet file, the process occupies almost 16GB for some milliseconds. The pyarrow
> table will result in around 300MB in the Python process (measured using
> memory-profiler). This does not happen in versions 0.13 and previous ones.





[jira] [Resolved] (ARROW-6387) [Archery] Errors with make

2019-08-30 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6387.
---
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5234
[https://github.com/apache/arrow/pull/5234]

> [Archery] Errors with make
> --
>
> Key: ARROW-6387
> URL: https://issues.apache.org/jira/browse/ARROW-6387
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Omer Ozarslan
>Assignee: Omer Ozarslan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{archery --debug benchmark run}} gives an error on Debian 10 (CMake 3.13.4, GNU
> make 4.2.1):
> {code:java}
> (.venv)  omer@omer  ~/src/ext/arrow/cpp/build   master ●  archery --debug 
> benchmark run 
>
> DEBUG:archery:Running benchmark WORKSPACE 
>   
>
> DEBUG:archery:Executing `['/usr/bin/cmake', '-GMake', 
> '-DCMAKE_EXPORT_COMPILE_COMMANDS=ON', '-DCMAKE_BUILD_TYPE=release', 
> '-DBUILD_WARNING_LEVEL=production', '-DARROW_BUILD_TESTS=ON', 
> '-DARROW_BUILD_BENCHMARKS=ON', '-DARROW_PYTHON=OFF', '-DARROW_PARQUET=OFF', 
> '-DARROW_GANDIVA=OFF', '-DARROW_PLASMA=OFF', '-DARROW_FLIGHT=OFF', 
> '/home/omer/src/ext/arrow/cpp']`
> CMake Error: Could not create named generator Make
>   
>   
> 
> Generators
>
>   Unix Makefiles   = Generates standard UNIX makefiles.   
>   
>
>   Ninja= Generates build.ninja files. 
>   
>
>   Watcom WMake = Generates Watcom WMake makefiles.
>   
>
>   CodeBlocks - Ninja   = Generates CodeBlocks project files.  
>   
>
>   CodeBlocks - Unix Makefiles  = Generates CodeBlocks project files.  
>   
>
>   CodeLite - Ninja = Generates CodeLite project files.
>
>   CodeLite - Unix Makefiles= Generates CodeLite project files.
>  
>   Sublime Text 2 - Ninja   = Generates Sublime Text 2 project files.  
> 
>   Sublime Text 2 - Unix Makefiles
>= Generates Sublime Text 2 project files.  
> 
>   Kate - Ninja = Generates Kate project files.
>  
>   Kate - Unix Makefiles= Generates Kate project files.
>   Eclipse CDT4 - Ninja = Generates Eclipse CDT 4.0 project files.
>   Eclipse CDT4 - Unix Makefiles= Generates Eclipse CDT 4.0 project files.
> Traceback (most recent call last):
> [[[cropped]]]{code}
> After trivial fix:
> {code:java}
> diff --git a/dev/archery/archery/utils/cmake.py 
> b/dev/archery/archery/utils/cmake.py
> index 38aedab2d..3150ea9a6 100644
> --- a/dev/archery/archery/utils/cmake.py
> +++ b/dev/archery/archery/utils/cmake.py
> @@ -34,7 +34,7 @@ class CMake(Command):
>  in the search path.
>  """
>  found_ninja = which("ninja")
> -return "Ninja" if found_ninja else "Make"
> +return "Ninja" if found_ninja else "Unix Makefiles"{code}
> I get another error:
> {code:java}
> [[[cropped]]
> -- Generating done
> -- Build files have been written to: /tmp/arrow-bench-48x_yleb/WORKSPACE/build
> DEBUG:archery:Executing `[None]`
> Traceback (most recent call last):
>   File "/home/omer/src/ext/arrow/.venv/bin/archery", line 

[jira] [Assigned] (ARROW-6387) [Archery] Errors with make

2019-08-30 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6387:
-

Assignee: Omer Ozarslan

> [Archery] Errors with make
> --
>
> Key: ARROW-6387
> URL: https://issues.apache.org/jira/browse/ARROW-6387
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Omer Ozarslan
>Assignee: Omer Ozarslan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{archery --debug benchmark run}} gives an error on Debian 10 (CMake 3.13.4, GNU
> make 4.2.1):
> {code:java}
> (.venv)  omer@omer  ~/src/ext/arrow/cpp/build   master ●  archery --debug 
> benchmark run 
>
> DEBUG:archery:Running benchmark WORKSPACE 
>   
>
> DEBUG:archery:Executing `['/usr/bin/cmake', '-GMake', 
> '-DCMAKE_EXPORT_COMPILE_COMMANDS=ON', '-DCMAKE_BUILD_TYPE=release', 
> '-DBUILD_WARNING_LEVEL=production', '-DARROW_BUILD_TESTS=ON', 
> '-DARROW_BUILD_BENCHMARKS=ON', '-DARROW_PYTHON=OFF', '-DARROW_PARQUET=OFF', 
> '-DARROW_GANDIVA=OFF', '-DARROW_PLASMA=OFF', '-DARROW_FLIGHT=OFF', 
> '/home/omer/src/ext/arrow/cpp']`
> CMake Error: Could not create named generator Make
>   
>   
> 
> Generators
>
>   Unix Makefiles   = Generates standard UNIX makefiles.   
>   
>
>   Ninja= Generates build.ninja files. 
>   
>
>   Watcom WMake = Generates Watcom WMake makefiles.
>   
>
>   CodeBlocks - Ninja   = Generates CodeBlocks project files.  
>   
>
>   CodeBlocks - Unix Makefiles  = Generates CodeBlocks project files.  
>   
>
>   CodeLite - Ninja = Generates CodeLite project files.
>
>   CodeLite - Unix Makefiles= Generates CodeLite project files.
>  
>   Sublime Text 2 - Ninja   = Generates Sublime Text 2 project files.  
> 
>   Sublime Text 2 - Unix Makefiles
>= Generates Sublime Text 2 project files.  
> 
>   Kate - Ninja = Generates Kate project files.
>  
>   Kate - Unix Makefiles= Generates Kate project files.
>   Eclipse CDT4 - Ninja = Generates Eclipse CDT 4.0 project files.
>   Eclipse CDT4 - Unix Makefiles= Generates Eclipse CDT 4.0 project files.
> Traceback (most recent call last):
> [[[cropped]]]{code}
> After trivial fix:
> {code:java}
> diff --git a/dev/archery/archery/utils/cmake.py 
> b/dev/archery/archery/utils/cmake.py
> index 38aedab2d..3150ea9a6 100644
> --- a/dev/archery/archery/utils/cmake.py
> +++ b/dev/archery/archery/utils/cmake.py
> @@ -34,7 +34,7 @@ class CMake(Command):
>  in the search path.
>  """
>  found_ninja = which("ninja")
> -return "Ninja" if found_ninja else "Make"
> +return "Ninja" if found_ninja else "Unix Makefiles"{code}
> I get another error:
> {code:java}
> [[[cropped]]
> -- Generating done
> -- Build files have been written to: /tmp/arrow-bench-48x_yleb/WORKSPACE/build
> DEBUG:archery:Executing `[None]`
> Traceback (most recent call last):
>   File "/home/omer/src/ext/arrow/.venv/bin/archery", line 11, in 
> load_entry_point('archery', 'console_scripts', 'archery')()
>   File 
> "/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py", 

[jira] [Updated] (ARROW-6387) [Archery] Errors with make

2019-08-30 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6387:
--
Component/s: Archery

> [Archery] Errors with make
> --
>
> Key: ARROW-6387
> URL: https://issues.apache.org/jira/browse/ARROW-6387
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Omer Ozarslan
>Assignee: Omer Ozarslan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{archery --debug benchmark run}} gives an error on Debian 10 (CMake 3.13.4, GNU
> make 4.2.1):
> {code:java}
> (.venv)  omer@omer  ~/src/ext/arrow/cpp/build   master ●  archery --debug 
> benchmark run 
>
> DEBUG:archery:Running benchmark WORKSPACE 
>   
>
> DEBUG:archery:Executing `['/usr/bin/cmake', '-GMake', 
> '-DCMAKE_EXPORT_COMPILE_COMMANDS=ON', '-DCMAKE_BUILD_TYPE=release', 
> '-DBUILD_WARNING_LEVEL=production', '-DARROW_BUILD_TESTS=ON', 
> '-DARROW_BUILD_BENCHMARKS=ON', '-DARROW_PYTHON=OFF', '-DARROW_PARQUET=OFF', 
> '-DARROW_GANDIVA=OFF', '-DARROW_PLASMA=OFF', '-DARROW_FLIGHT=OFF', 
> '/home/omer/src/ext/arrow/cpp']`
> CMake Error: Could not create named generator Make
>   
>   
> 
> Generators
>
>   Unix Makefiles   = Generates standard UNIX makefiles.   
>   
>
>   Ninja= Generates build.ninja files. 
>   
>
>   Watcom WMake = Generates Watcom WMake makefiles.
>   
>
>   CodeBlocks - Ninja   = Generates CodeBlocks project files.  
>   
>
>   CodeBlocks - Unix Makefiles  = Generates CodeBlocks project files.  
>   
>
>   CodeLite - Ninja = Generates CodeLite project files.
>
>   CodeLite - Unix Makefiles= Generates CodeLite project files.
>  
>   Sublime Text 2 - Ninja   = Generates Sublime Text 2 project files.  
> 
>   Sublime Text 2 - Unix Makefiles
>= Generates Sublime Text 2 project files.  
> 
>   Kate - Ninja = Generates Kate project files.
>  
>   Kate - Unix Makefiles= Generates Kate project files.
>   Eclipse CDT4 - Ninja = Generates Eclipse CDT 4.0 project files.
>   Eclipse CDT4 - Unix Makefiles= Generates Eclipse CDT 4.0 project files.
> Traceback (most recent call last):
> [[[cropped]]]{code}
> After trivial fix:
> {code:java}
> diff --git a/dev/archery/archery/utils/cmake.py 
> b/dev/archery/archery/utils/cmake.py
> index 38aedab2d..3150ea9a6 100644
> --- a/dev/archery/archery/utils/cmake.py
> +++ b/dev/archery/archery/utils/cmake.py
> @@ -34,7 +34,7 @@ class CMake(Command):
>  in the search path.
>  """
>  found_ninja = which("ninja")
> -return "Ninja" if found_ninja else "Make"
> +return "Ninja" if found_ninja else "Unix Makefiles"{code}
> I get another error:
> {code:java}
> [[[cropped]]
> -- Generating done
> -- Build files have been written to: /tmp/arrow-bench-48x_yleb/WORKSPACE/build
> DEBUG:archery:Executing `[None]`
> Traceback (most recent call last):
>   File "/home/omer/src/ext/arrow/.venv/bin/archery", line 11, in 
> load_entry_point('archery', 'console_scripts', 'archery')()
>   File 
> 

[jira] [Created] (ARROW-6397) [C++][CI] Fix S3 minio failure

2019-08-30 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6397:
-

 Summary: [C++][CI] Fix S3 minio failure
 Key: ARROW-6397
 URL: https://issues.apache.org/jira/browse/ARROW-6397
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Continuous Integration
Reporter: Francois Saint-Jacques


See 
[https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef]





[jira] [Updated] (ARROW-6341) [Python] Implement low-level bindings for Dataset

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6341:
--
Labels: dataset pull-request-available  (was: dataset)

> [Python] Implement low-level bindings for Dataset
> -
>
> Key: ARROW-6341
> URL: https://issues.apache.org/jira/browse/ARROW-6341
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Francois Saint-Jacques
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: dataset, pull-request-available
>
> The following classes should be accessible from Python:
> * class DataSource
> * class DataFragment
> * function DiscoverySource
> * class ScanContext, ScanOptions, ScanTask
> * class Dataset
> * class ScannerBuilder
> * class Scanner
> The end result is reading a directory of parquet files as a single stream.





[jira] [Updated] (ARROW-6341) [Python] Implement low-level bindings for Dataset

2019-08-30 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-6341:
---
Summary: [Python] Implement low-level bindings for Dataset  (was: [Python] 
Implements low-level bindings to Dataset classes:)

> [Python] Implement low-level bindings for Dataset
> -
>
> Key: ARROW-6341
> URL: https://issues.apache.org/jira/browse/ARROW-6341
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Francois Saint-Jacques
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: dataset
>
> The following classes should be accessible from Python:
> * class DataSource
> * class DataFragment
> * function DiscoverySource
> * class ScanContext, ScanOptions, ScanTask
> * class Dataset
> * class ScannerBuilder
> * class Scanner
> The end result is reading a directory of parquet files as a single stream.





[jira] [Updated] (ARROW-6344) [C++][Gandiva] substring does not handle multibyte characters

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6344:
--
Labels: pull-request-available  (was: )

> [C++][Gandiva] substring does not handle multibyte characters
> -
>
> Key: ARROW-6344
> URL: https://issues.apache.org/jira/browse/ARROW-6344
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (ARROW-6396) [C++] Add CompareOptions to Compare kernels

2019-08-30 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6396:
--
Description: This would add an enum ResolveNull \{ KLEENE_LOGIC, 
NULL_PROPAGATE } to define the behavior of merging with AND/OR operators on 
boolean.  (was: This would add an enum ResolveNull \{ KLEENE_LOGIC, 
NULL_PROPAGATE }.)

> [C++] Add CompareOptions to Compare kernels
> ---
>
> Key: ARROW-6396
> URL: https://issues.apache.org/jira/browse/ARROW-6396
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE } to 
> define the behavior of merging with AND/OR operators on boolean.
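A small Python sketch of the difference between the two proposed modes (the enum names follow the description; everything else is illustrative, not the C++ kernel API):

```python
from enum import Enum

class ResolveNull(Enum):
    KLEENE_LOGIC = 1     # null AND false == false, null OR true == true
    NULL_PROPAGATE = 2   # any null operand yields null

def logical_and(a, b, mode):
    if mode is ResolveNull.NULL_PROPAGATE and (a is None or b is None):
        return None
    if a is False or b is False:   # Kleene: a definite False dominates null
        return False
    if a is None or b is None:
        return None
    return True

print(logical_and(None, False, ResolveNull.KLEENE_LOGIC))    # -> False
print(logical_and(None, False, ResolveNull.NULL_PROPAGATE))  # -> None
```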





[jira] [Created] (ARROW-6396) [C++] Add CompareOptions to Compare kernels

2019-08-30 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6396:
-

 Summary: [C++] Add CompareOptions to Compare kernels
 Key: ARROW-6396
 URL: https://issues.apache.org/jira/browse/ARROW-6396
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Francois Saint-Jacques


This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE }.





[jira] [Commented] (ARROW-3571) [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions

2019-08-30 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919388#comment-16919388
 ] 

Krisztian Szucs commented on ARROW-3571:


Yeah, I've moved the crossbow README to sphinx. Do you mean the whole release 
management guide?

> [Wiki] Release management guide does not explain how to set up Crossbow or 
> where to find instructions
> -
>
> Key: ARROW-3571
> URL: https://issues.apache.org/jira/browse/ARROW-3571
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>
> If you follow the guide, at one point it says "Launch a Crossbow build" but 
> provides no link to the setup instructions for this





[jira] [Commented] (ARROW-6395) [pyarrow] Bug when using bool arrays with stride greater than 1

2019-08-30 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919377#comment-16919377
 ] 

Igor Yastrebov commented on ARROW-6395:
---

[~jorisvandenbossche] is this solved by 
[ARROW-6325|https://issues.apache.org/jira/browse/ARROW-6325]?

> [pyarrow] Bug when using bool arrays with stride greater than 1
> ---
>
> Key: ARROW-6395
> URL: https://issues.apache.org/jira/browse/ARROW-6395
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Philip Felton
>Priority: Major
>
> Here's code to reproduce it:
> {code:python}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.14.0'
> >>> xs = np.array([True, False, False, True, True, False, True, True, True, 
> >>> False, False, False, False, False, True, False, True, True, True, True, 
> >>> True])
> >>> xs_sliced = xs[0::2]
> >>> xs_sliced
> array([ True, False, True, True, True, False, False, True, True,
>  True, True])
> >>> pa_xs = pa.array(xs_sliced, pa.bool_())
> >>> pa_xs
> 
> [
>  true,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false
> ]{code}





[jira] [Created] (ARROW-6395) [pyarrow] Bug when using bool arrays with stride greater than 1

2019-08-30 Thread Philip Felton (Jira)
Philip Felton created ARROW-6395:


 Summary: [pyarrow] Bug when using bool arrays with stride greater 
than 1
 Key: ARROW-6395
 URL: https://issues.apache.org/jira/browse/ARROW-6395
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Philip Felton


Here's code to reproduce it:

{code:python}
>>> import numpy as np
>>> import pyarrow as pa
>>> pa.__version__
'0.14.0'
>>> xs = np.array([True, False, False, True, True, False, True, True, True, 
>>> False, False, False, False, False, True, False, True, True, True, True, 
>>> True])
>>> xs_sliced = xs[0::2]
>>> xs_sliced
array([ True, False, True, True, True, False, False, True, True,
 True, True])
>>> pa_xs = pa.array(xs_sliced, pa.bool_())
>>> pa_xs

[
 true,
 false,
 false,
 false,
 false,
 false,
 false,
 false,
 false,
 false,
 false
]{code}
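A pure-Python model of what a stride-ignoring copy does to the data above (illustrative only; the real Arrow buffers are bit-packed and the conversion is in C++):

```python
# A correct copy honors the source stride; the buggy variant assumes the
# input is contiguous and just takes the first n elements, which matches
# the wrong values seen in pa_xs above.
def copy_with_stride(buf, stride, n):
    return [buf[i * stride] for i in range(n)]

def copy_ignoring_stride(buf, stride, n):
    return buf[:n]  # bug: stride dropped

xs = [True, False, False, True, True, False, True, True, True, False, False]
print(copy_with_stride(xs, 2, 6))      # every other element, as sliced
print(copy_ignoring_stride(xs, 2, 6))  # first six elements instead
```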






[jira] [Commented] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14

2019-08-30 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919370#comment-16919370
 ] 

Igor Yastrebov commented on ARROW-6380:
---

Is it a duplicate of 
[ARROW-6059|https://issues.apache.org/jira/browse/ARROW-6059]?

> Method pyarrow.parquet.read_table has memory spikes from version 0.14
> -
>
> Key: ARROW-6380
> URL: https://issues.apache.org/jira/browse/ARROW-6380
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0, 0.14.1
> Environment: ubuntu 18, 16GB ram, 4 cpus
>Reporter: Renan Alves Fonseca
>Priority: Major
> Fix For: 0.13.0
>
>
> Method pyarrow.parquet.read_table is very slow and causes RAM spikes from 
> version 0.14.0.
> Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 
> and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.
> This performance impact is easily measured. However, there is another 
> problem that I could only detect on the htop screen. While opening a 40MB 
> parquet file, the process occupies almost 16GB for some milliseconds. The 
> pyarrow table ends up at around 300MB in the Python process (measured with 
> memory-profiler). This does not happen in versions 0.13 and earlier.





[jira] [Resolved] (ARROW-6144) [C++][Gandiva] Implement random function in Gandiva

2019-08-30 Thread Praveen Kumar Desabandu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar Desabandu resolved ARROW-6144.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5022
[https://github.com/apache/arrow/pull/5022]

> [C++][Gandiva] Implement random function in Gandiva
> ---
>
> Key: ARROW-6144
> URL: https://issues.apache.org/jira/browse/ARROW-6144
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Implement random(), random(int seed) functions. The values are sampled from a 
> uniform distribution on (0, 1). The random values for each row of a column are 
> generated from the same generator, which is initialised at (function) build time.
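A minimal sketch of the behaviour described above, in Python rather than Gandiva's C++ (the function names here are illustrative assumptions, not Gandiva's API): one generator is initialised at "build" time and then sampled once per row.

```python
import random

def make_random_fn(seed=None):
    """Build-time setup: seed=None models random(), an int models random(seed)."""
    rng = random.Random(seed)             # initialised once, at build time
    return lambda: rng.uniform(0.0, 1.0)  # uniform on (0, 1), one draw per row

next_random = make_random_fn(seed=42)
column = [next_random() for _ in range(5)]   # one value per row
assert all(0.0 <= x < 1.0 for x in column)

# The same seed rebuilds the same sequence:
assert make_random_fn(42)() == make_random_fn(42)()
```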





[jira] [Updated] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6394:
--
Labels: pull-request-available  (was: )

> [Java] Support conversions between delta vector and partial sum vector
> --
>
> Key: ARROW-6394
> URL: https://issues.apache.org/jira/browse/ARROW-6394
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> What is a delta vector/partial sum vector?
> Given an integer vector a with length n, its partial sum vector is another 
> integer vector b with length n + 1, with values defined as:
> b(0) = initial sum
> b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n
> Given an integer vector a with length n + 1, its delta vector is another 
> integer vector b with length n, with values defined as:
> b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1
> In this issue, we provide utilities to convert between delta vectors and 
> partial sum vectors. It is interesting to note that the two operations 
> correspond to discrete integration and differentiation.
> These conversions have wide applications. For example,
> 1. The run-length vector proposed by Micah is based on the partial sum 
> vector, while the deduplication functionality is based on the delta vector. 
> This issue provides conversions between them.
> 2. The current VarCharVector/VarBinaryVector implementations are based on 
> the partial sum vector. We can transform them to delta vectors before IPC, 
> to reduce network traffic.
> 3. Converting to deltas can be considered a form of data compression. The 
> operation can be applied more than once to further reduce the data volume.





[jira] [Commented] (ARROW-5995) [Python] pyarrow: hdfs: support file checksum

2019-08-30 Thread Ruslan Kuprieiev (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919311#comment-16919311
 ] 

Ruslan Kuprieiev commented on ARROW-5995:
-

Btw, [~wesmckinn] [~npr], what are your thoughts on this?

> [Python] pyarrow: hdfs: support file checksum
> -
>
> Key: ARROW-5995
> URL: https://issues.apache.org/jira/browse/ARROW-5995
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ruslan Kuprieiev
>Priority: Minor
>
> I was not able to find how to retrieve the checksum (`getFileChecksum` or 
> `hadoop fs/dfs -checksum`) for a file on HDFS. Judging by how it is 
> implemented in the hadoop CLI [1], it looks like we will also need to 
> implement it manually in pyarrow. Please correct me if I'm missing something. 
> Is this feature desirable? Or was there a good reason why it wasn't implemented already?
>  [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]





[jira] [Commented] (ARROW-5995) [Python] pyarrow: hdfs: support file checksum

2019-08-30 Thread Ruslan Kuprieiev (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919309#comment-16919309
 ] 

Ruslan Kuprieiev commented on ARROW-5995:
-

You are right, such a hackish approach would probably not pass review. But it 
might be a good temporary pure-Python workaround if parsing those metafiles is 
comparatively simple, so we don't have to mess around with our own C library, 
for which we would have to ship wheels (which is a hassle). And with that 
workaround in place, we could submit proper patches and patiently wait for them 
to get merged into libhdfs and pyarrow. If the workaround is hard to implement, 
we could skip it and keep using the hadoop CLI as we do right now, focusing on 
proper patches to libhdfs and pyarrow. What do you think? :)

> [Python] pyarrow: hdfs: support file checksum
> -
>
> Key: ARROW-5995
> URL: https://issues.apache.org/jira/browse/ARROW-5995
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ruslan Kuprieiev
>Priority: Minor
>
> I was not able to find how to retrieve the checksum (`getFileChecksum` or 
> `hadoop fs/dfs -checksum`) for a file on HDFS. Judging by how it is 
> implemented in the hadoop CLI [1], it looks like we will also need to 
> implement it manually in pyarrow. Please correct me if I'm missing something. 
> Is this feature desirable? Or was there a good reason why it wasn't implemented already?
>  [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]





[jira] [Updated] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector

2019-08-30 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-6394:

Description: 
What is a delta vector/partial sum vector?

Given an integer vector a with length n, its partial sum vector is another 
integer vector b with length n + 1, with values defined as:

b(0) = initial sum
b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n

Given an integer vector a with length n + 1, its delta vector is another 
integer vector b with length n, with values defined as:

b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1

In this issue, we provide utilities to convert between delta vectors and 
partial sum vectors. It is interesting to note that the two operations 
correspond to discrete integration and differentiation.

These conversions have wide applications. For example,

1. The run-length vector proposed by Micah is based on the partial sum vector, 
while the deduplication functionality is based on the delta vector. This issue 
provides conversions between them.

2. The current VarCharVector/VarBinaryVector implementations are based on the 
partial sum vector. We can transform them to delta vectors before IPC, to 
reduce network traffic.

3. Converting to deltas can be considered a form of data compression. The 
operation can be applied more than once to further reduce the data volume.

  was:
What is a delta vector/partial sum vector?

Given an integer vector a with length n, its partial sum vector is another 
integer vector b with length n + 1, with values defined as:

b(0) = initial sum
b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n

Given an integer vector a with length n + 1, its delta vector is another 
integer vector b with length n, with values defined as:

b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1

In this issue, we provide utilities to convert between delta vectors and 
partial sum vectors. It is interesting to note that the two operations 
correspond to discrete integration and differentiation.

These conversions have wide applications. For example,

1. The run-length vector proposed by Micah is based on the partial sum vector, 
while the deduplication functionality is based on the delta vector. This issue 
provides conversions between them.

2. The current VarCharVector/VarBinaryVector implementations are based on the 
partial sum vector. We can transform them to delta vectors before IPC, to 
reduce network traffic.

3. Converting to deltas can be considered a form of data compression. The 
operation can be applied more than once to further reduce the data volume.


> [Java] Support conversions between delta vector and partial sum vector
> --
>
> Key: ARROW-6394
> URL: https://issues.apache.org/jira/browse/ARROW-6394
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>
> What is a delta vector/partial sum vector?
> Given an integer vector a with length n, its partial sum vector is another 
> integer vector b with length n + 1, with values defined as:
> b(0) = initial sum
> b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n
> Given an integer vector a with length n + 1, its delta vector is another 
> integer vector b with length n, with values defined as:
> b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1
> In this issue, we provide utilities to convert between delta vectors and 
> partial sum vectors. It is interesting to note that the two operations 
> correspond to discrete integration and differentiation.
> These conversions have wide applications. For example,
> 1. The run-length vector proposed by Micah is based on the partial sum 
> vector, while the deduplication functionality is based on the delta vector. 
> This issue provides conversions between them.
> 2. The current VarCharVector/VarBinaryVector implementations are based on 
> the partial sum vector. We can transform them to delta vectors before IPC, 
> to reduce network traffic.
> 3. Converting to deltas can be considered a form of data compression. The 
> operation can be applied more than once to further reduce the data volume.





[jira] [Created] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector

2019-08-30 Thread Liya Fan (Jira)
Liya Fan created ARROW-6394:
---

 Summary: [Java] Support conversions between delta vector and 
partial sum vector
 Key: ARROW-6394
 URL: https://issues.apache.org/jira/browse/ARROW-6394
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


What is a delta vector/partial sum vector?

Given an integer vector a with length n, its partial sum vector is another 
integer vector b with length n + 1, with values defined as:

b(0) = initial sum
b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n

Given an integer vector a with length n + 1, its delta vector is another 
integer vector b with length n, with values defined as:

b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1

In this issue, we provide utilities to convert between delta vectors and 
partial sum vectors. It is interesting to note that the two operations 
correspond to discrete integration and differentiation.

These conversions have wide applications. For example,

1. The run-length vector proposed by Micah is based on the partial sum vector, 
while the deduplication functionality is based on the delta vector. This issue 
provides conversions between them.

2. The current VarCharVector/VarBinaryVector implementations are based on the 
partial sum vector. We can transform them to delta vectors before IPC, to 
reduce network traffic.

3. Converting to deltas can be considered a form of data compression. The 
operation can be applied more than once to further reduce the data volume.
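The two conversions described in this issue can be sketched in a few lines. A minimal pure-Python illustration follows (the actual utilities are Java; the function names here are illustrative, and the offsets example mirrors how VarCharVector stores value lengths as partial sums):

```python
def to_partial_sums(deltas, initial_sum=0):
    """b(0) = initial sum; b(i) = a(0) + ... + a(i - 1) for i = 1..n."""
    sums = [initial_sum]
    for d in deltas:
        sums.append(sums[-1] + d)
    return sums

def to_deltas(sums):
    """b(i) = a(i + 1) - a(i), recovering the original vector."""
    return [sums[i + 1] - sums[i] for i in range(len(sums) - 1)]

# e.g. per-value lengths -> VarCharVector-style offsets, and back
offsets = to_partial_sums([3, 1, 4, 1, 5])
assert offsets == [0, 3, 4, 8, 9, 14]
assert to_deltas(offsets) == [3, 1, 4, 1, 5]
```

The round trip shows the "discrete integration / differentiation" relationship: each conversion is the inverse of the other, up to the stored initial sum.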


