[jira] [Created] (ARROW-6791) Memory Leak

2019-10-04 Thread George Prichard (Jira)
George Prichard created ARROW-6791:
--

 Summary: Memory Leak 
 Key: ARROW-6791
 URL: https://issues.apache.org/jira/browse/ARROW-6791
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1, 0.14.0
 Environment: Ubuntu 18.04, 32GB ram, conda-forge installation
Reporter: George Prichard


Memory leak with large string columns crashes the program. This only seems to 
affect 0.14.x  - it works fine for me in 0.13.0. It might be related to earlier 
similar issues? e.g. [https://github.com/apache/arrow/issues/2624]

Below is a reprex which works in earlier versions, but crashes on read (writing 
is fine) in this one. The real-life version of the data is full of URLs as the 
strings. 

Weirdly it crashes my 32GB Ubuntu 18.04, but runs (if very slowly for the read) 
on my 16GB Macbook. 

Thanks so much for the excellent tools! 

 

 
{code:java}
import pandas as pd
n_rows = int(1e6)
n_cols = 10
col_length = 100
df = pd.DataFrame()
for i in range(n_cols):
 df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)
print('Generated df', df.shape)
filename = 'tmp.parquet'
print('Writing parquet')
df.to_parquet(filename)
print('Reading parquet')
pd.read_parquet(filename)
{code}
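
For scale, the raw payload of this reprex can be estimated with a little arithmetic. The sketch below is an illustration rather than part of the report: the function name is hypothetical, and it assumes one byte per character (as for ASCII data) plus an Arrow-style string layout with one 32-bit offset per row and one extra per column.

```python
def estimate_string_payload_bytes(n_rows, n_cols, col_length):
    """Rough lower bound for the string data in the reprex.

    Assumes 1 byte per character (ASCII) and, per column, an int32
    offsets buffer with n_rows + 1 entries. Validity bitmaps and
    allocator padding are ignored.
    """
    data_bytes = n_rows * col_length      # character data per column
    offset_bytes = (n_rows + 1) * 4       # int32 offsets per column
    return n_cols * (data_bytes + offset_bytes)


total = estimate_string_payload_bytes(n_rows=int(1e6), n_cols=10, col_length=100)
print(f"{total / 1e9:.2f} GB")   # 1.04 GB
```

Python string objects add per-object overhead on top of this figure, but a payload of roughly 1 GB suggests that exhausting 32 GB on read comes from the reader rather than from the size of the data itself.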
 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6580) [Java] Support comparison for unsigned integers

2019-10-04 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-6580.
--
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5405
[https://github.com/apache/arrow/pull/5405]

> [Java] Support comparison for unsigned integers
> ---
>
> Key: ARROW-6580
> URL: https://issues.apache.org/jira/browse/ARROW-6580
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In this issue, we support the comparison of unsigned integer vectors, 
> including UInt1Vector, UInt2Vector, UInt4Vector, and UInt8Vector.
> With support for comparison for these vectors, the sort for them is also 
> supported automatically.
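
One common way to compare unsigned values held in signed storage is to mask each value to its bit width before comparing. The sketch below illustrates that idea in Python; it is not the code from the pull request, and the function name and `bits` parameter are assumptions made for the example.

```python
def compare_unsigned(a, b, bits):
    """Compare two integers as unsigned values of the given bit width.

    Returns -1, 0, or 1. Masking maps a negative two's-complement value
    to its unsigned interpretation (e.g. -1 becomes 255 for 8 bits).
    """
    mask = (1 << bits) - 1
    ua, ub = a & mask, b & mask
    return (ua > ub) - (ua < ub)


# A signed byte of -1 holds the bit pattern 0xFF, i.e. 255 when unsigned.
print(compare_unsigned(-1, 1, bits=8))    # 1, because 255 > 1

# Sorting by the unsigned interpretation follows from the same masking.
values = [-1, 0, 127, -128]
print(sorted(values, key=lambda v: v & 0xFF))   # [0, 127, -128, -1]
```

Once a comparison of this kind exists, a generic sort only needs to call it, which is why sorting comes for free with comparison support.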





[jira] [Updated] (ARROW-6791) Memory Leak

2019-10-04 Thread George Prichard (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Prichard updated ARROW-6791:
---
Description: 
Memory leak with large string columns crashes the program. This only seems to 
affect 0.14.x  - it works fine for me in 0.13.0. It might be related to earlier 
similar issues? e.g. [https://github.com/apache/arrow/issues/2624]

Below is a reprex which works in earlier versions, but crashes on read (writing 
is fine) in this one. The real-life version of the data is full of URLs as the 
strings. 

Weirdly it crashes my 32GB Ubuntu 18.04, but runs (if very slowly for the read) 
on my 16GB Macbook. 

Thanks so much for the excellent tools! 

 

 
{code:java}
import pandas as pd

n_rows = int(1e6)
n_cols = 10
col_length = 100

df = pd.DataFrame()

for i in range(n_cols):
    df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)

print('Generated df', df.shape)
filename = 'tmp.parquet'

print('Writing parquet')
df.to_parquet(filename)

print('Reading parquet')
pd.read_parquet(filename)
{code}
 

 

 

 

 

  was:
Memory leak with large string columns crashes the program. This only seems to 
affect 0.14.x  - it works fine for me in 0.13.0. It might be related to earlier 
similar issues? e.g. [https://github.com/apache/arrow/issues/2624]

Below is a reprex which works in earlier versions, but crashes on read (writing 
is fine) in this one. The real-life version of the data is full of URLs as the 
strings. 

Weirdly it crashes my 32GB Ubuntu 18.04, but runs (if very slowly for the read) 
on my 16GB Macbook. 

Thanks so much for the excellent tools! 

 

 
{code:java}
import pandas as pd
n_rows = int(1e6)
n_cols = 10
col_length = 100
df = pd.DataFrame()
for i in range(n_cols):
 df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)
print('Generated df', df.shape)
filename = 'tmp.parquet'
print('Writing parquet')
df.to_parquet(filename)
print('Reading parquet')
pd.read_parquet(filename)
{code}
 

 

 

 

 


> Memory Leak 
> 
>
> Key: ARROW-6791
> URL: https://issues.apache.org/jira/browse/ARROW-6791
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1
> Environment: Ubuntu 18.04, 32GB ram, conda-forge installation
>Reporter: George Prichard
>Priority: Major
>
> Memory leak with large string columns crashes the program. This only seems to 
> affect 0.14.x  - it works fine for me in 0.13.0. It might be related to 
> earlier similar issues? e.g. [https://github.com/apache/arrow/issues/2624]
> Below is a reprex which works in earlier versions, but crashes on read 
> (writing is fine) in this one. The real-life version of the data is full of 
> URLs as the strings. 
> Weirdly it crashes my 32GB Ubuntu 18.04, but runs (if very slowly for the 
> read) on my 16GB Macbook. 
> Thanks so much for the excellent tools! 
>  
>  
> {code:java}
> import pandas as pd
> n_rows = int(1e6)
> n_cols = 10
> col_length = 100
> df = pd.DataFrame()
> for i in range(n_cols):
>     df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)
> print('Generated df', df.shape)
> filename = 'tmp.parquet'
> print('Writing parquet')
> df.to_parquet(filename)
> print('Reading parquet')
> pd.read_parquet(filename)
> {code}





[jira] [Updated] (ARROW-6274) [Rust] [DataFusion] Add support for writing results to CSV

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6274:
--
Labels: beginner pull-request-available  (was: beginner)

> [Rust] [DataFusion] Add support for writing results to CSV
> --
>
> Key: ARROW-6274
> URL: https://issues.apache.org/jira/browse/ARROW-6274
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Hengruo Zhang
>Priority: Major
>  Labels: beginner, pull-request-available
>
> There is currently no simple way to write query results to CSV. It would be 
> good to have convenience methods, either in ExecutionContext or as separate 
> utility methods, to enable results to be written in CSV format to stdout or to 
> a file.
> There is sample code in unit tests for this. The approach is to iterate 
> over each row in a batch, then iterate over each column, downcast it to 
> an appropriate type (based on the schema associated with the batch), and 
> pull out the value for the row.
> See 
> [https://github.com/apache/arrow/blob/master/rust/datafusion/tests/sql.rs#L425-L497]
>  for example code in a test.
>  
>  
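
The row-then-column iteration described above can be sketched without any Arrow dependency. The example below is in Python rather than Rust and uses plain lists as a stand-in for a record batch; the `Batch` class, its fields, and `write_batch_csv` are hypothetical names for illustration, not DataFusion APIs.

```python
import csv
import io
from dataclasses import dataclass
from typing import Any, List, TextIO


@dataclass
class Batch:
    """Stand-in for a record batch: column names plus equal-length columns."""
    names: List[str]
    columns: List[List[Any]]

    @property
    def num_rows(self) -> int:
        return len(self.columns[0]) if self.columns else 0


def write_batch_csv(batch: Batch, out: TextIO, header: bool = True) -> None:
    """Write one batch as CSV: iterate rows, then pull each column's value."""
    writer = csv.writer(out, lineterminator="\n")
    if header:
        writer.writerow(batch.names)
    for row in range(batch.num_rows):
        # A null (None) is written as an empty field.
        writer.writerow(["" if col[row] is None else col[row] for col in batch.columns])


batch = Batch(names=["city", "lat"], columns=[["Oslo", "Rome"], [59.9, None]])
buf = io.StringIO()
write_batch_csv(batch, buf)
print(buf.getvalue(), end="")
```

Passing `sys.stdout` instead of the `StringIO` buffer would cover the write-to-stdout case mentioned in the issue, and an open file handle would cover the write-to-file case.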





[jira] [Updated] (ARROW-6778) [C++] Support DurationType in Cast kernel

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6778:
--
Labels: pull-request-available  (was: )

> [C++] Support DurationType in Cast kernel
> -
>
> Key: ARROW-6778
> URL: https://issues.apache.org/jira/browse/ARROW-6778
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>
> Currently, duration is not yet supported in basic cast operations (using the 
> Python binding from ARROW-5855, currently from my branch, not yet merged):
> {code}
> In [25]: arr = pa.array([1, 2])
> In [26]: arr.cast(pa.duration('s'))  
> ...
> ArrowNotImplementedError: No cast implemented from int64 to duration[s]
> In [27]: arr = pa.array([1, 2], pa.duration('s'))  
> In [28]: arr.cast(pa.duration('ms'))
> ...
> ArrowNotImplementedError: No cast implemented from duration[s] to duration[ms]
> {code}
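
Semantically, a cast between duration units is a rescaling of the underlying 64-bit integers. The sketch below shows that arithmetic in plain Python under stated assumptions: the function name is hypothetical, lossy casts to a coarser unit are rejected rather than truncated, and results are checked against the int64 range. It does not reflect how the Arrow kernel is written.

```python
# Number of nanoseconds per unit, for the four Arrow time units.
NANOS_PER_UNIT = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def cast_duration(values, from_unit, to_unit):
    """Rescale integer durations from one unit to another.

    Raises ValueError if a value would lose precision or overflow int64.
    """
    src, dst = NANOS_PER_UNIT[from_unit], NANOS_PER_UNIT[to_unit]
    result = []
    for v in values:
        if v is None:            # nulls pass through unchanged
            result.append(None)
            continue
        nanos = v * src
        if nanos % dst != 0:
            raise ValueError(f"{v}{from_unit} is not a whole number of {to_unit}")
        scaled = nanos // dst
        if not INT64_MIN <= scaled <= INT64_MAX:
            raise ValueError(f"{v}{from_unit} overflows int64 in {to_unit}")
        result.append(scaled)
    return result


print(cast_duration([1, 2], "s", "ms"))   # [1000, 2000]
```

Whether a real kernel should truncate, round, or raise on lossy casts is a design choice; raising is used here only to keep the sketch conservative.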





[jira] [Assigned] (ARROW-6695) [Rust] [DataFusion] Remove execution of logical plan

2019-10-04 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-6695:
-

Assignee: Andy Grove

> [Rust] [DataFusion] Remove execution of logical plan
> 
>
> Key: ARROW-6695
> URL: https://issues.apache.org/jira/browse/ARROW-6695
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> Remove execution of logical plan





[jira] [Assigned] (ARROW-6694) [Rust] [DataFusion] Update integration tests to use physical plan

2019-10-04 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-6694:
-

Assignee: Andy Grove

> [Rust] [DataFusion] Update integration tests to use physical plan
> -
>
> Key: ARROW-6694
> URL: https://issues.apache.org/jira/browse/ARROW-6694
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> Update integration tests to use physical query plan (once all features are 
> supported)





[jira] [Resolved] (ARROW-6658) [Rust] [DataFusion] Implement AVG aggregate expression

2019-10-04 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-6658.
---
Resolution: Fixed

Issue resolved by pull request 5558
[https://github.com/apache/arrow/pull/5558]

> [Rust] [DataFusion] Implement AVG aggregate expression
> --
>
> Key: ARROW-6658
> URL: https://issues.apache.org/jira/browse/ARROW-6658
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Implement AVG aggregate expression. See COUNT and SUM for inspiration.
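
An AVG accumulator is usually built from the same running state as SUM and COUNT. The sketch below shows that shape in Python rather than Rust; the class and method names are hypothetical and do not mirror DataFusion's traits, and nulls are assumed to be skipped, as in SQL.

```python
class AvgAccumulator:
    """Running average over batches, kept as a sum and a non-null count."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update_batch(self, values):
        """Fold one batch of values into the state, skipping nulls (None)."""
        for v in values:
            if v is not None:
                self.total += v
                self.count += 1

    def evaluate(self):
        """Return the average, or None when no non-null value was seen."""
        return self.total / self.count if self.count else None


acc = AvgAccumulator()
acc.update_batch([1, 2, None])
acc.update_batch([3])
print(acc.evaluate())   # 2.0
```

Keeping the sum and the count separate, rather than a running mean, lets partial states from several batches be combined by simple addition.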





[jira] [Assigned] (ARROW-6692) [Rust] [DataFusion] Update examples to use physical query plan

2019-10-04 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-6692:
-

Assignee: Andy Grove

> [Rust] [DataFusion] Update examples to use physical query plan
> --
>
> Key: ARROW-6692
> URL: https://issues.apache.org/jira/browse/ARROW-6692
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> Update examples to use physical query plan





[jira] [Updated] (ARROW-6692) [Rust] [DataFusion] Update examples to use physical query plan

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6692:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Update examples to use physical query plan
> --
>
> Key: ARROW-6692
> URL: https://issues.apache.org/jira/browse/ARROW-6692
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Update examples to use physical query plan





[jira] [Updated] (ARROW-6694) [Rust] [DataFusion] Update integration tests to use physical plan

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6694:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Update integration tests to use physical plan
> -
>
> Key: ARROW-6694
> URL: https://issues.apache.org/jira/browse/ARROW-6694
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Update integration tests to use physical query plan (once all features are 
> supported)





[jira] [Assigned] (ARROW-6693) [Rust] [DataFusion] Update unit tests to use physical query plan

2019-10-04 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-6693:
-

Assignee: Andy Grove

> [Rust] [DataFusion] Update unit tests to use physical query plan
> 
>
> Key: ARROW-6693
> URL: https://issues.apache.org/jira/browse/ARROW-6693
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> Update unit tests to use physical query plan (once all features are supported)





[jira] [Updated] (ARROW-6695) [Rust] [DataFusion] Remove execution of logical plan

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6695:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Remove execution of logical plan
> 
>
> Key: ARROW-6695
> URL: https://issues.apache.org/jira/browse/ARROW-6695
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Remove execution of logical plan





[jira] [Resolved] (ARROW-6656) [Rust] [DataFusion] Implement MIN and MAX aggregate expressions

2019-10-04 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-6656.
---
Resolution: Fixed

Issue resolved by pull request 5557
[https://github.com/apache/arrow/pull/5557]

> [Rust] [DataFusion] Implement MIN and MAX aggregate expressions
> ---
>
> Key: ARROW-6656
> URL: https://issues.apache.org/jira/browse/ARROW-6656
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Implement MIN and MAX aggregate expressions. See the SUM implementation for 
> inspiration.





[jira] [Resolved] (ARROW-6657) [Rust] [DataFusion] Implement COUNT aggregate expression

2019-10-04 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-6657.
---
Resolution: Fixed

Issue resolved by pull request 5513
[https://github.com/apache/arrow/pull/5513]

> [Rust] [DataFusion] Implement COUNT aggregate expression
> 
>
> Key: ARROW-6657
> URL: https://issues.apache.org/jira/browse/ARROW-6657
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Implement COUNT aggregate expressions. See the SUM implementation for 
> inspiration.





[jira] [Commented] (ARROW-6766) [Python] libarrow_python..dylib does not exist

2019-10-04 Thread Tarek Allam (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944619#comment-16944619
 ] 

Tarek Allam commented on ARROW-6766:


Interesting. After your comment [~wesm], I looked further into the stderr and 
found:

{{...}}
{{-- Added shared library dependency arrow_shared: /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow.dylib}}
{{-- Added shared library dependency arrow_python_shared: /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python.dylib}}
{{-- Parquet C++ ABI version: 15}}
{{-- Parquet C++ SO version: 15}}
{{...}}

I don't know if that is at all useful, or if it just adds to the 
confusion. Should there be an equivalent "Arrow C++ ABI version" and "Arrow C++ 
SO version"? Is that what you meant? 

Very strange indeed.

> [Python] libarrow_python..dylib does not exist
> --
>
> Key: ARROW-6766
> URL: https://issues.apache.org/jira/browse/ARROW-6766
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.15.0
>Reporter: Tarek Allam
>Priority: Major
>
> After following the instructions found in the developer guides for Python, 
> I was able to build fine by using:
> {{# Assuming immediately prior one has run:}}
> {{# $ git clone g...@github.com:apache/arrow.git}}
> {{# $ conda create -y -n pyarrow-dev -c conda-forge}}
> {{#   --file arrow/ci/conda_env_unix.yml}}
> {{#   --file arrow/ci/conda_env_cpp.yml}}
> {{#   --file arrow/ci/conda_env_python.yml}}
> {{#   compilers}}
> {{#   python=3.7}}
> {{# $ conda activate pyarrow-dev}}
> {{# $ brew update && brew bundle --file=arrow/cpp/Brewfile}}
> {{export ARROW_HOME=$(pwd)/arrow/dist}}
> {{export LD_LIBRARY_PATH=$(pwd)/arrow/dist/lib:$LD_LIBRARY_PATH}}
> {{export CC=`which clang`}}
> {{export CXX=`which clang++`}}
> {{mkdir arrow/cpp/build}}
> {{pushd arrow/cpp/build}}
> {{cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \}}
> {{      -DCMAKE_INSTALL_LIBDIR=lib \}}
> {{      -DARROW_FLIGHT=OFF \}}
> {{      -DARROW_GANDIVA=OFF \}}
> {{      -DARROW_ORC=ON \}}
> {{      -DARROW_PARQUET=ON \}}
> {{      -DARROW_PYTHON=ON \}}
> {{      -DARROW_PLASMA=ON \}}
> {{      -DARROW_BUILD_TESTS=ON \}}
> {{      ..}}
> {{make -j4}}
> {{make install}}
> {{popd}}
> But when I run:
> {{pushd arrow/python}}
> {{export PYARROW_WITH_FLIGHT=0}}
> {{export PYARROW_WITH_GANDIVA=0}}
> {{export PYARROW_WITH_ORC=1}}
> {{export PYARROW_WITH_PARQUET=1}}
> {{python setup.py build_ext --inplace}}
> {{popd}}
> I get the following errors:
> {{-- Build output directory: /Users/tallamjr/Github/arrow/python/build/temp.macosx-10.9-x86_64-3.7/release}}
> {{-- Found the Arrow core library: /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow.dylib}}
> {{-- Found the Arrow Python library: /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python.dylib}}
> {{CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow..dylib does not exist.}}
> {{...}}
> {{CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow..dylib does not exist.}}
> {{CMake Error at CMakeLists.txt:230 (configure_file):}}
> {{  configure_file Problem configuring file}}
> {{Call Stack (most recent call first):}}
> {{  CMakeLists.txt:315 (bundle_arrow_lib)}}
> {{CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python..dylib does not exist.}}
> {{CMake Error at CMakeLists.txt:226 (configure_file):}}
> {{  configure_file Problem configuring file}}
> {{Call Stack (most recent call first):}}
> {{  CMakeLists.txt:320 (bundle_arrow_lib)}}
> {{CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python..dylib does not exist.}}
> {{CMake Error at CMakeLists.txt:230 (configure_file):}}
> {{  configure_file Problem configuring file}}
> {{Call Stack (most recent call first):}}
> {{  CMakeLists.txt:320 (bundle_arrow_lib)}}
>  
> What is quite strange is that the libraries do seem to be there, but they
> have an additional component in the name, such as `libarrow.15.dylib`, e.g.:
> {{$ ls -l libarrow_python.15.dylib && echo $PWD}}
> {{lrwxr-xr-x 1 tallamjr staff 28 Oct 2 14:02 libarrow_python.15.dylib -> libarrow_python.15.0.0.dylib}}
> {{/Users/tallamjr/github/arrow/dist/lib}}
> I am not exactly sure what the issue is here, but could it be that
> the version is not captured in a variable that is used by CMake? I have run
> the same setup on `master` (`7d18c1c`) and on `apache-arrow-0.14.0` (`a591d76`),
> which both seem to produce the same errors.
> Apologies if this is not quite the format for JIRA issues here, or if
> it's not the correct platform for this. I'm very new to the project and
> to contributing to Apache in general. Thanks
>  
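
The doubled dot in `libarrow..dylib` is what one would see if a versioned file name were assembled from an empty version string, which matches the hypothesis above. The Python sketch below only illustrates that string assembly; the functions and their parameters are hypothetical and are not taken from the Arrow CMake files.

```python
def versioned_dylib_name(lib, so_version):
    """Build a macOS-style versioned library name, e.g. libarrow.15.dylib."""
    return f"lib{lib}.{so_version}.dylib"


def checked_dylib_name(lib, so_version):
    """Same as above, but refuse an empty version instead of emitting '..'."""
    if not so_version:
        raise ValueError(f"SO version for {lib!r} is empty")
    return versioned_dylib_name(lib, so_version)


print(versioned_dylib_name("arrow_python", "15"))  # libarrow_python.15.dylib
print(versioned_dylib_name("arrow_python", ""))    # libarrow_python..dylib
```

A guard like `checked_dylib_name` would turn the confusing "file does not exist" message into an error that names the missing version directly.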





[jira] [Resolved] (ARROW-6760) [C++] JSON: improve error message when column changed type

2019-10-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-6760.
-
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5571
[https://github.com/apache/arrow/pull/5571]

> [C++] JSON: improve error message when column changed type
> --
>
> Key: ARROW-6760
> URL: https://issues.apache.org/jira/browse/ARROW-6760
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: harikrishnan
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
> Attachments: dummy.jl
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When a column accidentally changes type in a JSON file (which is not 
> supported), it would be nice to get the column name that gives this problem 
> in the error message.
> ---
> I am trying to parse a simple JSON file. While doing so, I am getting the error 
>  {{JSON parse error: A column changed from string to number}}
> {code}
> from pyarrow import json
> r = json.read_json('dummy.jl')
> {code}
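
Until the error message names the column, the offending field can be located with a short scan of the file. The sketch below uses only the standard library and assumes a JSON Lines input with one flat object per line; the function name is hypothetical, and the type check is a simplification of what the Arrow reader infers.

```python
import io
import json


def find_type_changes(lines):
    """Yield (line_number, column, first_type, new_type) for each column
    whose JSON value type differs from the first non-null type seen."""
    seen = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        for column, value in json.loads(line).items():
            if value is None:          # nulls are compatible with any type
                continue
            kind = type(value).__name__
            if kind in ("int", "float"):
                kind = "number"        # JSON has a single number type
            first = seen.setdefault(column, kind)
            if first != kind:
                yield lineno, column, first, kind


sample = io.StringIO('{"id": "a1", "n": 1}\n{"id": 7, "n": 2.5}\n')
for lineno, column, first, kind in find_type_changes(sample):
    print(f"line {lineno}: column {column!r} changed from {first} to {kind}")
```

For the file in the report, the same function could be called on an open handle to `dummy.jl`.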
>  





[jira] [Commented] (ARROW-6791) Memory Leak

2019-10-04 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944610#comment-16944610
 ] 

Micah Kornfield commented on ARROW-6791:


This sounds like it might be https://issues.apache.org/jira/browse/ARROW-6060 
which is fixed in 0.15.0 (currently being released).

> Memory Leak 
> 
>
> Key: ARROW-6791
> URL: https://issues.apache.org/jira/browse/ARROW-6791
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1
> Environment: Ubuntu 18.04, 32GB ram, conda-forge installation
>Reporter: George Prichard
>Priority: Major
>
> Memory leak with large string columns crashes the program. This only seems to 
> affect 0.14.x  - it works fine for me in 0.13.0. It might be related to 
> earlier similar issues? e.g. [https://github.com/apache/arrow/issues/2624]
> Below is a reprex which works in earlier versions, but crashes on read 
> (writing is fine) in this one. The real-life version of the data is full of 
> URLs as the strings. 
> Weirdly it crashes my 32GB Ubuntu 18.04, but runs (if very slowly for the 
> read) on my 16GB Macbook. 
> Thanks so much for the excellent tools! 
>  
>  
> {code:java}
> import pandas as pd
> n_rows = int(1e6)
> n_cols = 10
> col_length = 100
> df = pd.DataFrame()
> for i in range(n_cols):
>     df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)
> print('Generated df', df.shape)
> filename = 'tmp.parquet'
> print('Writing parquet')
> df.to_parquet(filename)
> print('Reading parquet')
> pd.read_parquet(filename)
> {code}





[jira] [Created] (ARROW-6792) [R] Explore roxygen2 R6 class documentation

2019-10-04 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6792:
--

 Summary: [R] Explore roxygen2 R6 class documentation
 Key: ARROW-6792
 URL: https://issues.apache.org/jira/browse/ARROW-6792
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


roxygen2 version 7.0 adds support for documenting R6 classes, rather than the 
ad hoc approach we've had to take without it: 
[https://github.com/r-lib/roxygen2/blob/master/vignettes/rd.Rmd#L203]

Try it out and see how we like it, and consider refactoring the docs to use it 
everywhere.





[jira] [Commented] (ARROW-6437) [R] Add AWS SDK to system dependencies for macOS and Windows

2019-10-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944640#comment-16944640
 ] 

Neal Richardson commented on ARROW-6437:


Following up: it appears that aws-sdk-cpp doesn't work on mingw. See 
[https://github.com/r-windows/rtools-packages/pull/37]

So S3 support will have to be flagged based on whether the C++ library is built 
with ARROW_S3=ON, and that won't be on for Windows.

I'll try to update the homebrew/autobrew formula though so we can have macOS 
support.

> [R] Add AWS SDK to system dependencies for macOS and Windows
> 
>
> Key: ARROW-6437
> URL: https://issues.apache.org/jira/browse/ARROW-6437
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> The Arrow C++ library now has an S3 filesystem implementation (ARROW-453), 
> and in order to take advantage of that from R, we need to add the 
> {{aws-sdk-cpp}} dependency to the macOS and Windows toolchains. 
> There is no PKGBUILD for this at https://github.com/msys2/MINGW-packages, but 
> https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=aws-sdk-cpp-git has 
> one. 
> For macOS, there already is a formula at 
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/aws-sdk-cpp.rb, 
> maybe that's sufficient?
> Once that is in place, we can enable {{ARROW_S3=ON}} in cmake and build with 
> it 
> (https://github.com/apache/arrow/pull/5167/files#diff-b048bf4c1679dce1028fd897a7c43b93R177)
> cc [~jeroenooms]





[jira] [Updated] (ARROW-6437) [R] Add AWS SDK to system dependencies for macOS and Windows

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6437:
--
Labels: pull-request-available  (was: )

> [R] Add AWS SDK to system dependencies for macOS and Windows
> 
>
> Key: ARROW-6437
> URL: https://issues.apache.org/jira/browse/ARROW-6437
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The Arrow C++ library now has an S3 filesystem implementation (ARROW-453), 
> and in order to take advantage of that from R, we need to add the 
> {{aws-sdk-cpp}} dependency to the macOS and Windows toolchains. 
> There is no PKGBUILD for this at https://github.com/msys2/MINGW-packages, but 
> https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=aws-sdk-cpp-git has 
> one. 
> For macOS, there already is a formula at 
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/aws-sdk-cpp.rb, 
> maybe that's sufficient?
> Once that is in place, we can enable {{ARROW_S3=ON}} in cmake and build with 
> it 
> (https://github.com/apache/arrow/pull/5167/files#diff-b048bf4c1679dce1028fd897a7c43b93R177)
> cc [~jeroenooms]





[jira] [Commented] (ARROW-6791) Memory Leak

2019-10-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944616#comment-16944616
 ] 

Neal Richardson commented on ARROW-6791:


Duplicate of ARROW-5086?

> Memory Leak 
> 
>
> Key: ARROW-6791
> URL: https://issues.apache.org/jira/browse/ARROW-6791
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1
> Environment: Ubuntu 18.04, 32GB ram, conda-forge installation
>Reporter: George Prichard
>Priority: Major
>
> Memory leak with large string columns crashes the program. This only seems to 
> affect 0.14.x  - it works fine for me in 0.13.0. It might be related to 
> earlier similar issues? e.g. [https://github.com/apache/arrow/issues/2624]
> Below is a reprex which works in earlier versions, but crashes on read 
> (writing is fine) in this one. The real-life version of the data is full of 
> URLs as the strings. 
> Weirdly it crashes my 32GB Ubuntu 18.04, but runs (if very slowly for the 
> read) on my 16GB Macbook. 
> Thanks so much for the excellent tools! 
>  
>  
> {code:java}
> import pandas as pd
> n_rows = int(1e6)
> n_cols = 10
> col_length = 100
> df = pd.DataFrame()
> for i in range(n_cols):
>     df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)
> print('Generated df', df.shape)
> filename = 'tmp.parquet'
> print('Writing parquet')
> df.to_parquet(filename)
> print('Reading parquet')
> pd.read_parquet(filename)
> {code}





[jira] [Assigned] (ARROW-6778) [C++] Support DurationType in Cast kernel

2019-10-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-6778:


Assignee: Joris Van den Bossche

> [C++] Support DurationType in Cast kernel
> -
>
> Key: ARROW-6778
> URL: https://issues.apache.org/jira/browse/ARROW-6778
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, duration is not yet supported in basic cast operations (using the 
> Python binding from ARROW-5855, currently from my branch, not yet merged):
> {code}
> In [25]: arr = pa.array([1, 2])
> In [26]: arr.cast(pa.duration('s'))  
> ...
> ArrowNotImplementedError: No cast implemented from int64 to duration[s]
> In [27]: arr = pa.array([1, 2], pa.duration('s'))  
> In [28]: arr.cast(pa.duration('ms'))
> ...
> ArrowNotImplementedError: No cast implemented from duration[s] to duration[ms]
> {code}





[jira] [Resolved] (ARROW-6437) [R] Add AWS SDK to system dependencies for macOS and Windows

2019-10-04 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6437.
---
Resolution: Fixed

Issue resolved by pull request 5579
[https://github.com/apache/arrow/pull/5579]

> [R] Add AWS SDK to system dependencies for macOS and Windows
> 
>
> Key: ARROW-6437
> URL: https://issues.apache.org/jira/browse/ARROW-6437
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The Arrow C++ library now has an S3 filesystem implementation (ARROW-453), 
> and in order to take advantage of that from R, we need to add the 
> {{aws-sdk-cpp}} dependency to the macOS and Windows toolchains. 
> There is no PKGBUILD for this at https://github.com/msys2/MINGW-packages, but 
> https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=aws-sdk-cpp-git has 
> one. 
> For macOS, there already is a formula at 
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/aws-sdk-cpp.rb, 
> maybe that's sufficient?
> Once that is in place, we can enable {{ARROW_S3=ON}} in cmake and build with 
> it 
> (https://github.com/apache/arrow/pull/5167/files#diff-b048bf4c1679dce1028fd897a7c43b93R177)
> cc [~jeroenooms]





[jira] [Closed] (ARROW-6596) [R] Getting "Cannot call io___MemoryMappedFile__Open()" error while reading a parquet file

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-6596.
--
  Assignee: Wes McKinney
Resolution: Information Provided

> [R] Getting "Cannot call io___MemoryMappedFile__Open()" error while reading a 
> parquet file
> --
>
> Key: ARROW-6596
> URL: https://issues.apache.org/jira/browse/ARROW-6596
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.14.1
> Environment: ubuntu 18.04
>Reporter: Addhyan
>Assignee: Wes McKinney
>Priority: Major
>  Labels: Docker, R, arrow, parquet
>
> I am using r/Dockerfile to install all the R dependencies and working 
> backwards from it to get arrow/r working on Linux (either Ubuntu or Debian), 
> but it continuously gives me this error:
> Error in io___MemoryMappedFile__Open(fs::path_abs(path), mode) : 
>   Cannot call io___MemoryMappedFile__Open()
> I have installed all the required C++ libraries as mentioned here: 
> [https://arrow.apache.org/install/] under "Ubuntu 18.04 LTS or later". I 
> have also tried to use 
> [cpp/Dockerfile|https://github.com/apache/arrow/blob/master/cpp/Dockerfile] 
> and then worked backwards from it, without any luck. The error is consistent 
> and doesn't go away. 
> I am trying to build a Docker image from a Dockerfile containing everything 
> that arrow needs, all the C++ libraries etc. 





[jira] [Commented] (ARROW-6439) [R] Implement S3 file-system interface in R

2019-10-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944737#comment-16944737
 ] 

Neal Richardson commented on ARROW-6439:


This is ready to start now.

> [R] Implement S3 file-system interface in R
> ---
>
> Key: ARROW-6439
> URL: https://issues.apache.org/jira/browse/ARROW-6439
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (ARROW-6681) [C#] Record Batches in reverse order?

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6681:
---
Summary: [C#] Record Batches in reverse order?  (was: [C# -> R] - Record 
Batches in reverse order?)

> [C#] Record Batches in reverse order?
> -
>
> Key: ARROW-6681
> URL: https://issues.apache.org/jira/browse/ARROW-6681
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Are 'RecordBatches' in C# being written in reverse order?
> I made a simple test which creates a single row per record batch of 0 to 99 
> and attempted to read this in R. To my surprise, batch(0) in R had the value 
> 99, not 0.
> This may not seem like a big deal; however, when dealing with 'huge' files, 
> it's more efficient to use Record Batches / index lookup than attempting to 
> load the entire file into memory.
> Having the order consistent across the different languages / APIs only seems 
> to make sense. For now I can work around this by reversing the order before 
> writing.
>  
> https://github.com/apache/arrow/issues/5475
>  





[jira] [Updated] (ARROW-6681) [C#] Record Batches in reverse order?

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6681:
---
Component/s: (was: R)

> [C#] Record Batches in reverse order?
> -
>
> Key: ARROW-6681
> URL: https://issues.apache.org/jira/browse/ARROW-6681
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>





[jira] [Created] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-04 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6793:
--

 Summary: [R] Arrow C++ binary packaging for Linux
 Key: ARROW-6793
 URL: https://issues.apache.org/jira/browse/ARROW-6793
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Our current installation experience on Linux isn't ideal. Unless you've already 
installed the Arrow C++ library, when you install the R package, you get a 
shell that tells you to install the C++ library. That was a useful approach to 
allow us to get the package on CRAN, which makes it easy for macOS and Windows 
users to install, but it doesn't improve the installation experience for Linux 
users. This is an impediment to adoption of arrow not only by users but also by 
package maintainers who might want to depend on arrow. 

macOS and Windows have a better experience because at installation time, the 
configure scripts download and statically link a prebuilt C++ library. CRAN 
bundles the whole thing up and delivers that as a binary R package. 

Python wheels do a similar thing: they're binaries that contain all external 
dependencies. And there are pyarrow wheels for Linux. This suggests that we 
could do something similar for R: build a generic Linux binary of the C++ 
library and download it in the R package configure script at install time.

I experimented with using the Arrow C++ binaries included in the Python wheels 
in R. See discussion at the end of ARROW-5956. This worked on macOS (not useful 
for R, but it proved the concept) and almost worked on Linux, but it turned out 
that the "manylinux2010" standard is too archaic to work with contemporary 
Rcpp. 

Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
just with slightly more modern compiler/settings. Publish that C++ binary 
package to bintray. Then download it in the R configure script if a 
local/system package isn't found.

Once we have a basic version working, test against various distros on 
[R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
and/or ensure the current fallback behavior when we encounter a distro that 
this doesn't work for. If necessary, we can make multiple flavors of this C++ 
binary for debian, centos, etc.
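
The proposed lookup order can be sketched as a small decision function. It is
written in Python purely for illustration (the real logic would live in the R
package's configure script), and every name in it, including the distro list,
is hypothetical.

```python
# Hypothetical set of distros for which a prebuilt C++ binary is published.
PREBUILT_DISTROS = {"debian", "centos"}


def choose_install_strategy(system_lib_found, distro):
    """Pick how the R package obtains the Arrow C++ library at install time."""
    if system_lib_found:
        # A local/system package always wins over a download.
        return "use-system-library"
    if distro in PREBUILT_DISTROS:
        return "download-prebuilt-binary"
    # Current behaviour: install without the C++ library and tell the user
    # to install it themselves.
    return "fallback-message"


print(choose_install_strategy(False, "centos"))  # download-prebuilt-binary
print(choose_install_strategy(False, "alpine"))  # fallback-message
```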





[jira] [Assigned] (ARROW-6439) [R] Implement S3 file-system interface in R

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6439:
--

Assignee: Romain Francois

> [R] Implement S3 file-system interface in R
> ---
>
> Key: ARROW-6439
> URL: https://issues.apache.org/jira/browse/ARROW-6439
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (ARROW-6437) [R] Add AWS SDK to system dependencies for macOS

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6437:
---
Summary: [R] Add AWS SDK to system dependencies for macOS  (was: [R] Add 
AWS SDK to system dependencies for macOS and Windows)

> [R] Add AWS SDK to system dependencies for macOS
> 
>
> Key: ARROW-6437
> URL: https://issues.apache.org/jira/browse/ARROW-6437
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>





[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-04 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944786#comment-16944786
 ] 

Wes McKinney commented on ARROW-6793:
-

+1. We should be able to take the manylinux2010 base image and tweak the 
CXXFLAGS to suit R's requirements. 

Note that we may have to generate two different libraries, one for the 
pre-gcc5 ABI and one for the post-gcc5 ABI. I think manylinux2010 uses the 
pre-gcc5 ABI in the interest of broad-spectrum compatibility. The R build may 
need to detect which ABI the active configuration needs. Not sure how easy 
that will be.

> [R] Arrow C++ binary packaging for Linux
> 
>
> Key: ARROW-6793
> URL: https://issues.apache.org/jira/browse/ARROW-6793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>





[jira] [Created] (ARROW-6795) [C#] Reading large Arrow files in C# results in an exception

2019-10-04 Thread Eric Erhardt (Jira)
Eric Erhardt created ARROW-6795:
---

 Summary: [C#] Reading large Arrow files in C# results in an 
exception
 Key: ARROW-6795
 URL: https://issues.apache.org/jira/browse/ARROW-6795
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Reporter: Eric Erhardt


If you try to read a large Arrow file (2GB+) using the C# reader, you get an 
exception because it casts the file position (a 64-bit long) to a 32-bit 
integer, which cannot hold positions that far into the file.

 

See [https://github.com/apache/arrow/pull/5412]
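
To make the failure mode concrete, here is a small Python sketch of narrowing
a 64-bit file position to 32 bits. It is not the C# reader's code:
to_int32_checked is a hypothetical helper that behaves like a checked
narrowing conversion, and the 3 GiB offset is an arbitrary example value.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647, just under 2 GiB


def to_int32_checked(position):
    """Narrow a file position to 32 bits, raising instead of wrapping."""
    if not -(2**31) <= position <= INT32_MAX:
        raise OverflowError(f"position {position} does not fit in 32 bits")
    return position


position = 3 * 1024**3  # a hypothetical offset 3 GiB into a file

# An unchecked narrowing silently wraps around to a negative offset.
wrapped = ctypes.c_int32(position).value
print(wrapped)  # -1073741824

# A checked narrowing fails loudly instead.
try:
    to_int32_checked(position)
except OverflowError as exc:
    print("overflow:", exc)
```

Keeping positions as 64-bit values end to end avoids both outcomes.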





[jira] [Commented] (ARROW-6766) [Python] libarrow_python..dylib does not exist

2019-10-04 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944781#comment-16944781
 ] 

Wes McKinney commented on ARROW-6766:
-

I don't develop on macOS, but maybe [~uwe] or [~npr] can show what a 
successful build log on macOS should look like. You can also look at our CI 
logs to see where there is a breakdown.

> [Python] libarrow_python..dylib does not exist
> --
>
> Key: ARROW-6766
> URL: https://issues.apache.org/jira/browse/ARROW-6766
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.15.0
>Reporter: Tarek Allam
>Priority: Major
>
> After following the instructions found on the developer guides for Python, 
> I was able to build fine by using:
> {code}
> # Assuming immediately prior one has run:
> # $ git clone g...@github.com:apache/arrow.git
> # $ conda create -y -n pyarrow-dev -c conda-forge
> #     --file arrow/ci/conda_env_unix.yml
> #     --file arrow/ci/conda_env_cpp.yml
> #     --file arrow/ci/conda_env_python.yml
> #     compilers
> #     python=3.7
> # $ conda activate pyarrow-dev
> # $ brew update && brew bundle --file=arrow/cpp/Brewfile
> export ARROW_HOME=$(pwd)/arrow/dist
> export LD_LIBRARY_PATH=$(pwd)/arrow/dist/lib:$LD_LIBRARY_PATH
> export CC=`which clang`
> export CXX=`which clang++`
> mkdir arrow/cpp/build
> pushd arrow/cpp/build
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>       -DCMAKE_INSTALL_LIBDIR=lib \
>       -DARROW_FLIGHT=OFF \
>       -DARROW_GANDIVA=OFF \
>       -DARROW_ORC=ON \
>       -DARROW_PARQUET=ON \
>       -DARROW_PYTHON=ON \
>       -DARROW_PLASMA=ON \
>       -DARROW_BUILD_TESTS=ON \
>       ..
> make -j4
> make install
> popd
> {code}
> But when I run:
> {code}
> pushd arrow/python
> export PYARROW_WITH_FLIGHT=0
> export PYARROW_WITH_GANDIVA=0
> export PYARROW_WITH_ORC=1
> export PYARROW_WITH_PARQUET=1
> python setup.py build_ext --inplace
> popd
> {code}
> I get the following errors:
> {code}
> -- Build output directory: /Users/tallamjr/Github/arrow/python/build/temp.macosx-10.9-x86_64-3.7/release
> -- Found the Arrow core library: /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow.dylib
> -- Found the Arrow Python library: /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python.dylib
> CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow..dylib does not exist.
> ...
> CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow..dylib does not exist.
> CMake Error at CMakeLists.txt:230 (configure_file):
>   configure_file Problem configuring file
> Call Stack (most recent call first):
>   CMakeLists.txt:315 (bundle_arrow_lib)
> CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python..dylib does not exist.
> CMake Error at CMakeLists.txt:226 (configure_file):
>   configure_file Problem configuring file
> Call Stack (most recent call first):
>   CMakeLists.txt:320 (bundle_arrow_lib)
> CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python..dylib does not exist.
> CMake Error at CMakeLists.txt:230 (configure_file):
>   configure_file Problem configuring file
> Call Stack (most recent call first):
>   CMakeLists.txt:320 (bundle_arrow_lib)
> {code}
> What is quite strange is that the libraries do seem to be there, but they 
> have an additional component in the name, such as `libarrow.15.dylib`, e.g.:
> {code}
> $ ls -l libarrow_python.15.dylib && echo $PWD
> lrwxr-xr-x 1 tallamjr staff 28 Oct 2 14:02 libarrow_python.15.dylib -> libarrow_python.15.0.0.dylib
> /Users/tallamjr/github/arrow/dist/lib
> {code}
> I am not exactly sure what the issue here is, but it appears that the 
> version is not captured as a variable that is used by CMake? I have run the 
> same setup on `master` (`7d18c1c`) and on `apache-arrow-0.14.0` (`a591d76`), 
> which both seem to produce the same errors.
> Apologies if this is not quite the format for JIRA issues here or perhaps if 
> it's not the correct platform for this. I'm very new to the project and to 
> contributing to Apache in general. Thanks
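
The doubled dot in `libarrow..dylib` is what an empty version component would
produce if the file name is assembled from a template. The Python sketch below
only illustrates that string behaviour; the function name and the template are
assumptions made for this example and are not taken from pyarrow's
CMakeLists.txt.

```python
def versioned_dylib(name, so_version):
    """Build a macOS library file name of the form lib<name>.<version>.dylib."""
    return f"lib{name}.{so_version}.dylib"


# With the version filled in, the name matches the symlink listed above.
print(versioned_dylib("arrow_python", "15"))  # libarrow_python.15.dylib

# With an empty version, the two dots end up next to each other.
print(versioned_dylib("arrow_python", ""))    # libarrow_python..dylib
```

If that is what happens here, the thing to check would be where the version
value is expected to come from during the pyarrow build.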





[jira] [Created] (ARROW-6794) [Release] dev/release/post-03-website.sh is out of date in a couple of ways

2019-10-04 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6794:
---

 Summary: [Release] dev/release/post-03-website.sh is out of date 
in a couple of ways
 Key: ARROW-6794
 URL: https://issues.apache.org/jira/browse/ARROW-6794
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney
 Fix For: 1.0.0


* Need to add APACHE_ prefix to environment variables
* arrow-site repository is now separate





[jira] [Assigned] (ARROW-6794) [Release] dev/release/post-03-website.sh is out of date in a couple of ways

2019-10-04 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6794:
---

Assignee: Wes McKinney

> [Release] dev/release/post-03-website.sh is out of date in a couple of ways
> ---
>
> Key: ARROW-6794
> URL: https://issues.apache.org/jira/browse/ARROW-6794
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> * Need to add APACHE_ prefix to environment variables
> * arrow-site repository is now separate





[jira] [Commented] (ARROW-6766) [Python] libarrow_python..dylib does not exist

2019-10-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944823#comment-16944823
 ] 

Neal Richardson commented on ARROW-6766:


I don't use conda, but I believe Uwe does, so he's probably better positioned 
to advise.

> [Python] libarrow_python..dylib does not exist
> --
>
> Key: ARROW-6766
> URL: https://issues.apache.org/jira/browse/ARROW-6766
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.15.0
>Reporter: Tarek Allam
>Priority: Major
>


