[jira] [Resolved] (ARROW-8224) [C++] Remove APIs deprecated prior to 0.16.0

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8224.
---------------------------------
Resolution: Fixed

Issue resolved by pull request 6735
[https://github.com/apache/arrow/pull/6735]

> [C++] Remove APIs deprecated prior to 0.16.0
> ---------------------------------------------
>
> Key: ARROW-8224
> URL: https://issues.apache.org/jira/browse/ARROW-8224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5585) [Go] rename arrow.TypeEquals into arrow.TypeEqual

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5585:
----------------------------------
Labels: pull-request-available  (was: )

> [Go] rename arrow.TypeEquals into arrow.TypeEqual
> --------------------------------------------------
>
> Key: ARROW-5585
> URL: https://issues.apache.org/jira/browse/ARROW-5585
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Sebastien Binet
>Priority: Major
>  Labels: pull-request-available
>
> This is to follow Go's stdlib conventions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8249) [Rust] [DataFusion] Make Table and LogicalPlanBuilder APIs more consistent

2020-03-27 Thread Andy Grove (Jira)
Andy Grove created ARROW-8249:
------------------------------

 Summary: [Rust] [DataFusion] Make Table and LogicalPlanBuilder 
APIs more consistent
 Key: ARROW-8249
 URL: https://issues.apache.org/jira/browse/ARROW-8249
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
 Fix For: 1.0.0


We now have two similar APIs, Table and LogicalPlanBuilder. Although they are 
similar, there are some differences, and it would be good to unify them. There 
is also code duplication, and it most likely makes sense for the Table API to 
delegate to the query builder API to build logical plans, as sketched below.
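
As a rough illustration of the direction (toy types below, not the actual 
DataFusion API), the Table implementation could own a builder and forward to 
it so that all plans are built through one code path:

{code}
// Toy model of the proposed delegation.
#[derive(Debug)]
enum LogicalPlan {
    Scan(String),
    Projection(Vec<String>, Box<LogicalPlan>),
}

struct LogicalPlanBuilder {
    plan: LogicalPlan,
}

impl LogicalPlanBuilder {
    fn scan(table: &str) -> Self {
        Self { plan: LogicalPlan::Scan(table.to_string()) }
    }
    fn project(self, columns: Vec<String>) -> Self {
        Self { plan: LogicalPlan::Projection(columns, Box::new(self.plan)) }
    }
    fn build(self) -> LogicalPlan {
        self.plan
    }
}

struct Table {
    builder: LogicalPlanBuilder,
}

impl Table {
    fn scan(name: &str) -> Self {
        Self { builder: LogicalPlanBuilder::scan(name) }
    }
    // Delegates instead of duplicating the plan-building logic.
    fn select_columns(self, names: &[&str]) -> Self {
        let cols = names.iter().map(|s| s.to_string()).collect();
        Self { builder: self.builder.project(cols) }
    }
    fn to_logical_plan(self) -> LogicalPlan {
        self.builder.build()
    }
}

fn main() {
    let plan = Table::scan("t").select_columns(&["a", "b"]).to_logical_plan();
    println!("{:?}", plan); // Projection(["a", "b"], Scan("t"))
}
{code}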



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8246) [C++] Add -Wa,-mbig-obj when compiling with MinGW to avoid linking errors

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8246.
---------------------------------
Resolution: Fixed

Issue resolved by pull request 6743
[https://github.com/apache/arrow/pull/6743]

> [C++] Add -Wa,-mbig-obj when compiling with MinGW to avoid linking errors
> --------------------------------------------------------------------------
>
> Key: ARROW-8246
> URL: https://issues.apache.org/jira/browse/ARROW-8246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> See 
> https://digitalkarabela.com/mingw-w64-how-to-fix-file-too-big-too-many-sections/
> This seems to be the MinGW equivalent of {{/bigobj}} in MSVC



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8222) [C++] Use bcp to make a slim boost for bundled build

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069092#comment-17069092
 ] 

Wes McKinney edited comment on ARROW-8222 at 3/27/20, 10:53 PM:


To collect some anecdotal evidence about problems with the current boost_ep: 
it seems the entire boostorg organization has some kind of rate-limiting issue 
on Bintray, and trying to access e.g. 
https://dl.bintray.com/boostorg/release/1.72.0/source/boost_1_72_0.tar.gz 
yields 403 Forbidden. So all the more reason to host our EP artifact on GitHub 
or some other place within our agency.


was (Author: wesmckinn):
To collect some anecdotal evidence about problems with the current boost_ep: 
it seems the entire boostorg organization has some kind of rate-limiting issue 
on Bintray, and trying to access dl.bintray.com yields 403 Forbidden. So all 
the more reason to host our EP artifact on GitHub or some other place within 
our agency.

> [C++] Use bcp to make a slim boost for bundled build
> ----------------------------------------------------
>
> Key: ARROW-8222
> URL: https://issues.apache.org/jira/browse/ARROW-8222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> We don't use much of Boost (just system, filesystem, and regex), but when we 
> do a bundled build, we still download and extract all of boost. The tarball 
> itself is 113 MB; expanded, it is over 700 MB. This can be slow, and it requires a 
> lot of free disk space that we don't really need.
> [bcp|https://www.boost.org/doc/libs/1_72_0/tools/bcp/doc/html/index.html] is 
> a boost tool that lets you extract a subset of boost, resolving any of its 
> necessary dependencies across boost. The savings for us could be huge:
> {code}
> mkdir test
> ./bcp system.hpp filesystem.hpp regex.hpp test
> tar -czf test.tar.gz test/
> {code}
> The resulting tarball is 885K (kilobytes!). 
> {{bcp}} also lets you re-namespace, so this would (IIUC) solve ARROW-4286 as 
> well.
> We would need a place to host this tarball, and we would have to update it 
> whenever we (1) bump the boost version or (2) add a new boost library 
> dependency. This patch would of course include a script that would generate 
> the tarball. Given the small size, we could also consider just vendoring it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8222) [C++] Use bcp to make a slim boost for bundled build

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069092#comment-17069092
 ] 

Wes McKinney commented on ARROW-8222:
-------------------------------------

To collect some anecdotal evidence about problems with the current boost_ep: 
it seems the entire boostorg organization has some kind of rate-limiting issue 
on Bintray, and trying to access dl.bintray.com yields 403 Forbidden. So all 
the more reason to host our EP artifact on GitHub or some other place within 
our agency.

> [C++] Use bcp to make a slim boost for bundled build
> ----------------------------------------------------
>
> Key: ARROW-8222
> URL: https://issues.apache.org/jira/browse/ARROW-8222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> We don't use much of Boost (just system, filesystem, and regex), but when we 
> do a bundled build, we still download and extract all of boost. The tarball 
> itself is 113 MB; expanded, it is over 700 MB. This can be slow, and it requires a 
> lot of free disk space that we don't really need.
> [bcp|https://www.boost.org/doc/libs/1_72_0/tools/bcp/doc/html/index.html] is 
> a boost tool that lets you extract a subset of boost, resolving any of its 
> necessary dependencies across boost. The savings for us could be huge:
> {code}
> mkdir test
> ./bcp system.hpp filesystem.hpp regex.hpp test
> tar -czf test.tar.gz test/
> {code}
> The resulting tarball is 885K (kilobytes!). 
> {{bcp}} also lets you re-namespace, so this would (IIUC) solve ARROW-4286 as 
> well.
> We would need a place to host this tarball, and we would have to update it 
> whenever we (1) bump the boost version or (2) add a new boost library 
> dependency. This patch would of course include a script that would generate 
> the tarball. Given the small size, we could also consider just vendoring it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8248) [C++] vcpkg build clobbers arrow.lib from shared (.dll) with static (.lib)

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069085#comment-17069085
 ] 

Wes McKinney commented on ARROW-8248:
-------------------------------------

This seems to be something that the vcpkg maintainers did on purpose

https://github.com/microsoft/vcpkg/blob/master/ports/arrow/portfile.cmake#L46

You should probably report the problem to them directly

> [C++] vcpkg build clobbers arrow.lib from shared (.dll) with static (.lib)
> ---------------------------------------------------------------------------
>
> Key: ARROW-8248
> URL: https://issues.apache.org/jira/browse/ARROW-8248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Developer Tools
>Affects Versions: 0.16.0
>Reporter: Scott Wilson
>Priority: Major
>
> After installing Arrow via vcpkg, build the library per the steps below. 
> CMake builds the shared arrow library (.dll) and then the static arrow 
> library (.lib). It overwrites the shared arrow.lib (exports) with the static 
> arrow.lib. This results in multiple link/execution problems when using the vc 
> projects to build the example projects until you realize that shared arrow 
> needs to be rebuilt. (This took me two days.) 
> Also, many of the projects added with the extra -D flags (beyond 
> ARROW_BUILD_TESTS) don't build.
> ***
> "C:\Program Files (x86)\Microsoft Visual 
> Studio\2017\Professional\Common7\Tools\VsDevCmd.bat" -arch=amd64
> cd F:\Dev\vcpkg\buildtrees\arrow\src\row-0.16.0-872c330822\cpp
> mkdir build
> cd build
> cmake .. -G "Visual Studio 15 2017 Win64" -DARROW_BUILD_TESTS=ON 
> -DARROW_BUILD_EXAMPLES=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON 
> -DCMAKE_BUILD_TYPE=Debug
> cmake --build . --config Debug



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8248) [C++] vcpkg build clobbers arrow.lib from shared (.dll) with static (.lib)

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8248:

Summary: [C++] vcpkg build clobbers arrow.lib from shared (.dll) with 
static (.lib)  (was: vcpkg build clobbers arrow.lib from shared (.dll) with 
static (.lib))

> [C++] vcpkg build clobbers arrow.lib from shared (.dll) with static (.lib)
> ---------------------------------------------------------------------------
>
> Key: ARROW-8248
> URL: https://issues.apache.org/jira/browse/ARROW-8248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Developer Tools
>Affects Versions: 0.16.0
>Reporter: Scott Wilson
>Priority: Major
>
> After installing Arrow via vcpkg, build the library per the steps below. 
> CMake builds the shared arrow library (.dll) and then the static arrow 
> library (.lib). It overwrites the shared arrow.lib (exports) with the static 
> arrow.lib. This results in multiple link/execution problems when using the vc 
> projects to build the example projects until you realize that shared arrow 
> needs to be rebuilt. (This took me two days.) 
> Also, many of the projects added with the extra -D flags (beyond 
> ARROW_BUILD_TESTS) don't build.
> ***
> "C:\Program Files (x86)\Microsoft Visual 
> Studio\2017\Professional\Common7\Tools\VsDevCmd.bat" -arch=amd64
> cd F:\Dev\vcpkg\buildtrees\arrow\src\row-0.16.0-872c330822\cpp
> mkdir build
> cd build
> cmake .. -G "Visual Studio 15 2017 Win64" -DARROW_BUILD_TESTS=ON 
> -DARROW_BUILD_EXAMPLES=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON 
> -DCMAKE_BUILD_TYPE=Debug
> cmake --build . --config Debug



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8248) vcpkg build clobbers arrow.lib from shared (.dll) with static (.lib)

2020-03-27 Thread Scott Wilson (Jira)
Scott Wilson created ARROW-8248:
--------------------------------

 Summary: vcpkg build clobbers arrow.lib from shared (.dll) with 
static (.lib)
 Key: ARROW-8248
 URL: https://issues.apache.org/jira/browse/ARROW-8248
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Developer Tools
Affects Versions: 0.16.0
Reporter: Scott Wilson


After installing Arrow via vcpkg, build the library per the steps below. CMake 
builds the shared arrow library (.dll) and then the static arrow library 
(.lib). It overwrites the shared arrow.lib (exports) with the static arrow.lib. 
This results in multiple link/execution problems when using the vc projects to 
build the example projects until you realize that shared arrow needs to be 
rebuilt. (This took me two days.) 

Also, many of the projects added with the extra -D flags (beyond 
ARROW_BUILD_TESTS) don't build.

***

"C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Professional\Common7\Tools\VsDevCmd.bat" -arch=amd64

cd F:\Dev\vcpkg\buildtrees\arrow\src\row-0.16.0-872c330822\cpp

mkdir build

cd build

cmake .. -G "Visual Studio 15 2017 Win64" -DARROW_BUILD_TESTS=ON 
-DARROW_BUILD_EXAMPLES=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON 
-DCMAKE_BUILD_TYPE=Debug

cmake --build . --config Debug



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8241) [Rust] Add convenience methods to Schema

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-8241.
-------------------------------
Resolution: Fixed

Issue resolved by pull request 6740
[https://github.com/apache/arrow/pull/6740]

> [Rust] Add convenience methods to Schema
> ----------------------------------------
>
> Key: ARROW-8241
> URL: https://issues.apache.org/jira/browse/ARROW-8241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I would like to add the following methods to Schema to make it easier to work 
> with.
>  
> {code:java}
> pub fn field_with_name(&self, name: &str) -> Result<&Field>;
> pub fn index_of(&self, name: &str) -> Result<usize>;
> {code}
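> A minimal usage sketch, assuming those signatures:
> {code:java}
> let schema = Schema::new(vec![Field::new("a", DataType::Int32, false)]);
> let field = schema.field_with_name("a").unwrap();
> let index = schema.index_of("a").unwrap();
> {code}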



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7792) [R] read_feather does not close connection to file

2020-03-27 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069051#comment-17069051
 ] 

Neal Richardson commented on ARROW-7792:


Yes I've been waiting for your patch to land before tackling this.

> [R] read_feather does not close connection to file
> ---------------------------------------------------
>
> Key: ARROW-7792
> URL: https://issues.apache.org/jira/browse/ARROW-7792
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Martin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> x = as.data.frame(seq(1:100))
> pbFilename <- file.path(getwd(), "reproduceBug.feather")
> arrow::write_feather(x = x, sink = pbFilename)
> file.exists(pbFilename)
> file.remove(pbFilename)
> arrow::write_feather(x = x, sink = pbFilename)
> tempDX <- arrow::read_feather(file = pbFilename, as_data_frame = T)
> file.exists(pbFilename)
> file.remove(pbFilename)
> >Warning message:
> >In file.remove(pbFilename) :
> >cannot remove file 
> >'C:/Martin/Repo/ReinforcementLearner/reproduceBug.feather', reason
> >  'Permission denied'
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7792) [R] read_feather does not close connection to file

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069043#comment-17069043
 ] 

Wes McKinney commented on ARROW-7792:
-------------------------------------

[~npr] we should probably revisit this in the context of Feather V2 
(ARROW-5510). Writing now boils down to a single function, 
{{arrow::ipc::feather::WriteTable}}, which can fail. 

In Python we guard this scenario with try/except to make sure that the file 
handle is cleaned up (we got identical bug reports): 
https://github.com/apache/arrow/blob/master/python/pyarrow/feather.py#L179
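
The guard looks roughly like this (a self-contained sketch of the pattern, not 
the exact feather.py code; {{do_write}} stands in for the real writer):

{code:python}
import os

def write_guarded(table, dest, do_write):
    # If writing fails, close the handle and remove the partial file so
    # the OS (notably Windows) releases the lock before we re-raise.
    f = open(dest, "wb")
    try:
        do_write(table, f)
    except Exception:
        f.close()
        try:
            os.remove(dest)
        except OSError:
            pass
        raise
    f.close()
{code}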

> [R] read_feather does not close connection to file
> ---------------------------------------------------
>
> Key: ARROW-7792
> URL: https://issues.apache.org/jira/browse/ARROW-7792
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Martin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> x = as.data.frame(seq(1:100))
> pbFilename <- file.path(getwd(), "reproduceBug.feather")
> arrow::write_feather(x = x, sink = pbFilename)
> file.exists(pbFilename)
> file.remove(pbFilename)
> arrow::write_feather(x = x, sink = pbFilename)
> tempDX <- arrow::read_feather(file = pbFilename, as_data_frame = T)
> file.exists(pbFilename)
> file.remove(pbFilename)
> >Warning message:
> >In file.remove(pbFilename) :
> >cannot remove file 
> >'C:/Martin/Repo/ReinforcementLearner/reproduceBug.feather', reason
> >  'Permission denied'
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7783) [C++] ARROW_DATASET should enable ARROW_COMPUTE

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7783:
----------------------------------
Labels: pull-request-available  (was: )

> [C++] ARROW_DATASET should enable ARROW_COMPUTE
> -----------------------------------------------
>
> Key: ARROW-7783
> URL: https://issues.apache.org/jira/browse/ARROW-7783
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> Currently, passing {{-DARROW_DATASET=ON}} to CMake doesn't enable 
> ARROW_COMPUTE, which leads to linker errors.
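> Such option dependencies can be expressed in the CMake configuration, e.g. 
> (a sketch, not necessarily the exact patch):
> {code}
> if(ARROW_DATASET)
>   # Dataset depends on the compute kernels, so force them on.
>   set(ARROW_COMPUTE ON)
> endif()
> {code}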



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7783) [C++] ARROW_DATASET should enable ARROW_COMPUTE

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7783:
-----------------------------------

Assignee: Wes McKinney  (was: Francois Saint-Jacques)

> [C++] ARROW_DATASET should enable ARROW_COMPUTE
> -----------------------------------------------
>
> Key: ARROW-7783
> URL: https://issues.apache.org/jira/browse/ARROW-7783
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> Currently, passing {{-DARROW_DATASET=ON}} to CMake doesn't enable 
> ARROW_COMPUTE, which leads to linker errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7605) [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7605:

Fix Version/s: 1.0.0  (was: 0.17.0)

> [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a
> -------------------------------------------------------------------
>
> Key: ARROW-7605
> URL: https://issues.apache.org/jira/browse/ARROW-7605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> If ARROW_JEMALLOC=ON, then currently the libarrow.a cannot be used for static 
> linking without also obtaining libjemalloc_pic.a



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7605) [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069039#comment-17069039
 ] 

Wes McKinney commented on ARROW-7605:
-------------------------------------

I haven't forgotten about this. This change is too risky to rush into 0.17.0, 
but I hope to have a patch ready in the near future so that we can make sure 
it is robust to different scenarios after 0.17.0 goes out. If someone wants to 
pick up the project from me, you are welcome to do so.
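
For reference, one generic way to merge static libraries (a sketch of the 
technique, not necessarily what the patch will do) is a GNU ar MRI script:

{code:bash}
ar -M <<'EOF'
create libarrow_bundled.a
addlib libarrow.a
addlib libjemalloc_pic.a
save
end
EOF
{code}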

> [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a
> -------------------------------------------------------------------
>
> Key: ARROW-7605
> URL: https://issues.apache.org/jira/browse/ARROW-7605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> If ARROW_JEMALLOC=ON, then currently the libarrow.a cannot be used for static 
> linking without also obtaining libjemalloc_pic.a



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7605) [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7605:

Summary: [C++] Merge jemalloc and other BUNDLED dependencies into 
libarrow.a  (was: [C++] Merge private je_arrow symbols into produced libarrow.a)

> [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a
> -------------------------------------------------------------------
>
> Key: ARROW-7605
> URL: https://issues.apache.org/jira/browse/ARROW-7605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> If ARROW_JEMALLOC=ON, then currently the libarrow.a cannot be used for static 
> linking without also obtaining libjemalloc_pic.a



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6528) [C++] Spurious Flight test failures (port allocation failure)

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6528.
---------------------------------
  Assignee: David Li
Resolution: Fixed

I'm closing this as Resolved; the bind-to-port-0 changes should help. If this 
occurs again we should reopen and then figure out where a port is failing to 
allocate.
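
For reference, the bind-to-port-0 technique (a plain BSD-sockets sketch, not 
the Flight API) looks like:

{code:cpp}
// Bind to port 0 so the kernel picks an unused port, then read back the
// port that was actually assigned with getsockname().
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(0);  // 0 => OS chooses a free port
  bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  socklen_t len = sizeof(addr);
  getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len);
  std::printf("assigned port: %d\n", ntohs(addr.sin_port));
  close(fd);
  return 0;
}
{code}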

> [C++] Spurious Flight test failures (port allocation failure)
> -------------------------------------------------------------
>
> Key: ARROW-6528
> URL: https://issues.apache.org/jira/browse/ARROW-6528
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: David Li
>Priority: Major
> Fix For: 0.17.0
>
>
> Seems like our port allocation scheme inside unit tests is still not very 
> reliable :-/
> https://ci.ursalabs.org/#/builders/71/builds/4147/steps/8/logs/stdio
> {code}
> [--] 3 tests from TestMetadata
> [ RUN  ] TestMetadata.DoGet
> E0905 12:45:40.322644527   10203 server_chttp2.cc:40]
> {"created":"@1567687540.322612245","description":"No address added out of 
> total 1 
> resolved","file":"../src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1567687540.322609844","description":"Unable
>  to configure 
> socket","fd":7,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1567687540.322602634","description":"Address
>  already in 
> use","errno":98,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address
>  already in use","syscall":"bind"}]}]}
> ../src/arrow/flight/flight_test.cc:429: Failure
> Failed
> 'server->Init(options)' failed with Unknown error: Server did not start 
> properly
> /buildbot/AMD64_Conda_Python_3_7/cpp/build-support/run-test.sh: line 97: 
> 10203 Segmentation fault  (core dumped) $TEST_EXECUTABLE "$@" 2>&1
>  10204 Done| $ROOT/build-support/asan_symbolize.py
>  10205 Done| ${CXXFILT:-c++filt}
>  10206 Done| 
> $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE
>  10207 Done| $pipe_cmd 2>&1
>  10208 Done| tee $LOGFILE
> /buildbot/AMD64_Conda_Python_3_7/cpp/build/src/arrow/flight
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6895.
---------------------------------
Resolution: Fixed

Issue resolved by pull request 6460
[https://github.com/apache/arrow/pull/6460]

> [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader 
> repeats returned values when calling `NextBatch()`
> ----------------------------------------------------------------------------
>
> Key: ARROW-6895
> URL: https://issues.apache.org/jira/browse/ARROW-6895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.0
> Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
>Reporter: Adam Hooper
>Assignee: Adam Hooper
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: 01-fix-arrow-6895.diff, bad.parquet, 
> reset-dictionary-on-read.diff, works.parquet
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Given most columns, I can run a loop like:
> {code:cpp}
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
> while (nRowsRemaining > 0) {
>   int n = std::min(100, nRowsRemaining);
>   std::shared_ptr<arrow::ChunkedArray> chunkedArray;
>   auto status = columnReader->NextBatch(n, &chunkedArray);
>   // ... and then use `chunkedArray`
>   nRowsRemaining -= n;
> }
> {code}
> (The context is: "convert Parquet to CSV/JSON, with small memory footprint." 
> Used in https://github.com/CJWorkbench/parquet-to-arrow)
> Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; 
> the second return value looks like {{val100...val199}}; and so on.
> ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The 
> first {{NextBatch()}} return value looks like {{val0...val100}}; the second 
> return value looks like {{val0...val99, val100...val199}} (ChunkedArray with 
> two arrays); the third return value looks like {{val0...val99, 
> val100...val199, val200...val299}} (ChunkedArray with three arrays); and so 
> on. The returned arrays are never cleared.
> In sum: {{NextBatch()}} on a dictionary column reader returns the wrong 
> values.
> I've attached a minimal Parquet file that presents this problem with the 
> above code; and I've written a patch that fixes this one case, to illustrate 
> where things are wrong. I don't think I understand enough edge cases to 
> decree that my patch is a correct fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6895:
-----------------------------------

Assignee: Adam Hooper

> [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader 
> repeats returned values when calling `NextBatch()`
> ----------------------------------------------------------------------------
>
> Key: ARROW-6895
> URL: https://issues.apache.org/jira/browse/ARROW-6895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.0
> Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
>Reporter: Adam Hooper
>Assignee: Adam Hooper
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: 01-fix-arrow-6895.diff, bad.parquet, 
> reset-dictionary-on-read.diff, works.parquet
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Given most columns, I can run a loop like:
> {code:cpp}
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
> while (nRowsRemaining > 0) {
>   int n = std::min(100, nRowsRemaining);
>   std::shared_ptr<arrow::ChunkedArray> chunkedArray;
>   auto status = columnReader->NextBatch(n, &chunkedArray);
>   // ... and then use `chunkedArray`
>   nRowsRemaining -= n;
> }
> {code}
> (The context is: "convert Parquet to CSV/JSON, with small memory footprint." 
> Used in https://github.com/CJWorkbench/parquet-to-arrow)
> Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; 
> the second return value looks like {{val100...val199}}; and so on.
> ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The 
> first {{NextBatch()}} return value looks like {{val0...val100}}; the second 
> return value looks like {{val0...val99, val100...val199}} (ChunkedArray with 
> two arrays); the third return value looks like {{val0...val99, 
> val100...val199, val200...val299}} (ChunkedArray with three arrays); and so 
> on. The returned arrays are never cleared.
> In sum: {{NextBatch()}} on a dictionary column reader returns the wrong 
> values.
> I've attached a minimal Parquet file that presents this problem with the 
> above code; and I've written a patch that fixes this one case, to illustrate 
> where things are wrong. I don't think I understand enough edge cases to 
> decree that my patch is a correct fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6895:
-----------------------------------

Assignee: (was: Wes McKinney)

> [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader 
> repeats returned values when calling `NextBatch()`
> ----------------------------------------------------------------------------
>
> Key: ARROW-6895
> URL: https://issues.apache.org/jira/browse/ARROW-6895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.0
> Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
>Reporter: Adam Hooper
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: 01-fix-arrow-6895.diff, bad.parquet, 
> reset-dictionary-on-read.diff, works.parquet
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Given most columns, I can run a loop like:
> {code:cpp}
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
> while (nRowsRemaining > 0) {
>   int n = std::min(100, nRowsRemaining);
>   std::shared_ptr<arrow::ChunkedArray> chunkedArray;
>   auto status = columnReader->NextBatch(n, &chunkedArray);
>   // ... and then use `chunkedArray`
>   nRowsRemaining -= n;
> }
> {code}
> (The context is: "convert Parquet to CSV/JSON, with small memory footprint." 
> Used in https://github.com/CJWorkbench/parquet-to-arrow)
> Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; 
> the second return value looks like {{val100...val199}}; and so on.
> ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The 
> first {{NextBatch()}} return value looks like {{val0...val100}}; the second 
> return value looks like {{val0...val99, val100...val199}} (ChunkedArray with 
> two arrays); the third return value looks like {{val0...val99, 
> val100...val199, val200...val299}} (ChunkedArray with three arrays); and so 
> on. The returned arrays are never cleared.
> In sum: {{NextBatch()}} on a dictionary column reader returns the wrong 
> values.
> I've attached a minimal Parquet file that presents this problem with the 
> above code; and I've written a patch that fixes this one case, to illustrate 
> where things are wrong. I don't think I understand enough edge cases to 
> decree that my patch is a correct fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8231) [Rust] Parse key_value_metadata from parquet FileMetaData into arrow schema metadata

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8231.
---------------------------------
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6742
[https://github.com/apache/arrow/pull/6742]

> [Rust] Parse key_value_metadata from parquet FileMetaData into arrow schema 
> metadata
> ----------------------------------------------------------------------------
>
> Key: ARROW-8231
> URL: https://issues.apache.org/jira/browse/ARROW-8231
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The parquet-format FileMetaData struct contains optional key value pairs with 
> additional metadata about the schema:
> [https://docs.rs/parquet-format/2.6.0/src/parquet_format/parquet_format.rs.html#3821]
> When the parquet file was generated using the java avro parquet writer, this 
> for example contains the original avro schema under the `parquet.avro.schema` 
> or `avro.schema` keys.
> It would be nice if this metadata was accessible through the 
> `arrow::datatypes::Schema.metadata` field.
> I'm willing to implement and create a pull request for this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8231) [Rust] Parse key_value_metadata from parquet FileMetaData into arrow schema metadata

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8231:
-----------------------------------

Assignee: Jörn Horstmann

> [Rust] Parse key_value_metadata from parquet FileMetaData into arrow schema 
> metadata
> ----------------------------------------------------------------------------
>
> Key: ARROW-8231
> URL: https://issues.apache.org/jira/browse/ARROW-8231
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The parquet-format FileMetaData struct contains optional key value pairs with 
> additional metadata about the schema:
> [https://docs.rs/parquet-format/2.6.0/src/parquet_format/parquet_format.rs.html#3821]
> When the parquet file was generated using the java avro parquet writer, this 
> for example contains the original avro schema under the `parquet.avro.schema` 
> or `avro.schema` keys.
> It would be nice if this metadata was accessible through the 
> `arrow::datatypes::Schema.metadata` field.
> I'm willing to implement and create a pull request for this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8243) [Rust] [DataFusion] Fix inconsistent API in LogicalPlanBuilder

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-8243.
-------------------------------
Resolution: Fixed

Issue resolved by pull request 6741
[https://github.com/apache/arrow/pull/6741]

> [Rust] [DataFusion] Fix inconsistent API in LogicalPlanBuilder
> --------------------------------------------------------------
>
> Key: ARROW-8243
> URL: https://issues.apache.org/jira/browse/ARROW-8243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The LogicalPlanBuilder {{project}} method takes a {{&Vec<Expr>}} whereas 
> other methods take a {{Vec<Expr>}}. It makes sense to take {{Vec<Expr>}} and 
> take ownership of these inputs since they are being used to build the plan; 
> see the sketch below.
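> i.e., roughly (a sketch of the intended shape, not the final signature):
> {code:java}
> pub fn project(self, expr: Vec<Expr>) -> Result<LogicalPlanBuilder>
> {code}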



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8245) [Python][Parquet] Skip hidden directories when reading partitioned parquet files

2020-03-27 Thread Caleb Overman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069004#comment-17069004
 ] 

Caleb Overman commented on ARROW-8245:
--------------------------------------

We're currently on 0.16.0 and have a patch to ignore directories with a . 
prefix. Happy to do a PR for this - are there any other known prefixes that 
should be ignored?

> [Python][Parquet] Skip hidden directories when reading partitioned parquet 
> files
> ----------------------------------------------------------------------------
>
> Key: ARROW-8245
> URL: https://issues.apache.org/jira/browse/ARROW-8245
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Caleb Overman
>Priority: Minor
>  Labels: parquet
> Fix For: 0.17.0
>
>
> When writing a partitioned parquet file Spark can create a temporary hidden 
> {{.spark-staging}} directory within the parquet file. Because it is a 
> directory and not a file, it is not skipped when trying to read the parquet 
> file. Pyarrow currently only skips directories prefixed with {{_}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8246) [C++] Add -Wa,-mbig-obj when compiling with MinGW to avoid linking errors

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8246:
----------------------------------
Labels: pull-request-available  (was: )

> [C++] Add -Wa,-mbig-obj when compiling with MinGW to avoid linking errors
> --------------------------------------------------------------------------
>
> Key: ARROW-8246
> URL: https://issues.apache.org/jira/browse/ARROW-8246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> See 
> https://digitalkarabela.com/mingw-w64-how-to-fix-file-too-big-too-many-sections/
> This seems to be the MinGW equivalent of {{/bigobj}} in MSVC



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8246) [C++] Add -Wa,-mbig-obj when compiling with MinGW to avoid linking errors

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8246:
-----------------------------------

Assignee: Wes McKinney

> [C++] Add -Wa,-mbig-obj when compiling with MinGW to avoid linking errors
> --------------------------------------------------------------------------
>
> Key: ARROW-8246
> URL: https://issues.apache.org/jira/browse/ARROW-8246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> See 
> https://digitalkarabela.com/mingw-w64-how-to-fix-file-too-big-too-many-sections/
> This seems to be the MinGW equivalent of {{/bigobj}} in MSVC



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8217) [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068983#comment-17068983
 ] 

Wes McKinney commented on ARROW-8217:
-------------------------------------

It sort of suggests that the segfault is originating in the R bindings 
instead of the C++ library, which would be weird, but I suppose it's possible. 

I think the debug build error can be resolved with ARROW-8246

> [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979
> -------------------------------------------------------------------------------
>
> Key: ARROW-8217
> URL: https://issues.apache.org/jira/browse/ARROW-8217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> If we can obtain a gdb backtrace from the failed test in 
> https://github.com/apache/arrow/pull/6638 then we can sort out what's wrong. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8247) [Python] Expose Parquet writing "engine" setting in pyarrow.parquet.write_table

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8247:

Description: This is a follow up to ARROW-7741 so we have a path to the old 
Parquet writer logic in the event that bugs are reported and we need to give 
users a workaround. Eventually this option will be removed once the prior 
writing code is removed  (was: This is a follow up to ARROW-7741 so we have a 
path to the old Parquet writer logic in the event that bugs are reported and we 
need to give users a workaround)

> [Python] Expose Parquet writing "engine" setting in 
> pyarrow.parquet.write_table
> ----------------------------------------------------------------------------
>
> Key: ARROW-8247
> URL: https://issues.apache.org/jira/browse/ARROW-8247
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> This is a follow up to ARROW-7741 so we have a path to the old Parquet writer 
> logic in the event that bugs are reported and we need to give users a 
> workaround. Eventually this option will be removed once the prior writing 
> code is removed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8247) [Python] Expose Parquet writing "engine" setting in pyarrow.parquet.write_table

2020-03-27 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8247:
--------------------------------

 Summary: [Python] Expose Parquet writing "engine" setting in 
pyarrow.parquet.write_table
 Key: ARROW-8247
 URL: https://issues.apache.org/jira/browse/ARROW-8247
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.17.0


This is a follow up to ARROW-7741 so we have a path to the old Parquet writer 
logic in the event that bugs are reported and we need to give users a workaround



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8246) [C++] Add -Wa,-mbig-obj when compiling with MinGW to avoid linking errors

2020-03-27 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8246:
--------------------------------

 Summary: [C++] Add -Wa,-mbig-obj when compiling with MinGW to 
avoid linking errors
 Key: ARROW-8246
 URL: https://issues.apache.org/jira/browse/ARROW-8246
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


See 
https://digitalkarabela.com/mingw-w64-how-to-fix-file-too-big-too-many-sections/

This seems to be the MinGW equivalent of {{/bigobj}} in MSVC
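
A minimal sketch of the CMake side (illustrative, not the exact patch):

{code}
if(MINGW)
  # -Wa,... forwards the flag to the GNU assembler; -mbig-obj selects the
  # big object format (the MinGW analogue of MSVC's /bigobj), avoiding
  # "too many sections" / "file too big" assembler errors.
  add_compile_options("-Wa,-mbig-obj")
endif()
{code}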



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8245) [Python][Parquet] Skip hidden directories when reading partitioned parquet files

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8245:

Fix Version/s: 0.17.0

> [Python][Parquet] Skip hidden directories when reading partitioned parquet 
> files
> ----------------------------------------------------------------------------
>
> Key: ARROW-8245
> URL: https://issues.apache.org/jira/browse/ARROW-8245
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Caleb Overman
>Priority: Minor
>  Labels: parquet
> Fix For: 0.17.0
>
>
> When writing a partitioned parquet file Spark can create a temporary hidden 
> {{.spark-staging}} directory within the parquet file. Because it is a 
> directory and not a file, it is not skipped when trying to read the parquet 
> file. Pyarrow currently only skips directories prefixed with {{_}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8245) [Python][Parquet] Skip hidden directories when reading partitioned parquet files

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068974#comment-17068974
 ] 

Wes McKinney commented on ARROW-8245:
-------------------------------------

Ah I see that the issue is that this exclusion is only applied to file paths. 
See

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L927

This should be easy to fix; a sketch follows below. It will also need to be 
handled in the C++ Datasets API. cc [~jorisvandenbossche]
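
A sketch of the kind of check needed (illustrative, not the exact parquet.py 
helper):

{code:python}
import posixpath

def is_private_directory(dirpath):
    # Skip hidden/staging directories such as ".spark-staging" as well as
    # underscore-prefixed ones like "_temporary".
    tail = posixpath.basename(dirpath.rstrip("/"))
    return tail.startswith(".") or tail.startswith("_")
{code}

Partition directories like {{key=value}} don't start with either prefix, so 
they are unaffected.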

> [Python][Parquet] Skip hidden directories when reading partitioned parquet 
> files
> ----------------------------------------------------------------------------
>
> Key: ARROW-8245
> URL: https://issues.apache.org/jira/browse/ARROW-8245
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Caleb Overman
>Priority: Minor
>  Labels: parquet
>
> When writing a partitioned parquet file Spark can create a temporary hidden 
> {{.spark-staging}} directory within the parquet file. Because it is a 
> directory and not a file, it is not skipped when trying to read the parquet 
> file. Pyarrow currently only skips directories prefixed with {{_}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8245) [Python][Parquet] Skip hidden directories when reading partitioned parquet files

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068973#comment-17068973
 ] 

Wes McKinney commented on ARROW-8245:
-------------------------------------

What version of the library are you using?

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L877

> [Python][Parquet] Skip hidden directories when reading partitioned parquet 
> files
> ----------------------------------------------------------------------------
>
> Key: ARROW-8245
> URL: https://issues.apache.org/jira/browse/ARROW-8245
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Caleb Overman
>Priority: Minor
>  Labels: parquet
>
> When writing a partitioned parquet file Spark can create a temporary hidden 
> {{.spark-staging}} directory within the parquet file. Because it is a 
> directory and not a file, it is not skipped when trying to read the parquet 
> file. Pyarrow currently only skips directories prefixed with {{_}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068969#comment-17068969
 ] 

Wes McKinney commented on ARROW-3329:
-------------------------------------

You need to clean temporary files out of the python/ directory with {{git clean 
-fdx python}}. This should be added to the documentation.

> [Python] Error casting decimal(38, 4) to int64
> -----------------------------------------------
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> GitHub issue link: https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> Is it not implemented yet, or am I not using it correctly? If it is not 
> implemented yet, is there any workaround to cast columns?
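> One workaround sketch (assuming truncation toward zero is acceptable) is to 
> round-trip through Python objects instead of {{cast()}}:
> {code:python}
> import decimal
> import pyarrow as pa
>
> arr = pa.array([decimal.Decimal("12.3456")], type=pa.decimal128(38, 4))
> ints = pa.array([int(x) for x in arr.to_pylist()], type=pa.int64())
> print(ints)  # values: [12]
> {code}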



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8070) [C++] Cast segfaults on unsupported cast from list to utf8

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8070.
---------------------------------
Resolution: Fixed

Issue resolved by pull request 6738
[https://github.com/apache/arrow/pull/6738]

> [C++] Cast segfaults on unsupported cast from list to utf8
> -----------------------------------------------------------
>
> Key: ARROW-8070
> URL: https://issues.apache.org/jira/browse/ARROW-8070
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Daniel Nugent
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Was messing around with some nested arrays and found a pretty easy to 
> reproduce segfault:
> {code:java}
> Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48)
> [GCC 7.3.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np, pyarrow as pa
> >>> pa.__version__
> '0.16.0'
> >>> np.__version__
> '1.18.1'
> >>> x=[np.array([b'a',b'b'])]
> >>> a = pa.array(x,pa.list_(pa.binary()))
> >>> a
> 
> [
>   [
> 61,
> 62
>   ]
> ]
> >>> a.cast(pa.string())
> Segmentation fault
> {code}
> I don't know if that cast makes sense, but I left the checks on, so I would 
> not expect a segfault from it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8242) [C++] Flight fails to compile on GCC 4.8

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8242.
---------------------------------
Resolution: Fixed

Issue resolved by pull request 6739
[https://github.com/apache/arrow/pull/6739]

> [C++] Flight fails to compile on GCC 4.8
> ----------------------------------------
>
> Key: ARROW-8242
> URL: https://issues.apache.org/jira/browse/ARROW-8242
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See recent build log 
> https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=8944&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=5b4cc83a-7bb0-5664-5bb1-588f7e4dc05b&l=2186



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8217) [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979

2020-03-27 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068931#comment-17068931
 ] 

Neal Richardson commented on ARROW-8217:


I did a build with -DARROW_CXXFLAGS="-g", and it made no difference. That was 
on the C++ build only, though; is there any reason to think it would matter to 
compile the R bindings with that too?

Even after the hanging build was resolved on master, I still can't get a debug 
build: 
https://github.com/ursa-labs/arrow-r-nightly/runs/538449507?check_suite_focus=true#step:7:1126

{code}
[ 16%] Building CXX object 
src/arrow/CMakeFiles/arrow_static.dir/record_batch.cc.obj
C:/Rtools/mingw_64/bin/../lib/gcc/x86_64-w64-mingw32/4.9.3/../../../../x86_64-w64-mingw32/bin/as.exe:
 CMakeFiles/arrow_static.dir/array/diff.cc.obj: too many sections (43914)
D:\a\_temp\msys\msys64\tmp\ccqm015L.s: Assembler messages:
D:\a\_temp\msys\msys64\tmp\ccqm015L.s: Fatal error: can't write 
CMakeFiles/arrow_static.dir/array/diff.cc.obj: File too big
C:/Rtools/mingw_64/bin/../lib/gcc/x86_64-w64-mingw32/4.9.3/../../../../x86_64-w64-mingw32/bin/as.exe:
 CMakeFiles/arrow_static.dir/array/diff.cc.obj: too many sections (43914)
D:\a\_temp\msys\msys64\tmp\ccqm015L.s: Fatal error: can't close 
CMakeFiles/arrow_static.dir/array/diff.cc.obj: File too big
{code}

> [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979
> -------------------------------------------------------------------------------
>
> Key: ARROW-8217
> URL: https://issues.apache.org/jira/browse/ARROW-8217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> If we can obtain a gdb backtrace from the failed test in 
> https://github.com/apache/arrow/pull/6638 then we can sort out what's wrong. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8238) [C++][Compute] Failed to build compute tests on windows with msvc2015

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068929#comment-17068929
 ] 

Wes McKinney edited comment on ARROW-8238 at 3/27/20, 5:36 PM:
---------------------------------------------------------------

Definitely weird since we build these tests in CI. If you are having a hard 
time figuring it out I can try to reproduce locally on my Windows 10 machine


was (Author: wesmckinn):
Definitely weird since we build these functions in CI. If you are having a hard 
time figuring it out I can try to reproduce locally on my Windows 10 machine

> [C++][Compute] Failed to build compute tests on windows with msvc2015
> ----------------------------------------------------------------------
>
> Key: ARROW-8238
> URL: https://issues.apache.org/jira/browse/ARROW-8238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Compute
>Reporter: Yibo Cai
>Priority: Minor
>
> Build Arrow compute tests on Windows10 with MSVC2015:
> {code:bash}
> cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_COMPUTE=ON 
> -DARROW_BUILD_TESTS=ON ..
> ninja -j3
> {code}
> Build failed with below message:
> {code:bash}
> [311/405] Linking CXX executable release\arrow-misc-test.exe
> FAILED: release/arrow-misc-test.exe
> cmd.exe /C "cd . && 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\bin\cmake.exe -E 
> vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-misc-test.dir 
> --rc=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\rc.exe 
> --mt=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\mt.exe --manifests  -- 
> C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj  
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0  /machine:x64  
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console  
> release\arrow_testing.lib  release\arrow.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib  
> Ws2_32.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
> LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj 
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0 /machine:x64 
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
> release\arrow_testing.lib release\arrow.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib 
> Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
> /MANIFESTFILE:release\arrow-misc-test.exe.manifest" failed (exit code 1169) 
> with the following output:
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::vector<int,class 
> std::allocator<int> >(class std::initializer_list<int>,class 
> std::allocator<int> const &)" 
> (??0?$vector@HV?$allocator@H@std@@@std@@QEAA@V?$initializer_list@H@1@AEBV?$allocator@H@1@@Z)
>  already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::~vector<int,class 
> std::allocator<int> >(void)" (??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ) 
> already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: unsigned 
> __int64 __cdecl std::vector<int,class std::allocator<int> >::size(void)const 
> " (?size@?$vector@HV?$allocator@H@std@@@std@@QEBA_KXZ) already defined in 
> result_test.cc.obj
> release\arrow-misc-test.exe : fatal error LNK1169: one or more multiply 
> defined symbols found
> [313/405] Building CXX object 
> src\arrow\CMakeFiles\arrow-table-test.dir\table_builder_test.cc.obj
> ninja: build stopped: subcommand failed.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8238) [C++][Compute] Failed to build compute tests on windows with msvc2015

2020-03-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068929#comment-17068929
 ] 

Wes McKinney commented on ARROW-8238:
-------------------------------------

Definitely weird since we build these functions in CI. If you are having a hard 
time figuring it out I can try to reproduce locally on my Windows 10 machine

> [C++][Compute] Failed to build compute tests on windows with msvc2015
> -
>
> Key: ARROW-8238
> URL: https://issues.apache.org/jira/browse/ARROW-8238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Compute
>Reporter: Yibo Cai
>Priority: Minor
>
> Build Arrow compute tests on Windows10 with MSVC2015:
> {code:bash}
> cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_COMPUTE=ON 
> -DARROW_BUILD_TESTS=ON ..
> ninja -j3
> {code}
> Build failed with below message:
> {code:bash}
> [311/405] Linking CXX executable release\arrow-misc-test.exe
> FAILED: release/arrow-misc-test.exe
> cmd.exe /C "cd . && 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\bin\cmake.exe -E 
> vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-misc-test.dir 
> --rc=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\rc.exe 
> --mt=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\mt.exe --manifests  -- 
> C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj  
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0  /machine:x64  
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console  
> release\arrow_testing.lib  release\arrow.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib  
> Ws2_32.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
> LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj 
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0 /machine:x64 
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
> release\arrow_testing.lib release\arrow.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib 
> Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
> /MANIFESTFILE:release\arrow-misc-test.exe.manifest" failed (exit code 1169) 
> with the following output:
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::vector<int,class 
> std::allocator<int> >(class std::initializer_list<int>,class 
> std::allocator<int> const &)" 
> (??0?$vector@HV?$allocator@H@std@@@std@@QEAA@V?$initializer_list@H@1@AEBV?$allocator@H@1@@Z)
>  already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::~vector<int,class 
> std::allocator<int> >(void)" (??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ) 
> already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: unsigned 
> __int64 __cdecl std::vector<int,class std::allocator<int> >::size(void)const 
> " (?size@?$vector@HV?$allocator@H@std@@@std@@QEBA_KXZ) already defined in 
> result_test.cc.obj
> release\arrow-misc-test.exe : fatal error LNK1169: one or more multiply 
> defined symbols found
> [313/405] Building CXX object 
> src\arrow\CMakeFiles\arrow-table-test.dir\table_builder_test.cc.obj
> ninja: build stopped: subcommand failed.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7741) [C++][Parquet] Incorporate new level generation logic in parquet write path with a flag to revert back to old logic

2020-03-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7741.
-
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6586
[https://github.com/apache/arrow/pull/6586]

> [C++][Parquet] Incorporate new level generation logic in parquet write path 
> with a flag to revert back to old logic
> ---
>
> Key: ARROW-7741
> URL: https://issues.apache.org/jira/browse/ARROW-7741
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> This is likely going to be a decent amount of changes we should isolate them 
> behind a feature flag.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)

2020-03-27 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8061.
---
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6670
[https://github.com/apache/arrow/pull/6670]

> [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support 
> row groups)
> -
>
> Key: ARROW-8061
> URL: https://issues.apache.org/jira/browse/ARROW-8061
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Specifically for parquet (not sure if it will be relevant for other file 
> formats as well; for IPC/feather, potentially the record batch), it would be 
> useful to target row groups instead of files as fragments.
> Quoting the original design documents: _"In datasets consisting of many 
> fragments, the dataset API must expose the granularity of fragments in a 
> public way to enable parallel processing, if desired. "._   
> And a comment from Wes on that: _"a single Parquet file can "export" one or 
> more fragments based on settings. The default might be to split fragments 
> based on row group"_
> Currently, the level on which fragments are defined (at least in the typical 
> partitioned parquet dataset) is "1 file == 1 fragment".
> Would it be possible or desirable to make this more fine grained, where you 
> could also opt to have a fragment per row group?   
> We could have a ParquetFragment that has this option, and a ParquetFileFormat 
> specific option to say what the granularity of a fragment is (file vs row 
> group)?
> cc [~fsaintjacques] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7908) [R] Can't install package without setting LIBARROW_DOWNLOAD=true

2020-03-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7908.

Fix Version/s: 0.17.0
 Assignee: Neal Richardson
   Resolution: Fixed

> [R] Can't install package without setting LIBARROW_DOWNLOAD=true
> 
>
> Key: ARROW-7908
> URL: https://issues.apache.org/jira/browse/ARROW-7908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.16.0
> Environment: Operating System: Red Hat Enterprise Linux Server 7.6 
> (Maipo) 
> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
> Kernel: Linux 3.10.0-957.35.2.el7.x86_64
> Architecture: x86-64  
>Reporter: Taeke
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.17.0
>
>
> Hi,
> Installing arrow in R does not work intuitively on our server.
> {code:r}
> install.packages("arrow")`
> {code}
> results in an error:
> {code:sh}
> Installing package into '/home//R/x86_64-redhat-linux-gnu-library/3.6'
> (as 'lib' is unspecified)
> trying URL 'https://cloud.r-project.org/src/contrib/arrow_0.16.0.2.tar.gz'
> Content type 'application/x-gzip' length 216119 bytes (211 KB)
> ==
> downloaded 211 KB
> * installing *source* package 'arrow' ...
> ** package 'arrow' successfully unpacked and MD5 sums checked
> ** using staged installation
> PKG_CFLAGS=-I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include
>   -DARROW_R_WITH_ARROW
> PKG_LIBS=-L/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/lib
>  -larrow_dataset -lparquet -larrow -lthrift -lsnappy -lz -lzstd -llz4 
> -lbrotlidec-static -lbrotlienc-static -lbrotlicommon-static 
> -lboost_filesystem -lboost_regex -lboost_system -ljemalloc_pic
> ** libs
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG 
> -I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include 
>  -DARROW_R_WITH_ARROW -I"/usr/lib64/R/library/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
> -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 
> -grecord-gcc-switches   -m64 -mtune=generic  -c array.cpp -o array.o
> In file included from array.cpp:18:0:
> ./arrow_types.h:201:31: fatal error: arrow/dataset/api.h: No such file or 
> directory
> {code}
> It appears that the C++ code is not built. With arrow 0.16.0.1 things do work 
> out, because it tries to build the C++ code from source. With arrow 0.16.0.2 
> such is no longer the case. I could finish the installation by setting the 
> environment variable LIBARROW_DOWNLOAD to 'true':
> {code:java}
> export LIBARROW_DOWNLOAD=true
> {code}
> That, apparently, triggers the build from source. I would have expected that 
> I would not need to set this variable explicitly.
> I found that [between 
> versions|https://github.com/apache/arrow/commit/660d0e7cbaa1cfb51498299d445636fdd6a58420],
>  the default value of LIBARROW_DOWNLOAD has changed:
> {code:sh}
> - download_ok <- locally_installing && !env_is("LIBARROW_DOWNLOAD", "false")
> + download_ok <- env_is("LIBARROW_DOWNLOAD", "true")
> {code}
> In our environment, that variable was _not_ set, resulting (accidentally?) in 
> download_ok being false and therefore the libraries not being installed and 
> finally the resulting error above.
>  
> I can't quite figure out the logic behind all this, but it would be nice if 
> we'd be able to install the package without first having to set 
> LIBARROW_DOWNLOAD.
>  
> Thank you for looking into this!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7908) [R] Can't install package without setting LIBARROW_DOWNLOAD=true

2020-03-27 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068828#comment-17068828
 ] 

Neal Richardson commented on ARROW-7908:


Glad to hear!

> [R] Can't install package without setting LIBARROW_DOWNLOAD=true
> 
>
> Key: ARROW-7908
> URL: https://issues.apache.org/jira/browse/ARROW-7908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.16.0
> Environment: Operating System: Red Hat Enterprise Linux Server 7.6 
> (Maipo) 
> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
> Kernel: Linux 3.10.0-957.35.2.el7.x86_64
> Architecture: x86-64  
>Reporter: Taeke
>Priority: Major
>
> Hi,
> Installing arrow in R does not work intuitively on our server.
> {code:r}
> install.packages("arrow")`
> {code}
> results in an error:
> {code:sh}
> Installing package into '/home//R/x86_64-redhat-linux-gnu-library/3.6'
> (as 'lib' is unspecified)
> trying URL 'https://cloud.r-project.org/src/contrib/arrow_0.16.0.2.tar.gz'
> Content type 'application/x-gzip' length 216119 bytes (211 KB)
> ==
> downloaded 211 KB
> * installing *source* package 'arrow' ...
> ** package 'arrow' successfully unpacked and MD5 sums checked
> ** using staged installation
> PKG_CFLAGS=-I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include
>   -DARROW_R_WITH_ARROW
> PKG_LIBS=-L/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/lib
>  -larrow_dataset -lparquet -larrow -lthrift -lsnappy -lz -lzstd -llz4 
> -lbrotlidec-static -lbrotlienc-static -lbrotlicommon-static 
> -lboost_filesystem -lboost_regex -lboost_system -ljemalloc_pic
> ** libs
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG 
> -I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include 
>  -DARROW_R_WITH_ARROW -I"/usr/lib64/R/library/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
> -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 
> -grecord-gcc-switches   -m64 -mtune=generic  -c array.cpp -o array.o
> In file included from array.cpp:18:0:
> ./arrow_types.h:201:31: fatal error: arrow/dataset/api.h: No such file or 
> directory
> {code}
> It appears that the C++ code is not built. With arrow 0.16.0.1 things do work 
> out, because it tries to build the C++ code from source. With arrow 0.16.0.2 
> such is no longer the case. I could finish the installation by setting the 
> environment variable LIBARROW_DOWNLOAD to 'true':
> {code:java}
> export LIBARROW_DOWNLOAD=true
> {code}
> That, apparently, triggers the build from source. I would have expected that 
> I would not need to set this variable explicitly.
> I found that [between 
> versions|https://github.com/apache/arrow/commit/660d0e7cbaa1cfb51498299d445636fdd6a58420],
>  the default value of LIBARROW_DOWNLOAD has changed:
> {code:sh}
> - download_ok <- locally_installing && !env_is("LIBARROW_DOWNLOAD", "false")
> + download_ok <- env_is("LIBARROW_DOWNLOAD", "true")
> {code}
> In our environment, that variable was _not_ set, resulting (accidentally?) in 
> download_ok being false and therefore the libraries not being installed and 
> finally the resulting error above.
>  
> I can't quite figure out the logic behind all this, but it would be nice if 
> we'd be able to install the package without first having to set 
> LIBARROW_DOWNLOAD.
>  
> Thank you for looking into this!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7908) [R] Can't install package without setting LIBARROW_DOWNLOAD=true

2020-03-27 Thread Taeke (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068827#comment-17068827
 ] 

Taeke commented on ARROW-7908:
--

Ah, I see. I did install from CRAN, but in order to work around the error of 
the missing codegen.R, I switched to GitHub, which led to the mismatch in 
version numbers. I could not find another way to get the installation started 
(other than setting LIBARROW_DOWNLOAD, which I tried to avoid).

Indeed, with the nightly build it does install properly. Thanks a lot!

> [R] Can't install package without setting LIBARROW_DOWNLOAD=true
> 
>
> Key: ARROW-7908
> URL: https://issues.apache.org/jira/browse/ARROW-7908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.16.0
> Environment: Operating System: Red Hat Enterprise Linux Server 7.6 
> (Maipo) 
> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
> Kernel: Linux 3.10.0-957.35.2.el7.x86_64
> Architecture: x86-64  
>Reporter: Taeke
>Priority: Major
>
> Hi,
> Installing arrow in R does not work intuitively on our server.
> {code:r}
> install.packages("arrow")`
> {code}
> results in an error:
> {code:sh}
> Installing package into '/home//R/x86_64-redhat-linux-gnu-library/3.6'
> (as 'lib' is unspecified)
> trying URL 'https://cloud.r-project.org/src/contrib/arrow_0.16.0.2.tar.gz'
> Content type 'application/x-gzip' length 216119 bytes (211 KB)
> ==
> downloaded 211 KB
> * installing *source* package 'arrow' ...
> ** package 'arrow' successfully unpacked and MD5 sums checked
> ** using staged installation
> PKG_CFLAGS=-I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include
>   -DARROW_R_WITH_ARROW
> PKG_LIBS=-L/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/lib
>  -larrow_dataset -lparquet -larrow -lthrift -lsnappy -lz -lzstd -llz4 
> -lbrotlidec-static -lbrotlienc-static -lbrotlicommon-static 
> -lboost_filesystem -lboost_regex -lboost_system -ljemalloc_pic
> ** libs
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG 
> -I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include 
>  -DARROW_R_WITH_ARROW -I"/usr/lib64/R/library/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
> -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 
> -grecord-gcc-switches   -m64 -mtune=generic  -c array.cpp -o array.o
> In file included from array.cpp:18:0:
> ./arrow_types.h:201:31: fatal error: arrow/dataset/api.h: No such file or 
> directory
> {code}
> It appears that the C++ code is not built. With arrow 0.16.0.1 things do work 
> out, because it tries to build the C++ code from source. With arrow 0.16.0.2 
> such is no longer the case. I could finish the installation by setting the 
> environment variable LIBARROW_DOWNLOAD to 'true':
> {code:java}
> export LIBARROW_DOWNLOAD=true
> {code}
> That, apparently, triggers the build from source. I would have expected that 
> I would not need to set this variable explicitly.
> I found that [between 
> versions|https://github.com/apache/arrow/commit/660d0e7cbaa1cfb51498299d445636fdd6a58420],
>  the default value of LIBARROW_DOWNLOAD has changed:
> {code:sh}
> - download_ok <- locally_installing && !env_is("LIBARROW_DOWNLOAD", "false")
> + download_ok <- env_is("LIBARROW_DOWNLOAD", "true")
> {code}
> In our environment, that variable was _not_ set, resulting (accidentally?) in 
> download_ok being false and therefore the libraries not being installed and 
> finally the resulting error above.
>  
> I can't quite figure out the logic behind all this, but it would be nice if 
> we'd be able to install the package without first having to set 
> LIBARROW_DOWNLOAD.
>  
> Thank you for looking into this!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8245) [Python][Parquet] Skip hidden directories when reading partitioned parquet files

2020-03-27 Thread Caleb Overman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Overman updated ARROW-8245:
-
Description: When writing a partitioned parquet file Spark can create a 
temporary hidden {{.spark-staging}} directory within the parquet file. Because 
it is a directory and not a file, it is not skipped when trying to read the 
parquet file. Pyarrow currently only skips directories prefixed with {{_}}.  
(was: When writing a partitioned parquet file Spark can create a temporary 
hidden `.spark-staging` directory within the parquet file. Because it is a 
directory and not a file, it is not skipped when trying to read the parquet 
file. Pyarrow currently only skips directories prefixed with `_`.)

> [Python][Parquet] Skip hidden directories when reading partitioned parquet 
> files
> 
>
> Key: ARROW-8245
> URL: https://issues.apache.org/jira/browse/ARROW-8245
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Caleb Overman
>Priority: Minor
>  Labels: parquet
>
> When writing a partitioned parquet file Spark can create a temporary hidden 
> {{.spark-staging}} directory within the parquet file. Because it is a 
> directory and not a file, it is not skipped when trying to read the parquet 
> file. Pyarrow currently only skips directories prefixed with {{_}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7908) [R] Can't install package without setting LIBARROW_DOWNLOAD=true

2020-03-27 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068806#comment-17068806
 ] 

Neal Richardson commented on ARROW-7908:


codegen.R and decor are not required, despite the error messages they throw: 
they're allowed to fail.

I'm not sure why you're seeing a version number of .9000 unless you're 
installing from git/github, not CRAN.

In any case, I believe installation should work now on the latest dev version. 
Could you try installing (no env vars required) from our nightly repository, 
{{install.packages("arrow", repos="https://dl.bintray.com/ursalabs/arrow-r")}}?

> [R] Can't install package without setting LIBARROW_DOWNLOAD=true
> 
>
> Key: ARROW-7908
> URL: https://issues.apache.org/jira/browse/ARROW-7908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.16.0
> Environment: Operating System: Red Hat Enterprise Linux Server 7.6 
> (Maipo) 
> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
> Kernel: Linux 3.10.0-957.35.2.el7.x86_64
> Architecture: x86-64  
>Reporter: Taeke
>Priority: Major
>
> Hi,
> Installing arrow in R does not work intuitively on our server.
> {code:r}
> install.packages("arrow")`
> {code}
> results in an error:
> {code:sh}
> Installing package into '/home//R/x86_64-redhat-linux-gnu-library/3.6'
> (as 'lib' is unspecified)
> trying URL 'https://cloud.r-project.org/src/contrib/arrow_0.16.0.2.tar.gz'
> Content type 'application/x-gzip' length 216119 bytes (211 KB)
> ==
> downloaded 211 KB
> * installing *source* package 'arrow' ...
> ** package 'arrow' successfully unpacked and MD5 sums checked
> ** using staged installation
> PKG_CFLAGS=-I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include
>   -DARROW_R_WITH_ARROW
> PKG_LIBS=-L/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/lib
>  -larrow_dataset -lparquet -larrow -lthrift -lsnappy -lz -lzstd -llz4 
> -lbrotlidec-static -lbrotlienc-static -lbrotlicommon-static 
> -lboost_filesystem -lboost_regex -lboost_system -ljemalloc_pic
> ** libs
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG 
> -I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include 
>  -DARROW_R_WITH_ARROW -I"/usr/lib64/R/library/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
> -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 
> -grecord-gcc-switches   -m64 -mtune=generic  -c array.cpp -o array.o
> In file included from array.cpp:18:0:
> ./arrow_types.h:201:31: fatal error: arrow/dataset/api.h: No such file or 
> directory
> {code}
> It appears that the C++ code is not built. With arrow 0.16.0.1 things do work 
> out, because it tries to build the C++ code from source. With arrow 0.16.0.2 
> such is no longer the case. I could finish the installation by setting the 
> environment variable LIBARROW_DOWNLOAD to 'true':
> {code:java}
> export LIBARROW_DOWNLOAD=true
> {code}
> That, apparently, triggers the build from source. I would have expected that 
> I would not need to set this variable explicitly.
> I found that [between 
> versions|https://github.com/apache/arrow/commit/660d0e7cbaa1cfb51498299d445636fdd6a58420],
>  the default value of LIBARROW_DOWNLOAD has changed:
> {code:sh}
> - download_ok <- locally_installing && !env_is("LIBARROW_DOWNLOAD", "false")
> + download_ok <- env_is("LIBARROW_DOWNLOAD", "true")
> {code}
> In our environment, that variable was _not_ set, resulting (accidentally?) in 
> download_ok being false and therefore the libraries not being installed and 
> finally the resulting error above.
>  
> I can't quite figure out the logic behind all this, but it would be nice if 
> we'd be able to install the package without first having to set 
> LIBARROW_DOWNLOAD.
>  
> Thank you for looking into this!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8245) [Python] Skip hidden directories when reading partitioned parquet files

2020-03-27 Thread Caleb Overman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Overman updated ARROW-8245:
-
Labels: parquet  (was: )

> [Python] Skip hidden directories when reading partitioned parquet files
> ---
>
> Key: ARROW-8245
> URL: https://issues.apache.org/jira/browse/ARROW-8245
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Caleb Overman
>Priority: Minor
>  Labels: parquet
>
> When writing a partitioned parquet file Spark can create a temporary hidden 
> `.spark-staging` directory within the parquet file. Because it is a directory 
> and not a file, it is not skipped when trying to read the parquet file. 
> Pyarrow currently only skips directories prefixed with `_`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8245) [Python][Parquet] Skip hidden directories when reading partitioned parquet files

2020-03-27 Thread Caleb Overman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Overman updated ARROW-8245:
-
Summary: [Python][Parquet] Skip hidden directories when reading partitioned 
parquet files  (was: [Python] Skip hidden directories when reading partitioned 
parquet files)

> [Python][Parquet] Skip hidden directories when reading partitioned parquet 
> files
> 
>
> Key: ARROW-8245
> URL: https://issues.apache.org/jira/browse/ARROW-8245
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Caleb Overman
>Priority: Minor
>  Labels: parquet
>
> When writing a partitioned parquet file Spark can create a temporary hidden 
> `.spark-staging` directory within the parquet file. Because it is a directory 
> and not a file, it is not skipped when trying to read the parquet file. 
> Pyarrow currently only skips directories prefixed with `_`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8245) [Python] Skip hidden directories when reading partitioned parquet files

2020-03-27 Thread Caleb Overman (Jira)
Caleb Overman created ARROW-8245:


 Summary: [Python] Skip hidden directories when reading partitioned 
parquet files
 Key: ARROW-8245
 URL: https://issues.apache.org/jira/browse/ARROW-8245
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Caleb Overman


When writing a partitioned parquet file Spark can create a temporary hidden 
`.spark-staging` directory within the parquet file. Because it is a directory 
and not a file, it is not skipped when trying to read the parquet file. Pyarrow 
currently only skips directories prefixed with `_`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8244) [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-03-27 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068794#comment-17068794
 ] 

Joris Van den Bossche commented on ARROW-8244:
--

Thanks for opening the issue [~rjzamora]

Agreed this is a problem, and I think we should at least also return the path 
(so it can be fixed afterwards), or otherwise set it ourselves (optionally).

Regarding those different options: starting to also return the path together 
with the metadata is not really backwards compatible, so we would need to add 
additional keyword like `path_collector` in addition to `metadata_collector`. 

For simply always populating the file path, that might depend on whether there 
are other use cases for collecting this metadata (although I assume dask is the 
main user of this keyword).   
A github search turned up dask, cudf and spatialpandas as users of the 
`metadata_collector` keyword. I assume `cudf` needs the same fix as dask. I 
didn't check yet how it's used in spatialpandas.

I suppose optionally populating it is the safest, I am only doubtful that 
having it optional behind a new keyword is actually useful (whether there are 
use cases for not wanting to populate it).

> [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" 
> metadata fields
> ---
>
> Key: ARROW-8244
> URL: https://issues.apache.org/jira/browse/ARROW-8244
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Rick Zamora
>Priority: Minor
>  Labels: parquet
> Fix For: 0.17.0
>
>
> Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask has been 
> using the `write_to_dataset` API to write partitioned parquet datasets.  This 
> PR is switching to a (hopefully temporary) custom solution, because that API 
> makes it difficult to populate the "file_path" column-chunk metadata 
> fields that are returned within the optional `metadata_collector` kwarg.  
> Dask needs to set these fields correctly in order to generate a proper global 
> `"_metadata"` file.
> Possible solutions to this problem:
>  # Optionally populate the file-path fields within `write_to_dataset`
>  # Always populate the file-path fields within `write_to_dataset`
>  # Return the file paths for the data written within `write_to_dataset` (up 
> to the user to manually populate the file-path fields)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8244) [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-03-27 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8244:
-
Fix Version/s: 0.17.0

> [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" 
> metadata fields
> ---
>
> Key: ARROW-8244
> URL: https://issues.apache.org/jira/browse/ARROW-8244
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Rick Zamora
>Priority: Minor
> Fix For: 0.17.0
>
>
> Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask has been 
> using the `write_to_dataset` API to write partitioned parquet datasets.  This 
> PR is switching to a (hopefully temporary) custom solution, because that API 
> makes it difficult to populate the "file_path" column-chunk metadata 
> fields that are returned within the optional `metadata_collector` kwarg.  
> Dask needs to set these fields correctly in order to generate a proper global 
> `"_metadata"` file.
> Possible solutions to this problem:
>  # Optionally populate the file-path fields within `write_to_dataset`
>  # Always populate the file-path fields within `write_to_dataset`
>  # Return the file paths for the data written within `write_to_dataset` (up 
> to the user to manually populate the file-path fields)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8244) [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-03-27 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8244:
-
Labels: parquet  (was: )

> [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" 
> metadata fields
> ---
>
> Key: ARROW-8244
> URL: https://issues.apache.org/jira/browse/ARROW-8244
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Rick Zamora
>Priority: Minor
>  Labels: parquet
> Fix For: 0.17.0
>
>
> Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask has been 
> using the `write_to_dataset` API to write partitioned parquet datasets.  This 
> PR is switching to a (hopefully temporary) custom solution, because that API 
> makes it difficult to populate the "file_path" column-chunk metadata 
> fields that are returned within the optional `metadata_collector` kwarg.  
> Dask needs to set these fields correctly in order to generate a proper global 
> `"_metadata"` file.
> Possible solutions to this problem:
>  # Optionally populate the file-path fields within `write_to_dataset`
>  # Always populate the file-path fields within `write_to_dataset`
>  # Return the file paths for the data written within `write_to_dataset` (up 
> to the user to manually populate the file-path fields)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8244) [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-03-27 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8244:
-
Component/s: Python

> [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" 
> metadata fields
> ---
>
> Key: ARROW-8244
> URL: https://issues.apache.org/jira/browse/ARROW-8244
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Rick Zamora
>Priority: Minor
> Fix For: 0.17.0
>
>
> Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask has been 
> using the `write_to_dataset` API to write partitioned parquet datasets.  This 
> PR is switching to a (hopefully temporary) custom solution, because that API 
> makes it difficult to populate the "file_path" column-chunk metadata 
> fields that are returned within the optional `metadata_collector` kwarg.  
> Dask needs to set these fields correctly in order to generate a proper global 
> `"_metadata"` file.
> Possible solutions to this problem:
>  # Optionally populate the file-path fields within `write_to_dataset`
>  # Always populate the file-path fields within `write_to_dataset`
>  # Return the file paths for the data written within `write_to_dataset` (up 
> to the user to manually populate the file-path fields)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8244) [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-03-27 Thread Rick Zamora (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Zamora updated ARROW-8244:
---
Summary: [Python][Parquet] Add `write_to_dataset` option to populate the 
"file_path" metadata fields  (was: [Python] Add `write_to_dataset` option to 
populate the "file_path" metadata fields)

> [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" 
> metadata fields
> ---
>
> Key: ARROW-8244
> URL: https://issues.apache.org/jira/browse/ARROW-8244
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Rick Zamora
>Priority: Minor
>
> Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask has been 
> using the `write_to_dataset` API to write partitioned parquet datasets.  This 
> PR is switching to a (hopefully temporary) custom solution, because that API 
> makes it difficult to populate the "file_path" column-chunk metadata 
> fields that are returned within the optional `metadata_collector` kwarg.  
> Dask needs to set these fields correctly in order to generate a proper global 
> `"_metadata"` file.
> Possible solutions to this problem:
>  # Optionally populate the file-path fields within `write_to_dataset`
>  # Always populate the file-path fields within `write_to_dataset`
>  # Return the file paths for the data written within `write_to_dataset` (up 
> to the user to manually populate the file-path fields)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8244) [Python] Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-03-27 Thread Rick Zamora (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Zamora updated ARROW-8244:
---
Summary: [Python] Add `write_to_dataset` option to populate the "file_path" 
metadata fields  (was: Add `write_to_dataset` option to populate the 
"file_path" metadata fields)

> [Python] Add `write_to_dataset` option to populate the "file_path" metadata 
> fields
> --
>
> Key: ARROW-8244
> URL: https://issues.apache.org/jira/browse/ARROW-8244
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Rick Zamora
>Priority: Minor
>
> Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask has been 
> using the `write_to_dataset` API to write partitioned parquet datasets.  This 
> PR is switching to a (hopefully temporary) custom solution, because that API 
> makes it difficult to populate the "file_path" column-chunk metadata 
> fields that are returned within the optional `metadata_collector` kwarg.  
> Dask needs to set these fields correctly in order to generate a proper global 
> `"_metadata"` file.
> Possible solutions to this problem:
>  # Optionally populate the file-path fields within `write_to_dataset`
>  # Always populate the file-path fields within `write_to_dataset`
>  # Return the file paths for the data written within `write_to_dataset` (up 
> to the user to manually populate the file-path fields)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8244) Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-03-27 Thread Rick Zamora (Jira)
Rick Zamora created ARROW-8244:
--

 Summary: Add `write_to_dataset` option to populate the "file_path" 
metadata fields
 Key: ARROW-8244
 URL: https://issues.apache.org/jira/browse/ARROW-8244
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Rick Zamora


Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask has been 
using the `write_to_dataset` API to write partitioned parquet datasets.  This 
PR is switching to a (hopefully temporary) custom solution, because that API 
makes it difficult to populate the "file_path" column-chunk metadata 
fields that are returned within the optional `metadata_collector` kwarg.  Dask 
needs to set these fields correctly in order to generate a proper global 
`"_metadata"` file.

Possible solutions to this problem:
 # Optionally populate the file-path fields within `write_to_dataset`
 # Always populate the file-path fields within `write_to_dataset`
 # Return the file paths for the data written within `write_to_dataset` (up to 
the user to manually populate the file-path fields)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7941) [Rust] [DataFusion] Logical plan should support unresolved column references

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-7941:
--
Description: 
It should be possible to build a logical plan using column names rather than 
indices since it is more intuitive. There should be an optimizer rule that 
resolves the columns and replaces these unresolved columns with column indices.

 

  was:
I made a mistake in the design of the logical plan. It is better to refer to 
columns by name rather than index.

Benefits of making this change:
 * Allows for support for schemaless data sources e.g. JSON
 * Reduces the complexity of the optimizer rules

 


> [Rust] [DataFusion] Logical plan should support unresolved column references
> 
>
> Key: ARROW-7941
> URL: https://issues.apache.org/jira/browse/ARROW-7941
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Affects Versions: 0.16.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It should be possible to build a logical plan using column names rather than 
> indices since it is more intuitive. There should be an optimizer rule that 
> resolves the columns and replaces these unresolved columns with column 
> indices.
>  
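
As a rough illustration, here is a hedged sketch of what such an unresolved 
reference plus a resolving optimizer pass could look like. The names `Expr`, 
`UnresolvedColumn`, and `resolve` are illustrative only, not the actual 
DataFusion types:
{code:java}
// Hedged sketch only; not the real DataFusion expression tree.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Column referenced by name, before resolution
    UnresolvedColumn(String),
    /// Column referenced by index into the input schema
    Column(usize),
}

/// Optimizer-rule-style pass: replace name references with indices.
fn resolve(expr: &Expr, field_names: &[&str]) -> Result<Expr, String> {
    match expr {
        Expr::UnresolvedColumn(name) => field_names
            .iter()
            .position(|f| *f == name.as_str())
            .map(Expr::Column)
            .ok_or_else(|| format!("no field named '{}'", name)),
        other => Ok(other.clone()),
    }
}

fn main() {
    let fields = ["id", "name"];
    assert_eq!(resolve(&Expr::UnresolvedColumn("name".into()), &fields),
               Ok(Expr::Column(1)));
}
{code}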



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7941) [Rust] [DataFusion] Logical plan should support unresolved column references

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-7941:
--
Summary: [Rust] [DataFusion] Logical plan should support unresolved column 
references  (was: [Rust] [DataFusion] Logical plan should refer to columns by 
name not index)

> [Rust] [DataFusion] Logical plan should support unresolved column references
> 
>
> Key: ARROW-7941
> URL: https://issues.apache.org/jira/browse/ARROW-7941
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Affects Versions: 0.16.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I made a mistake in the design of the logical plan. It is better to refer to 
> columns by name rather than index.
> Benefits of making this change:
>  * Allows for support for schemaless data sources e.g. JSON
>  * Reduces the complexity of the optimizer rules
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8243) [Rust] [DataFusion] Fix inconsistent API in LogicalPlanBuilder

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-8243:
--
Component/s: Rust - DataFusion
 Rust

> [Rust] [DataFusion] Fix inconsistent API in LogicalPlanBuilder
> --
>
> Key: ARROW-8243
> URL: https://issues.apache.org/jira/browse/ARROW-8243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> LogicalPlanBuilder project method takes a &Vec<Expr> whereas other methods 
> take a Vec<Expr>. It makes sense to take Vec<Expr> and take ownership of these 
> inputs since they are being used to build the plan.
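
For illustration, a hedged sketch of the ownership-taking shape being 
proposed (simplified names; the real LogicalPlanBuilder lives in DataFusion 
and carries more state than shown here):
{code:java}
// Illustrative only: Expr and the builder are trimmed down to show the
// signature change, not the actual DataFusion implementation.
#[derive(Clone)]
struct Expr(String);

struct LogicalPlanBuilder {
    projections: Vec<Vec<Expr>>,
}

impl LogicalPlanBuilder {
    // Before (inconsistent): fn project(&self, exprs: &Vec<Expr>) -> Self
    // After: take ownership of the expressions, like the other methods.
    fn project(mut self, exprs: Vec<Expr>) -> Self {
        self.projections.push(exprs); // moved in, no clone required
        self
    }
}

fn main() {
    let builder = LogicalPlanBuilder { projections: vec![] };
    let _plan = builder.project(vec![Expr("id".to_string())]);
}
{code}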



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8243) [Rust] [DataFusion] Fix inconsistent API in LogicalPlanBuilder

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-8243:
--
Fix Version/s: 0.17.0

> [Rust] [DataFusion] Fix inconsistent API in LogicalPlanBuilder
> --
>
> Key: ARROW-8243
> URL: https://issues.apache.org/jira/browse/ARROW-8243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> LogicalPlanBuilder project method takes a &Vec<Expr> whereas other methods 
> take a Vec<Expr>. It makes sense to take Vec<Expr> and take ownership of these 
> inputs since they are being used to build the plan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4957) [Rust] [DataFusion] Implement get_supertype correctly

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-4957:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] [DataFusion] Implement get_supertype correctly
> -
>
> Key: ARROW-4957
> URL: https://issues.apache.org/jira/browse/ARROW-4957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Affects Versions: 0.13.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The current implementation of get_supertype (used in type coercion logic) is 
> very hacky and should be re-implemented with better unit tests as well.
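
For a sense of the direction, a hedged sketch of a table-driven 
reimplementation; the actual coercion rules in DataFusion cover far more type 
pairs, and the exact behavior is not asserted here:
{code:java}
use arrow::datatypes::DataType;

// Illustrative subset of coercion rules; not the real DataFusion table.
fn get_supertype(l: &DataType, r: &DataType) -> Option<DataType> {
    use DataType::*;
    match (l, r) {
        _ if l == r => Some(l.clone()),
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        (Int32, Float64) | (Float64, Int32) => Some(Float64),
        (Int64, Float64) | (Float64, Int64) => Some(Float64),
        _ => None,
    }
}

fn main() {
    assert_eq!(get_supertype(&DataType::Int32, &DataType::Int64),
               Some(DataType::Int64));
}
{code}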



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8205) [Rust] Arrow should enforce unique field names in a schema

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-8205:
--
Description: 
There does not seem to be any validation to avoid schemas being created with 
duplicate field names. We should add this along with unit tests.

This will require changing the signature of the constructors to try_new with a 
Result return type.

  was:There does not seem to be any validation to avoid schemas being created 
with duplicate field names. We should add this along with unit tests.


> [Rust] Arrow should enforce unique field names in a schema
> --
>
> Key: ARROW-8205
> URL: https://issues.apache.org/jira/browse/ARROW-8205
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> There does not seem to be any validation to avoid schemas being created with 
> duplicate field names. We should add this along with unit tests.
> This will require changing the signature of the constructors to try_new with 
> a Result return type.
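
A hedged sketch of the validating constructor described above, with Field and 
Schema trimmed down; `try_new` is the proposed, not yet existing, API:
{code:java}
use std::collections::HashSet;

// Hedged sketch only; the real arrow Field/Schema carry more state.
struct Field { name: String }
struct Schema { fields: Vec<Field> }

impl Schema {
    fn try_new(fields: Vec<Field>) -> Result<Schema, String> {
        let mut seen = HashSet::new();
        for field in &fields {
            // insert returns false if the name was already present
            if !seen.insert(field.name.as_str()) {
                return Err(format!("duplicate field name '{}'", field.name));
            }
        }
        Ok(Schema { fields })
    }
}

fn main() {
    let dup = vec![
        Field { name: "a".to_string() },
        Field { name: "a".to_string() },
    ];
    assert!(Schema::try_new(dup).is_err());
}
{code}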



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8205) [Rust] Arrow should enforce unique field names in a schema

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-8205:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Arrow should enforce unique field names in a schema
> --
>
> Key: ARROW-8205
> URL: https://issues.apache.org/jira/browse/ARROW-8205
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> There does not seem to be any validation to avoid schemas being created with 
> duplicate field names. We should add this along with unit tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8231) [Rust] Parse key_value_metadata from parquet FileMetaData into arrow schema metadata

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8231:
--
Labels: pull-request-available  (was: )

> [Rust] Parse key_value_metadata from parquet FileMetaData into arrow schema 
> metadata
> 
>
> Key: ARROW-8231
> URL: https://issues.apache.org/jira/browse/ARROW-8231
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jörn Horstmann
>Priority: Minor
>  Labels: pull-request-available
>
> The parquet-format FileMetaData struct contains optional key value pairs with 
> additional metadata about the schema:
> [https://docs.rs/parquet-format/2.6.0/src/parquet_format/parquet_format.rs.html#3821]
> When the parquet file was generated using the java avro parquet writer, this 
> for example contains the original avro schema under the `parquet.avro.schema` 
> or `avro.schema` keys.
> It would be nice if this metadata was accessible through the 
> `arrow::datatypes::Schema.metadata` field.
> I'm willing to implement and create a pull request for this feature.
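
A hedged sketch of the proposed mapping; `KeyValue` mirrors the parquet-format 
struct (key: String, value: Option<String>), and the reader plumbing around it 
is omitted:
{code:java}
use std::collections::HashMap;

// Hedged sketch: convert parquet key/value metadata into the HashMap
// shape behind arrow::datatypes::Schema.metadata.
struct KeyValue {
    key: String,
    value: Option<String>,
}

fn to_schema_metadata(kv: Option<Vec<KeyValue>>) -> HashMap<String, String> {
    kv.unwrap_or_default()
        .into_iter()
        .filter_map(|pair| pair.value.map(|v| (pair.key, v)))
        .collect()
}

fn main() {
    let kv = Some(vec![KeyValue {
        key: "avro.schema".to_string(),
        value: Some("{\"type\":\"record\"}".to_string()),
    }]);
    let metadata = to_schema_metadata(kv);
    assert_eq!(metadata["avro.schema"], "{\"type\":\"record\"}");
}
{code}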



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7507) [Rust] Bump Thrift version to 0.13 in parquet-format and parquet

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-7507.
-
Resolution: Not A Problem

This issue was resolved by running cargo update IIRC

> [Rust] Bump Thrift version to 0.13 in parquet-format and parquet
> 
>
> Key: ARROW-7507
> URL: https://issues.apache.org/jira/browse/ARROW-7507
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.15.1
>Reporter: Mahmut Bulut
>Assignee: Andy Grove
>Priority: Major
>  Labels: parquet
> Fix For: 0.17.0
>
>
> *Problem Description*
> Currently, the `byteorder` crate changes are not incorporated in either the 
> `parquet-format` or the `parquet` crate. Both should be updated consistently 
> to Thrift 0.13, in reverse order (first parquet-format, then parquet), to 
> update the dependencies that are using older versions.
> This causes version clashes with other crates that are following 
> upstream.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7681) [Rust] Explicitly seeking a BufReader will discard the internal buffer

2020-03-27 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068757#comment-17068757
 ] 

Andy Grove commented on ARROW-7681:
---

Deferring this to 1.0.0 due to the concerns about the PR adding further 
dependencies on unstable Rust features

> [Rust] Explicitly seeking a BufReader will discard the internal buffer
> --
>
> Key: ARROW-7681
> URL: https://issues.apache.org/jira/browse/ARROW-7681
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Max Burke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> This behavior was observed in the Parquet Rust file reader 
> (parquet/src/util/io.rs).
>  
> Pull request: [https://github.com/apache/arrow/pull/6280]
>  
> From the Rust documentation for BufReader:
>  
> "Seeking always discards the internal buffer, even if the seek position would 
> otherwise fall within it. This guarantees that calling {{.into_inner()}} 
> immediately after a seek yields the underlying reader at the same position."
>  
> [https://doc.rust-lang.org/std/io/struct.BufReader.html#impl-Seek]
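
A hedged illustration of the quoted behavior: seeking through the Seek trait 
drops BufReader's internal buffer even when the target position is inside it, 
so the next read goes back to the underlying reader:
{code:java}
use std::io::{BufReader, Cursor, Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let data: Vec<u8> = (0..64).collect();
    let mut reader = BufReader::with_capacity(32, Cursor::new(data));

    let mut byte = [0u8; 1];
    reader.read_exact(&mut byte)?;      // fills the 32-byte internal buffer
    reader.seek(SeekFrom::Current(1))?; // buffer is discarded here, even
                                        // though offset 2 is inside it
    reader.read_exact(&mut byte)?;      // forces a fresh underlying read
    assert_eq!(byte[0], 2);
    Ok(())
}
{code}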



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7681) [Rust] Explicitly seeking a BufReader will discard the internal buffer

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-7681:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Explicitly seeking a BufReader will discard the internal buffer
> --
>
> Key: ARROW-7681
> URL: https://issues.apache.org/jira/browse/ARROW-7681
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Max Burke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> This behavior was observed in the Parquet Rust file reader 
> (parquet/src/util/io.rs).
>  
> Pull request: [https://github.com/apache/arrow/pull/6280]
>  
> From the Rust documentation for BufReader:
>  
> "Seeking always discards the internal buffer, even if the seek position would 
> otherwise fall within it. This guarantees that calling {{.into_inner()}} 
> immediately after a seek yields the underlying reader at the same position."
>  
> [https://doc.rust-lang.org/std/io/struct.BufReader.html#impl-Seek]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6583) [Rust] Question and Request for Examples of Array Operations

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-6583:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Question and Request for Examples of Array Operations
> 
>
> Key: ARROW-6583
> URL: https://issues.apache.org/jira/browse/ARROW-6583
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Arthur Maciejewicz
>Priority: Minor
> Fix For: 1.0.0
>
>
> Hi all, thank you for your excellent work on Arrow.
> As I was going through the example for the Rust Arrow implementation, 
> specifically the read_csv example 
> [https://github.com/apache/arrow/blob/master/rust/arrow/examples/read_csv.rs] 
> , as well as the generated Rustdocs, and unit tests, it was not quite clear 
> what the intended usage is for operations such as filtering and masking over 
> Arrays.
> One particular use-case I'm interested in is finding all values x in an Array 
> such that x >= N. I came across arrow::compute::array_ops::filter, 
> which seems to be similar to what I want, although it's expecting a mask to 
> already be constructed before performing the filter operation, and it was not 
> obviously visible in the documentation, leading me to believe this might not 
> be idiomatic usage.
> More generally, is the expectation for Arrays on the Rust side that they are 
> just simple data abstractions, without exposing higher-order methods such as 
> filtering/masking? Is the intent to leave that to users? If I missed some 
> piece of documentation, please let me know. For my use-case I ended up trying 
> something like:
> {code:java}
> let column = batch.column(0).as_any().downcast_ref::<Float64Array>().unwrap();
> let mut builder = BooleanBuilder::new(batch.num_rows());
> let N = 5.0;
> for i in 0..batch.num_rows() {
>    if column.value(i) > N {
>       builder.append_value(true).unwrap();
>    } else {
>       builder.append_value(false).unwrap();
>    }
> }
> let mask = builder.finish();
> let filtered_column = filter(column, &mask);{code}
> If possible, could you provide examples of intended usage of Arrays? Thank 
> you!
>  
>  
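
For reference, the mask-then-filter pattern from the question restated as a 
small helper. This is a hedged sketch using only the APIs the question already 
names; whether a ready-made comparison kernel existed in the compute module at 
this point is not confirmed here:
{code:java}
use arrow::array::{Array, BooleanArray, BooleanBuilder, Float64Array};

// Hedged helper: build a boolean mask for values >= n, to be passed to
// arrow::compute::array_ops::filter afterwards.
fn gt_eq_mask(column: &Float64Array, n: f64) -> BooleanArray {
    let mut builder = BooleanBuilder::new(column.len());
    for i in 0..column.len() {
        builder.append_value(column.value(i) >= n).unwrap();
    }
    builder.finish()
}
{code}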



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8242) [C++] Flight fails to compile on GCC 4.8

2020-03-27 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8242:
---
Priority: Blocker  (was: Major)

> [C++] Flight fails to compile on GCC 4.8
> 
>
> Key: ARROW-8242
> URL: https://issues.apache.org/jira/browse/ARROW-8242
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> See recent build log 
> https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=8944=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=5b4cc83a-7bb0-5664-5bb1-588f7e4dc05b=2186



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8242) [C++] Flight fails to compile on GCC 4.8

2020-03-27 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8242:
---
Fix Version/s: 0.17.0

> [C++] Flight fails to compile on GCC 4.8
> 
>
> Key: ARROW-8242
> URL: https://issues.apache.org/jira/browse/ARROW-8242
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> See recent build log 
> https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=8944=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=5b4cc83a-7bb0-5664-5bb1-588f7e4dc05b=2186



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6890) [Rust] [Parquet] ArrowReader fails with seg fault

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-6890.
-
Resolution: Fixed

This was fixed some time ago

> [Rust] [Parquet] ArrowReader fails with seg fault
> -
>
> Key: ARROW-6890
> URL: https://issues.apache.org/jira/browse/ARROW-6890
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Andy Grove
>Assignee: Renjie Liu
>Priority: Major
> Fix For: 0.17.0
>
>
> ArrowReader fails with seg fault when trying to read an unsupported type, 
> like Utf8. We should have it return an Err instead of causing a segmentation 
> fault.
>  
> See [https://github.com/apache/arrow/pull/5641] for a reproducible test.
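As a hypothetical sketch of the suggested behavior (illustrative names only, not the actual reader code): an explicit supported-type check lets the reader surface an Err instead of crashing.

{code:java}
use arrow::datatypes::DataType;
use arrow::error::{ArrowError, Result};

// Reject column types the reader cannot convert yet, instead of proceeding
// and segfaulting. The supported list shown here is illustrative.
fn check_supported(data_type: &DataType) -> Result<()> {
    match data_type {
        DataType::Int32 | DataType::Int64 | DataType::Float64 => Ok(()),
        other => Err(ArrowError::ComputeError(format!(
            "Reading {:?} is not yet supported",
            other
        ))),
    }
}
{code}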



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8243) [Rust] [DataFusion] Fix inconsistent API in LogicalPlanBuilder

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8243:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Fix inconsistent API in LogicalPlanBuilder
> --
>
> Key: ARROW-8243
> URL: https://issues.apache.org/jira/browse/ARROW-8243
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
>
> LogicalPlanBuilder's project method takes a reference whereas other methods 
> take a Vec by value. It makes sense to take a Vec and take ownership of these 
> inputs since they are being used to build the plan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8243) [Rust] [DataFusion] Fix inconsistent API in LogicalPlanBuilder

2020-03-27 Thread Andy Grove (Jira)
Andy Grove created ARROW-8243:
-

 Summary: [Rust] [DataFusion] Fix inconsistent API in 
LogicalPlanBuilder
 Key: ARROW-8243
 URL: https://issues.apache.org/jira/browse/ARROW-8243
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andy Grove
Assignee: Andy Grove


LogicalPlanBuilder's project method takes a reference whereas other methods 
take a Vec by value. It makes sense to take a Vec and take ownership of these 
inputs since they are being used to build the plan.
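To illustrate the ownership point with a sketch (hypothetical builder, not DataFusion's actual LogicalPlanBuilder signatures): taking Vec by value lets the builder move the expressions into the plan it is constructing, while a borrowed parameter forces a clone of every element.

{code:java}
#[derive(Clone)]
struct Expr(String);

struct PlanBuilder {
    exprs: Vec<Expr>,
}

impl PlanBuilder {
    // Borrowing variant: every expression must be cloned out of the Vec.
    fn project_borrowed(mut self, exprs: &Vec<Expr>) -> Self {
        self.exprs.extend(exprs.iter().cloned());
        self
    }

    // Owning variant: the expressions are moved straight into the plan.
    fn project(mut self, exprs: Vec<Expr>) -> Self {
        self.exprs.extend(exprs);
        self
    }
}
{code}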



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8241) [Rust] Add convenience methods to Schema

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8241:
--
Labels: pull-request-available  (was: )

> [Rust] Add convenience methods to Schema
> 
>
> Key: ARROW-8241
> URL: https://issues.apache.org/jira/browse/ARROW-8241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> I would like to add the following methods to Schema to make it easier to work 
> with.
>  
> {code:java}
> pub fn field_with_name(&self, name: &str) -> Result<&Field>;
> pub fn index_of(&self, name: &str) -> Result<usize>;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8242) [C++] Flight fails to compile on GCC 4.8

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8242:
--
Labels: pull-request-available  (was: )

> [C++] Flight fails to compile on GCC 4.8
> 
>
> Key: ARROW-8242
> URL: https://issues.apache.org/jira/browse/ARROW-8242
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>
> See recent build log 
> https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=8944=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=5b4cc83a-7bb0-5664-5bb1-588f7e4dc05b=2186



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8242) [C++] Flight fails to compile on GCC 4.8

2020-03-27 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8242:
---
Summary: [C++] Flight fails to compile on GCC 4.8  (was: [C++] GCC 4.8 
fails to compileFlight)

> [C++] Flight fails to compile on GCC 4.8
> 
>
> Key: ARROW-8242
> URL: https://issues.apache.org/jira/browse/ARROW-8242
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> See recent build log 
> https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=8944=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=5b4cc83a-7bb0-5664-5bb1-588f7e4dc05b=2186



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8242) [C++] GCC 4.8 fails to compileFlight

2020-03-27 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8242:
---
Summary: [C++] GCC 4.8 fails to compileFlight  (was: [C++] GCC 4.8 fails to 
compile Flight)

> [C++] GCC 4.8 fails to compileFlight
> 
>
> Key: ARROW-8242
> URL: https://issues.apache.org/jira/browse/ARROW-8242
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> See recent build log 
> https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=8944=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=5b4cc83a-7bb0-5664-5bb1-588f7e4dc05b=2186



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8242) [C++] GCC 4.8 fails to compile Flight

2020-03-27 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8242:
--

 Summary: [C++] GCC 4.8 fails to compile Flight
 Key: ARROW-8242
 URL: https://issues.apache.org/jira/browse/ARROW-8242
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


See recent build log 
https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=8944=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=5b4cc83a-7bb0-5664-5bb1-588f7e4dc05b=2186



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8241) [Rust] Add convenience methods to Schema

2020-03-27 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-8241:
--
Summary: [Rust] Add convenience methods to Schema  (was: Add convenience 
methods to Schema)

> [Rust] Add convenience methods to Schema
> 
>
> Key: ARROW-8241
> URL: https://issues.apache.org/jira/browse/ARROW-8241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 0.17.0
>
>
> I would like to add the following methods to Schema to make it easier to work 
> with.
>  
> {code:java}
> pub fn field_with_name(, name: ) -> Result<>;
> pub fn index_of(, name: ) -> Result;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8241) Add convenience methods to Schema

2020-03-27 Thread Andy Grove (Jira)
Andy Grove created ARROW-8241:
-

 Summary: Add convenience methods to Schema
 Key: ARROW-8241
 URL: https://issues.apache.org/jira/browse/ARROW-8241
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.17.0


I would like to add the following methods to Schema to make it easier to work 
with.

 
{code:java}
pub fn field_with_name(&self, name: &str) -> Result<&Field>;

pub fn index_of(&self, name: &str) -> Result<usize>;
{code}
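A short usage sketch, assuming the two helpers land with the signatures above; Schema, Field, and DataType are the existing arrow::datatypes types.

{code:java}
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;

fn main() -> Result<()> {
    let schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]);
    // Look a field up by name instead of scanning schema.fields() by hand.
    let index = schema.index_of("name")?;
    let field = schema.field_with_name("name")?;
    println!("column {} has type {:?}", index, field.data_type());
    Ok(())
}
{code}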



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8240) [Python] New FS interface (pyarrow.fs) does not seem to work correctly for HDFS (Python 3.6, pyarrow 0.16.0)

2020-03-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068693#comment-17068693
 ] 

Antoine Pitrou commented on ARROW-8240:
---

cc [~kszucs]

> [Python] New FS interface (pyarrow.fs) does not seem to work correctly for 
> HDFS (Python 3.6, pyarrow 0.16.0)
> 
>
> Key: ARROW-8240
> URL: https://issues.apache.org/jira/browse/ARROW-8240
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Yaqub Alwan
>Priority: Major
>
> I'll preface this with the limited setup I had to do:
> {{export CLASSPATH=$(hadoop classpath --glob)}}
> {{export 
> ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64}}
>  
> Then I ran the following:
> {code}
> In [1]: import pyarrow.fs 
>   
>   
> 
> In [2]: c = pyarrow.fs.HadoopFileSystem() 
>   
>   
> 
> In [3]: sel = pyarrow.fs.FileSelector('/user/rwiumli')
>   
>   
> 
> In [4]: c.get_target_stats(sel)   
>   
>   
> 
> ---
> OSError   Traceback (most recent call last)
>  in 
> > 1 c.get_target_stats(sel)
> ~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs.FileSystem.get_target_stats()
> ~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> ~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> OSError: HDFS list directory failed, errno: 2 (No such file or directory)
> In [5]: sel = pyarrow.fs.FileSelector('.')
>   
>   
> 
> In [6]: c.get_target_stats(sel)   
>   
>   
> 
> Out[6]: 
> [,
>  ,
>  ]
> In [7]: !ls   
>   
>   
> 
> sample.py  sandeep  venv
> In [8]:   
> {code}
> It looks like the new hadoop fs interface is doing a local lookup?
> Ok fine...
> {code}
> In [8]: sel = pyarrow.fs.FileSelector('hdfs:///user/rwiumli') # shouldnt have 
> to do this
>   
> 
> In [9]: c.get_target_stats(sel)   
>   
>   
> 
> hdfsGetPathInfo(hdfs:///user/rwiumli): getFileInfo error:
> IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: 
> file:///java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, 
> expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:593)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
> hdfsListDirectory(hdfs:///user/rwiumli): FileSystem#listStatus error:
> IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: 
> 

[jira] [Updated] (ARROW-8240) [Python] New FS interface (pyarrow.fs) does not seem to work correctly for HDFS (Python 3.6, pyarrow 0.16.0)

2020-03-27 Thread Yaqub Alwan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaqub Alwan updated ARROW-8240:
---
Description: 
I'll preface this with the limited setup I had to do:


{{export CLASSPATH=$(hadoop classpath --glob)}}

{{export 
ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64}}

 
Then I ran the following:

{code}
In [1]: import pyarrow.fs   

  

In [2]: c = pyarrow.fs.HadoopFileSystem()   

  

In [3]: sel = pyarrow.fs.FileSelector('/user/rwiumli')  

  

In [4]: c.get_target_stats(sel) 

  
---
OSError   Traceback (most recent call last)
 in 
> 1 c.get_target_stats(sel)

~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs.FileSystem.get_target_stats()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in 
pyarrow.lib.check_status()

OSError: HDFS list directory failed, errno: 2 (No such file or directory)

In [5]: sel = pyarrow.fs.FileSelector('.')  

  

In [6]: c.get_target_stats(sel) 

  
Out[6]: 
[,
 ,
 ]

In [7]: !ls 

  
sample.py  sandeep  venv

In [8]:   
{code}

It looks like the new hadoop fs interface is doing a local lookup?

Ok fine...

{code}
In [8]: sel = pyarrow.fs.FileSelector('hdfs:///user/rwiumli') # shouldnt have 
to do this  


In [9]: c.get_target_stats(sel) 

  
hdfsGetPathInfo(hdfs:///user/rwiumli): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: 
file:///java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, 
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:593)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
hdfsListDirectory(hdfs:///user/rwiumli): FileSystem#listStatus error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: 
file:///java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, 
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at 
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:410)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1566)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1609)
at 
org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:667)
---
OSError

[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-27 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068688#comment-17068688
 ] 

Jacek Pliszka commented on ARROW-3329:
--

OK, tried that but there are errors there too:

1. it is also inconsistent with pushd arrow
2. pushd arrow/cpp/build before cmake should be without build
3. libbz2 is missing even though it was not missing with pip
4. same error at the end:

{code}
python setup.py build_ext --inplace
-- Running cmake --build for pyarrow
cmake --build . --config release --
Error: could not load cache
error: command 'cmake' failed with exit status 1
{code}

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8240) [Python] New FS interface (pyarrow.fs) does not seem to work correctly for HDFS (Python 3.6, pyarrow 0.16.0)

2020-03-27 Thread Yaqub Alwan (Jira)
Yaqub Alwan created ARROW-8240:
--

 Summary: [Python] New FS interface (pyarrow.fs) does not seem to 
work correctly for HDFS (Python 3.6, pyarrow 0.16.0)
 Key: ARROW-8240
 URL: https://issues.apache.org/jira/browse/ARROW-8240
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Yaqub Alwan


I'll preface this with the limited setup I had to do:


{{export CLASSPATH=$(hadoop classpath --glob)}}

{{export 
ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64}}

 
Then I ran the following:

{code}
In [1]: import pyarrow.fs   

  

In [2]: c = pyarrow.fs.HadoopFileSystem()   

  

In [3]: sel = pyarrow.fs.FileSelector('/user/rwiumli')  

  

In [4]: c.get_target_stats(sel) 

  
---
OSError   Traceback (most recent call last)
 in 
> 1 c.get_target_stats(sel)

~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs.FileSystem.get_target_stats()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in 
pyarrow.lib.check_status()

OSError: HDFS list directory failed, errno: 2 (No such file or directory)

In [5]: sel = pyarrow.fs.FileSelector('.')  

  

In [6]: c.get_target_stats(sel) 

  
Out[6]: 
[,
 ,
 ]

In [7]: !ls 

  
sample.py  sandeep  venv

In [8]:   
{code}

It looks like the new hadoop fs interface is doing a local lookup?

Ok fine...

{code}
In [8]: sel = pyarrow.fs.FileSelector('hdfs:///user/rwiumli') # shouldnt have 
to do this  


In [9]: c.get_target_stats(sel) 

  
hdfsGetPathInfo(hdfs:///user/rwiumli): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: 
file:///java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, 
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:593)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
hdfsListDirectory(hdfs:///user/rwiumli): FileSystem#listStatus error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: 
file:///java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, 
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at 
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:410)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1566)
at 

[jira] [Updated] (ARROW-8070) [C++] Cast segfaults on unsupported cast from list to utf8

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8070:
--
Labels: pull-request-available  (was: )

> [C++] Cast segfaults on unsupported cast from list to utf8
> --
>
> Key: ARROW-8070
> URL: https://issues.apache.org/jira/browse/ARROW-8070
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Daniel Nugent
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> Was messing around with some nested arrays and found a pretty easy to 
> reproduce segfault:
> {code:java}
> Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48)
> [GCC 7.3.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np, pyarrow as pa
> >>> pa.__version__
> '0.16.0'
> >>> np.__version__
> '1.18.1'
> >>> x=[np.array([b'a',b'b'])]
> >>> a = pa.array(x,pa.list_(pa.binary()))
> >>> a
> 
> [
>   [
> 61,
> 62
>   ]
> ]
> >>> a.cast(pa.string())
> Segmentation fault
> {code}
> I don't know if that cast makes sense, but I left the checks on, so I would 
> not expect a segfault from it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8070) [C++] Cast segfaults on unsupported cast from list to utf8

2020-03-27 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8070:
---
Summary: [C++] Cast segfaults on unsupported cast from list to utf8 
 (was: [Python] Array.cast segfaults on unsupported cast from list to 
utf8)

> [C++] Cast segfaults on unsupported cast from list to utf8
> --
>
> Key: ARROW-8070
> URL: https://issues.apache.org/jira/browse/ARROW-8070
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Daniel Nugent
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 0.17.0
>
>
> Was messing around with some nested arrays and found a pretty easy to 
> reproduce segfault:
> {code:java}
> Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48)
> [GCC 7.3.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np, pyarrow as pa
> >>> pa.__version__
> '0.16.0'
> >>> np.__version__
> '1.18.1'
> >>> x=[np.array([b'a',b'b'])]
> >>> a = pa.array(x,pa.list_(pa.binary()))
> >>> a
> 
> [
>   [
> 61,
> 62
>   ]
> ]
> >>> a.cast(pa.string())
> Segmentation fault
> {code}
> I don't know if that cast makes sense, but I left the checks on, so I would 
> not expect a segfault from it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8070) [C++] Cast segfaults on unsupported cast from list to utf8

2020-03-27 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8070:
---
Component/s: (was: Python)
 C++

> [C++] Cast segfaults on unsupported cast from list to utf8
> --
>
> Key: ARROW-8070
> URL: https://issues.apache.org/jira/browse/ARROW-8070
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Daniel Nugent
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 0.17.0
>
>
> Was messing around with some nested arrays and found a pretty easy to 
> reproduce segfault:
> {code:java}
> Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48)
> [GCC 7.3.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np, pyarrow as pa
> >>> pa.__version__
> '0.16.0'
> >>> np.__version__
> '1.18.1'
> >>> x=[np.array([b'a',b'b'])]
> >>> a = pa.array(x,pa.list_(pa.binary()))
> >>> a
> 
> [
>   [
> 61,
> 62
>   ]
> ]
> >>> a.cast(pa.string())
> Segmentation fault
> {code}
> I don't know if that cast makes sense, but I left the checks on, so I would 
> not expect a segfault from it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7908) [R] Can't install package without setting LIBARROW_DOWNLOAD=true

2020-03-27 Thread Taeke (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068657#comment-17068657
 ] 

Taeke commented on ARROW-7908:
--

Hi,

Sorry for the long silence. With ARROW_R_DEV=TRUE I get:
{code:sh}
trying URL 'https://cloud.r-project.org/src/contrib/arrow_0.16.0.2.tar.gz'
Content type 'application/x-gzip' length 216119 bytes (211 KB)
==
downloaded 211 KB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Generating code with data-raw/codegen.R
Fatal error: cannot open file 'data-raw/codegen.R': No such file or directory
PKG_CFLAGS=-I/tmp/Rtmp7CrqGP/R.INSTALL1bebe61d5312e/arrow/libarrow/arrow-0.16.0.2/include
 -DARROW_R_WITH_ARROW
PKG_LIBS=-L/tmp/Rtmp7CrqGP/R.INSTALL1bebe61d5312e/arrow/libarrow/arrow-0.16.0.2/lib
 -larrow_dataset -lparquet -larrow -lthrift -lsnappy -lz -lzstd -llz4 
-lbrotlidec-static -lbrotlienc-static -lbrotlicommon-static -lboost_filesystem 
-lboost_regex -lboost_system -ljemalloc_pic
** libs
g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG 
-I/tmp/Rtmp7CrqGP/R.INSTALL1bebe61d5312e/arrow/libarrow/arrow-0.16.0.2/include 
-DARROW_R_WITH_ARROW -I"/usr/lib64/R/library/Rcpp/include" -I/usr/local/include 
-fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 
-mtune=generic -c array.cpp -o array.o
In file included from array.cpp:18:0:
./arrow_types.h:201:31: fatal error: arrow/dataset/api.h: No such file or 
directory
#include <arrow/dataset/api.h>
^
compilation terminated.
make: *** [array.o] Error 1
ERROR: compilation failed for package ‘arrow’
{code}
data-raw/codegen.R is missing because it is listed in .Rbuildignore.

Removing that line from .Rbuildignore makes the installation run somewhat 
further, but data-raw/codegen.R then fails:
{code:java}
*** Generating code with data-raw/codegen.R
Error in library(decor) : there is no package called ‘decor’
Calls: suppressPackageStartupMessages -> withCallingHandlers -> library
Execution halted
{code}
That I could fix (manually, at least) by installing decor:
{code:java}
remotes::install_github("romainfrancois/decor")
{code}
That line is, understandably, commented out in data-raw/codegen.R

Finally, configure runs tools/linuxlibs.R, which calls download_source(), but 
that fails due to an invalid version:
{code:java}
trying URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.16.0.9000.zip'
Error in download.file(source_url, tf1, quiet = quietly) :
  (converted from warning) cannot open URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.16.0.9000.zip': 
HTTP status was '404 Not Found'
{code}
This I could fix by changing the version in the DESCRIPTION to 0.16.0.2.

After that the installation concludes as expected.

In summary, for this to work:
 * data-raw/codegen.R must not be included in .Rbuildignore
 * decor needs to become a dependency (?)
 * version number needs to be updated in DESCRIPTION to correspond with an 
available download

 

> [R] Can't install package without setting LIBARROW_DOWNLOAD=true
> 
>
> Key: ARROW-7908
> URL: https://issues.apache.org/jira/browse/ARROW-7908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.16.0
> Environment: Operating System: Red Hat Enterprise Linux Server 7.6 
> (Maipo) 
> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
> Kernel: Linux 3.10.0-957.35.2.el7.x86_64
> Architecture: x86-64  
>Reporter: Taeke
>Priority: Major
>
> Hi,
> Installing arrow in R does not work intuitively on our server.
> {code:r}
> install.packages("arrow")`
> {code}
> results in an error:
> {code:sh}
> Installing package into '/home//R/x86_64-redhat-linux-gnu-library/3.6'
> (as 'lib' is unspecified)
> trying URL 'https://cloud.r-project.org/src/contrib/arrow_0.16.0.2.tar.gz'
> Content type 'application/x-gzip' length 216119 bytes (211 KB)
> ==
> downloaded 211 KB
> * installing *source* package 'arrow' ...
> ** package 'arrow' successfully unpacked and MD5 sums checked
> ** using staged installation
> PKG_CFLAGS=-I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include
>   -DARROW_R_WITH_ARROW
> PKG_LIBS=-L/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/lib
>  -larrow_dataset -lparquet -larrow -lthrift -lsnappy -lz -lzstd -llz4 
> -lbrotlidec-static -lbrotlienc-static -lbrotlicommon-static 
> -lboost_filesystem -lboost_regex -lboost_system -ljemalloc_pic
> ** libs
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG 
> -I/tmp/Rtmp3v1BDf/R.INSTALL4a5d5d9f8bc8/arrow/libarrow/arrow-0.16.0.2/include 
>  

[jira] [Commented] (ARROW-8238) [C++][Compute] Failed to build compute tests on windows with msvc2015

2020-03-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068499#comment-17068499
 ] 

Antoine Pitrou commented on ARROW-8238:
---

I don't think that would make a difference, but you can try it out.

> [C++][Compute] Failed to build compute tests on windows with msvc2015
> -
>
> Key: ARROW-8238
> URL: https://issues.apache.org/jira/browse/ARROW-8238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Compute
>Reporter: Yibo Cai
>Priority: Minor
>
> Build Arrow compute tests on Windows10 with MSVC2015:
> {code:bash}
> cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_COMPUTE=ON 
> -DARROW_BUILD_TESTS=ON ..
> ninja -j3
> {code}
> Build failed with below message:
> {code:bash}
> [311/405] Linking CXX executable release\arrow-misc-test.exe
> FAILED: release/arrow-misc-test.exe
> cmd.exe /C "cd . && 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\bin\cmake.exe -E 
> vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-misc-test.dir 
> --rc=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\rc.exe 
> --mt=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\mt.exe --manifests  -- 
> C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj  
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0  /machine:x64  
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console  
> release\arrow_testing.lib  release\arrow.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib  
> Ws2_32.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
> LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj 
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0 /machine:x64 
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
> release\arrow_testing.lib release\arrow.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib 
> Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
> /MANIFESTFILE:release\arrow-misc-test.exe.manifest" failed (exit code 1169) 
> with the following output:
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::vector<int,class 
> std::allocator<int> >(class std::initializer_list<int>,class 
> std::allocator<int> const &)" 
> (??0?$vector@HV?$allocator@H@std@@@std@@QEAA@V?$initializer_list@H@1@AEBV?$allocator@H@1@@Z)
>  already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::~vector<int,class 
> std::allocator<int> >(void)" (??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ) 
> already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: unsigned 
> __int64 __cdecl std::vector<int,class std::allocator<int> >::size(void)const 
> " (?size@?$vector@HV?$allocator@H@std@@@std@@QEBA_KXZ) already defined in 
> result_test.cc.obj
> release\arrow-misc-test.exe : fatal error LNK1169: one or more multiply 
> defined symbols found
> [313/405] Building CXX object 
> src\arrow\CMakeFiles\arrow-table-test.dir\table_builder_test.cc.obj
> ninja: build stopped: subcommand failed.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8238) [C++][Compute] Failed to build compute tests on windows with msvc2015

2020-03-27 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068495#comment-17068495
 ] 

Yibo Cai commented on ARROW-8238:
-

For these local helper functions in unit test files, any reason we didn't 
define them as static?

> [C++][Compute] Failed to build compute tests on windows with msvc2015
> -
>
> Key: ARROW-8238
> URL: https://issues.apache.org/jira/browse/ARROW-8238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Compute
>Reporter: Yibo Cai
>Priority: Minor
>
> Build Arrow compute tests on Windows10 with MSVC2015:
> {code:bash}
> cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_COMPUTE=ON 
> -DARROW_BUILD_TESTS=ON ..
> ninja -j3
> {code}
> Build failed with below message:
> {code:bash}
> [311/405] Linking CXX executable release\arrow-misc-test.exe
> FAILED: release/arrow-misc-test.exe
> cmd.exe /C "cd . && 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\bin\cmake.exe -E 
> vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-misc-test.dir 
> --rc=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\rc.exe 
> --mt=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\mt.exe --manifests  -- 
> C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj  
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0  /machine:x64  
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console  
> release\arrow_testing.lib  release\arrow.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib  
> Ws2_32.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
> LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj 
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0 /machine:x64 
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
> release\arrow_testing.lib release\arrow.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib 
> Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
> /MANIFESTFILE:release\arrow-misc-test.exe.manifest" failed (exit code 1169) 
> with the following output:
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::vector<int,class 
> std::allocator<int> >(class std::initializer_list<int>,class 
> std::allocator<int> const &)" 
> (??0?$vector@HV?$allocator@H@std@@@std@@QEAA@V?$initializer_list@H@1@AEBV?$allocator@H@1@@Z)
>  already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::~vector<int,class 
> std::allocator<int> >(void)" (??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ) 
> already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: unsigned 
> __int64 __cdecl std::vector<int,class std::allocator<int> >::size(void)const 
> " (?size@?$vector@HV?$allocator@H@std@@@std@@QEBA_KXZ) already defined in 
> result_test.cc.obj
> release\arrow-misc-test.exe : fatal error LNK1169: one or more multiply 
> defined symbols found
> [313/405] Building CXX object 
> src\arrow\CMakeFiles\arrow-table-test.dir\table_builder_test.cc.obj
> ninja: build stopped: subcommand failed.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8238) [C++][Compute] Failed to build compute tests on windows with msvc2015

2020-03-27 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068490#comment-17068490
 ] 

Yibo Cai commented on ARROW-8238:
-

[~apitrou], trying to nail down the issue with simplified, manual link steps. 
Looks like symbol collisions.

One finding is that this problem may be fixed (verified only through a simple 
test so far, not fully) by defining functions as static in all test sources. 
Will do more tests.

[https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit_util_test.cc#L52]

> [C++][Compute] Failed to build compute tests on windows with msvc2015
> -
>
> Key: ARROW-8238
> URL: https://issues.apache.org/jira/browse/ARROW-8238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Compute
>Reporter: Yibo Cai
>Priority: Minor
>
> Build Arrow compute tests on Windows10 with MSVC2015:
> {code:bash}
> cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_COMPUTE=ON 
> -DARROW_BUILD_TESTS=ON ..
> ninja -j3
> {code}
> Build failed with below message:
> {code:bash}
> [311/405] Linking CXX executable release\arrow-misc-test.exe
> FAILED: release/arrow-misc-test.exe
> cmd.exe /C "cd . && 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\bin\cmake.exe -E 
> vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-misc-test.dir 
> --rc=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\rc.exe 
> --mt=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\mt.exe --manifests  -- 
> C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj  
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0  /machine:x64  
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console  
> release\arrow_testing.lib  release\arrow.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib  
> Ws2_32.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
> LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj 
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0 /machine:x64 
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
> release\arrow_testing.lib release\arrow.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib 
> Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
> /MANIFESTFILE:release\arrow-misc-test.exe.manifest" failed (exit code 1169) 
> with the following output:
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::vector<int,class 
> std::allocator<int> >(class std::initializer_list<int>,class 
> std::allocator<int> const &)" 
> (??0?$vector@HV?$allocator@H@std@@@std@@QEAA@V?$initializer_list@H@1@AEBV?$allocator@H@1@@Z)
>  already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::~vector<int,class 
> std::allocator<int> >(void)" (??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ) 
> already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: unsigned 
> __int64 __cdecl std::vector<int,class std::allocator<int> >::size(void)const 
> " (?size@?$vector@HV?$allocator@H@std@@@std@@QEBA_KXZ) already defined in 
> result_test.cc.obj
> release\arrow-misc-test.exe : fatal error LNK1169: one or more multiply 
> defined symbols found
> [313/405] Building CXX object 
> src\arrow\CMakeFiles\arrow-table-test.dir\table_builder_test.cc.obj
> ninja: build stopped: subcommand failed.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8238) [C++][Compute] Failed to build compute tests on windows with msvc2015

2020-03-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068473#comment-17068473
 ] 

Antoine Pitrou commented on ARROW-8238:
---

This looks weird. Have you found a fix?

> [C++][Compute] Failed to build compute tests on windows with msvc2015
> -
>
> Key: ARROW-8238
> URL: https://issues.apache.org/jira/browse/ARROW-8238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Compute
>Reporter: Yibo Cai
>Priority: Minor
>
> Build Arrow compute tests on Windows10 with MSVC2015:
> {code:bash}
> cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_COMPUTE=ON 
> -DARROW_BUILD_TESTS=ON ..
> ninja -j3
> {code}
> Build failed with below message:
> {code:bash}
> [311/405] Linking CXX executable release\arrow-misc-test.exe
> FAILED: release/arrow-misc-test.exe
> cmd.exe /C "cd . && 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\bin\cmake.exe -E 
> vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-misc-test.dir 
> --rc=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\rc.exe 
> --mt=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\mt.exe --manifests  -- 
> C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj  
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0  /machine:x64  
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console  
> release\arrow_testing.lib  release\arrow.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib  
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib  
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib  
> Ws2_32.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
> LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
> src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj 
> /out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
> /pdb:release\arrow-misc-test.pdb /version:0.0 /machine:x64 
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
> release\arrow_testing.lib release\arrow.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib 
> C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib 
> Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
> /MANIFESTFILE:release\arrow-misc-test.exe.manifest" failed (exit code 1169) 
> with the following output:
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::vector<int,class 
> std::allocator<int> >(class std::initializer_list<int>,class 
> std::allocator<int> const &)" 
> (??0?$vector@HV?$allocator@H@std@@@std@@QEAA@V?$initializer_list@H@1@AEBV?$allocator@H@1@@Z)
>  already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
> std::vector<int,class std::allocator<int> >::~vector<int,class 
> std::allocator<int> >(void)" (??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ) 
> already defined in result_test.cc.obj
> arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: unsigned 
> __int64 __cdecl std::vector<int,class std::allocator<int> >::size(void)const 
> " (?size@?$vector@HV?$allocator@H@std@@@std@@QEBA_KXZ) already defined in 
> result_test.cc.obj
> release\arrow-misc-test.exe : fatal error LNK1169: one or more multiply 
> defined symbols found
> [313/405] Building CXX object 
> src\arrow\CMakeFiles\arrow-table-test.dir\table_builder_test.cc.obj
> ninja: build stopped: subcommand failed.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8239) [Java] fix param checks in splitAndTransfer method

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8239:
--
Labels: pull-request-available  (was: )

> [Java] fix param checks in splitAndTransfer method
> --
>
> Key: ARROW-8239
> URL: https://issues.apache.org/jira/browse/ARROW-8239
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8239) [Java] fix param checks in splitAndTransfer method

2020-03-27 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-8239:
---

 Summary: [Java] fix param checks in splitAndTransfer method
 Key: ARROW-8239
 URL: https://issues.apache.org/jira/browse/ARROW-8239
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8238) [C++][Compute] Failed to build compute tests on windows with msvc2015

2020-03-27 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8238:
---

 Summary: [C++][Compute] Failed to build compute tests on windows 
with msvc2015
 Key: ARROW-8238
 URL: https://issues.apache.org/jira/browse/ARROW-8238
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Compute
Reporter: Yibo Cai


Build Arrow compute tests on Windows10 with MSVC2015:
{code:bash}
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_COMPUTE=ON 
-DARROW_BUILD_TESTS=ON ..

ninja -j3
{code}

Build failed with below message:
{code:bash}
[311/405] Linking CXX executable release\arrow-misc-test.exe
FAILED: release/arrow-misc-test.exe
cmd.exe /C "cd . && 
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\bin\cmake.exe -E 
vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-misc-test.dir 
--rc=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\rc.exe 
--mt=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\mt.exe --manifests  -- 
C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj  
/out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
/pdb:release\arrow-misc-test.pdb /version:0.0  /machine:x64  
/NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console  
release\arrow_testing.lib  release\arrow.lib  
googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib  
googletest_ep-prefix\src\googletest_ep\lib\gtest.lib  
googletest_ep-prefix\src\googletest_ep\lib\gmock.lib  
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib  
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib  
Ws2_32.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj 
/out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
/pdb:release\arrow-misc-test.pdb /version:0.0 /machine:x64 /NODEFAULTLIB:LIBCMT 
/INCREMENTAL:NO /subsystem:console release\arrow_testing.lib release\arrow.lib 
googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib 
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib 
Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib 
oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
/MANIFESTFILE:release\arrow-misc-test.exe.manifest" failed (exit code 1169) 
with the following output:
arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
std::vector<int,class std::allocator<int> >::vector<int,class 
std::allocator<int> >(class std::initializer_list<int>,class 
std::allocator<int> const &)" 
(??0?$vector@HV?$allocator@H@std@@@std@@QEAA@V?$initializer_list@H@1@AEBV?$allocator@H@1@@Z)
 already defined in result_test.cc.obj
arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
std::vector<int,class std::allocator<int> >::~vector<int,class 
std::allocator<int> >(void)" (??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ) 
already defined in result_test.cc.obj
arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: unsigned __int64 
__cdecl std::vector<int,class std::allocator<int> >::size(void)const " 
(?size@?$vector@HV?$allocator@H@std@@@std@@QEBA_KXZ) already defined in 
result_test.cc.obj
release\arrow-misc-test.exe : fatal error LNK1169: one or more multiply defined 
symbols found
[313/405] Building CXX object 
src\arrow\CMakeFiles\arrow-table-test.dir\table_builder_test.cc.obj
ninja: build stopped: subcommand failed.
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)