[jira] [Updated] (ARROW-9733) [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns

2020-08-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9733:
--
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns
> -
>
> Key: ARROW-9733
> URL: https://issues.apache.org/jira/browse/ARROW-9733
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Attachments: repro.csv
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h2. Reproducer:
> Create a table with a string column:
> {code}
> CREATE EXTERNAL TABLE repro(a INT, b VARCHAR)
> STORED AS CSV
> WITH HEADER ROW
> LOCATION 'repro.csv';
> {code}
> The contents of repro.csv are as follows (also attached):
> {code}
> a,b
> 1,One
> 1,Two
> 2,One
> 2,Two
> 2,Two
> {code}
> Now, run a query that tries to aggregate that column:
> {code}
> select a, count(b) from repro group by a;
> {code}
> *Actual behavior*:
> {code}
> > select a, count(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for 
> result of aggregate expression")))
> {code}
> *Expected Behavior*:
> The query runs and produces results:
> {code}
> a, count(b)
> 1,2
> 2,3
> {code}
> h2. Discussion
> Using Min/Max aggregates on varchar also doesn't work (but should):
> {code}
> > select a, min(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for 
> result of aggregate expression")))
> > select a, max(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for 
> result of aggregate expression")))
> {code}
> Fascinatingly, these formulations work fine:
> {code}
> > select a, count(a) from repro group by a;
> +---+--+
> | a | count(a) |
> +---+--+
> | 2 | 3|
> | 1 | 2|
> +---+--+
> 2 row in set. Query took 0 seconds.
> > select a, count(1) from repro group by a;
> +---+-+
> | a | count(UInt8(1)) |
> +---+-+
> | 2 | 3   |
> | 1 | 2   |
> +---+-+
> 2 row in set. Query took 0 seconds.
> {code}
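For reference, the expected MIN semantics over a nullable string column (skip nulls, compare lexicographically) can be sketched in plain Rust. This is an illustration of the intended behaviour only, not DataFusion's actual aggregate code:

```rust
/// SQL-style MIN over a nullable string column: nulls are skipped,
/// and the remaining values are compared lexicographically.
fn min_utf8<'a>(values: &[Option<&'a str>]) -> Option<&'a str> {
    values.iter().filter_map(|v| *v).min()
}

fn main() {
    // The `b` column from repro.csv, grouped under a = 2.
    let b = [Some("One"), Some("Two"), Some("Two")];
    assert_eq!(min_utf8(&b), Some("One"));

    // An all-null group yields NULL.
    let all_null: [Option<&str>; 2] = [None, None];
    assert_eq!(min_utf8(&all_null), None);
}
```

COUNT and MAX follow the same shape (count the `Some` slots, or take `.max()` instead of `.min()`).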



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9778) [Rust] [DataFusion] Logical and physical schemas' nullability does not match in 8 out of 20 end-to-end tests

2020-08-17 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179356#comment-17179356
 ] 

Andy Grove commented on ARROW-9778:
---

Thanks [~jorgecarleitao]. When we construct the logical plan, we do open the 
source data files and infer the schema (unless a schema is provided), so I would 
consider this a bug in the logical plan.

> [Rust] [DataFusion] Logical and physical schemas' nullability does not match 
> in 8 out of 20 end-to-end tests
> 
>
> Key: ARROW-9778
> URL: https://issues.apache.org/jira/browse/ARROW-9778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>
> In `tests/sql.rs`, if we re-write the ```execute``` function to test the end 
> schemas, as
> ```
> /// Execute query and return result set as tab delimited string
> fn execute(ctx: &mut ExecutionContext, sql: &str) -> Vec<String> {
>     let plan = ctx.create_logical_plan(sql).unwrap();
>     let plan = ctx.optimize(&plan).unwrap();
>     let physical_plan = ctx.create_physical_plan(&plan).unwrap();
>     let results = ctx.collect(physical_plan.as_ref()).unwrap();
>     if results.len() > 0 {
>         // results must match the logical schema
>         assert_eq!(plan.schema().as_ref(), results[0].schema().as_ref());
>     }
>     result_str(&results)
> }
> ```
> we end up with 8 tests failing, which indicates that our physical and logical 
> plans are not aligned. In all cases, the issue is nullability: our logical 
> plan assumes nullability = true, while our physical plan may change the 
> nullability field.
> If we do not plan to track nullability on the logical level, we could 
> consider replacing Schema by a type that does not track nullability.
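The failing assertion boils down to schema equality being sensitive to nullability. A minimal self-contained sketch (a toy `Field` type standing in for the arrow crate's, which compares the same way):

```rust
/// Toy stand-in for arrow's Field: the nullable flag participates in
/// equality, so two otherwise-identical schemas that disagree only on
/// nullability never compare equal.
#[derive(Debug, PartialEq)]
struct Field {
    name: &'static str,
    data_type: &'static str,
    nullable: bool,
}

fn main() {
    // Logical plan side: nullability assumed true.
    let logical = vec![Field { name: "c1", data_type: "Utf8", nullable: true }];
    // Physical plan side: nullability tightened to false.
    let physical = vec![Field { name: "c1", data_type: "Utf8", nullable: false }];
    // This is the shape of the assert_eq! that fails in 8 of the 20 tests.
    assert_ne!(logical, physical);
}
```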





[jira] [Commented] (ARROW-9776) [R] read_feather causes segfault in R if file doesn't exist

2020-08-17 Thread Nathan TeBlunthuis (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179355#comment-17179355
 ] 

Nathan TeBlunthuis commented on ARROW-9776:
---

I think my OS supports memory mapping, but I'm not 100% sure how to check.

> [R] read_feather causes segfault in R if file doesn't exist
> ---
>
> Key: ARROW-9776
> URL: https://issues.apache.org/jira/browse/ARROW-9776
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.0
> Environment: R 4.0.2
> Centos 7
>Reporter: Nathan TeBlunthuis
>Priority: Major
>
> This is easy to reproduce. 
>  
> {code:java}
> library(arrow)
> read_feather("test")
> {code}





[jira] [Created] (ARROW-9778) [Rust] [DataFusion] Logical and physical schemas' nullability does not match in 8 out of 20 end-to-end tests

2020-08-17 Thread Jorge (Jira)
Jorge created ARROW-9778:


 Summary: [Rust] [DataFusion] Logical and physical schemas' 
nullability does not match in 8 out of 20 end-to-end tests
 Key: ARROW-9778
 URL: https://issues.apache.org/jira/browse/ARROW-9778
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Jorge







[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179335#comment-17179335
 ] 

Kouhei Sutou commented on ARROW-9744:
-

You need to install Apache Arrow C++ separately or build it while building 
pyarrow.

You can do the latter by {{PYARROW_BUNDLE_ARROW_CPP=1 
PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a" pip install pyarrow}}.

> [Python] Failed to install on aarch64
> -
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is PyArrow failing to install. `pip install pyarrow` fails 
> to build the wheel because -march isn't correctly resolved:
> {noformat}
>  -- System processor: aarch64
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
>  -- Arrow build warning level: PRODUCTION
>  CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>  Unsupported arch flag: -march=.
> {noformat}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up 
> as an architecture such as 'armv8-a' - although some more elaborate logic is 
> really needed to pick up the correct extensions.
> I can see that there have been a number of items discussed in the past, both 
> on Jira and in GitHub issues, ranging from simple fixes to the cmake script to 
> more elaborate cross-product fixes for arch detection - but I wasn't able to 
> discern how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point I would 
> advocate for picking a direction before an influx of new issues.
>  





[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Eamonn Nugent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179311#comment-17179311
 ] 

Eamonn Nugent commented on ARROW-9744:
--

Just tuning back in. Tried out the workaround, and received this:

{code:java}
-- Looking for python3.8
 -- Found Python lib /usr/local/lib/libpython3.8.so
 -- Found PkgConfig: /usr/bin/pkg-config (found version "0.29")
 -- Could NOT find Arrow (missing: Arrow_DIR)
 -- Checking for module 'arrow'
 -- No package 'arrow' found
 CMake Error at 
/usr/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
 Could NOT find Arrow (missing: ARROW_INCLUDE_DIR ARROW_LIB_DIR
 ARROW_FULL_SO_VERSION ARROW_SO_VERSION)
 Call Stack (most recent call first):
 /usr/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:378 
(_FPHSA_FAILURE_MESSAGE)
 cmake_modules/FindArrow.cmake:412 (find_package_handle_standard_args)
 cmake_modules/FindArrowPython.cmake:46 (find_package)
 CMakeLists.txt:210 (find_package)

 -- Configuring incomplete, errors occurred!
 See also 
"/tmp/pip-install-av0q_7o5/pyarrow/build/temp.linux-aarch64-3.8/CMakeFiles/CMakeOutput.log".
 error: command 'cmake' failed with exit status 1
 
 ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
installed directly
The command '/bin/sh -c PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a" pip 
install pyarrow' returned a non-zero code: 1
{code}
 

Failing Dockerfile on an ARMv8 system:
{code:java}
FROM python:3.8-buster
RUN apt update
RUN apt -y install gcc g++ cmake
RUN PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a" pip install pyarrow
{code}
If there's anything I can do to help debug, please, feel free to let me know.

 






[jira] [Commented] (ARROW-8992) [CI][C++] march not passing correctly for docker-compose run

2020-08-17 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179305#comment-17179305
 ] 

Yibo Cai commented on ARROW-8992:
-

I guess the PR below may address this issue:
 [https://github.com/apache/arrow/pull/7982]

> [CI][C++] march not passing correctly for docker-compose run
> 
>
> Key: ARROW-8992
> URL: https://issues.apache.org/jira/browse/ARROW-8992
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Affects Versions: 0.17.0, 0.17.1
> Environment: Mendel Linux 4.0
>Reporter: Elliott Kipp
>Assignee: Krisztian Szucs
>Priority: Critical
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> [https://github.com/apache/arrow/issues/7307]
> Building on the new ASUS Tinker Edge T, running Mendel Linux 4.0 (Day). 
> docker-compose build commands work fine with no errors:
>  DEBIAN=10 ARCH=arm64v8 docker-compose build debian-cpp && DEBIAN=10 
> ARCH=arm64v8 docker-compose build debian-python
> DEBIAN=10 ARCH=arm64v8 docker-compose run debian-python - fails with the 
> following:
> -- Running cmake for pyarrow
>  cmake -DPYTHON_EXECUTABLE=/usr/local/bin/python -G Ninja 
> -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_GANDIVA=on 
> -DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=on -DPYARROW_BUILD_PARQUET=on 
> -DPYARROW_BUILD_PLASMA=on -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
> -DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
> -DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
> -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
> -DCMAKE_BUILD_TYPE=debug /arrow/python
>  -- The C compiler identification is GNU 8.3.0
>  -- The CXX compiler identification is GNU 8.3.0
>  -- Check for working C compiler: /usr/lib/ccache/gcc
>  -- Check for working C compiler: /usr/lib/ccache/gcc -- works
>  -- Detecting C compiler ABI info
>  -- Detecting C compiler ABI info - done
>  -- Detecting C compile features
>  -- Detecting C compile features - done
>  -- Check for working CXX compiler: /usr/lib/ccache/g++
>  -- Check for working CXX compiler: /usr/lib/ccache/g++ -- works
>  -- Detecting CXX compiler ABI info
>  -- Detecting CXX compiler ABI info - done
>  -- Detecting CXX compile features
>  -- Detecting CXX compile features - done
>  -- System processor: aarch64
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
>  -- Arrow build warning level: PRODUCTION
>  CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>  Unsupported arch flag: -march=.
>  Call Stack (most recent call first):
>  CMakeLists.txt:100 (include)
> -- Configuring incomplete, errors occurred!
>  See also "/build/python/temp.linux-aarch64-3.7/CMakeFiles/CMakeOutput.log".
>  See also "/build/python/temp.linux-aarch64-3.7/CMakeFiles/CMakeError.log".
>  error: command 'cmake' failed with exit status 1
> Tried the tarball release for both 0.17.0 and 0.17.1, same result. Also tried 
> compiling manually (following these instructions: 
> [https://dzone.com/articles/building-pyarrow-with-cuda-support]) with the 
> same result.
> Only modifications I made to source are editing the docker-compose volumes, 
> as described here: [https://github.com/apache/arrow/pull/6907]
> Jira opened, per request at: [https://github.com/apache/arrow/issues/7307]





[jira] [Created] (ARROW-9777) [Rust] Implement IPC changes to catch up to 1.0.0 format

2020-08-17 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9777:
-

 Summary: [Rust] Implement IPC changes to catch up to 1.0.0 format
 Key: ARROW-9777
 URL: https://issues.apache.org/jira/browse/ARROW-9777
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


There are a number of IPC changes and features which the Rust implementation 
has fallen behind on. It's effectively using the legacy format that was 
released in 0.14.x.

Some that I encountered are:
 * change padding from 4 bytes to 8 bytes (along with the padding algorithm)
 * add an IPC writer option to support the legacy format and updated format
 * add error handling for the different metadata versions; we should support 
v4+, so it's an oversight not to explicitly return errors when unsupported 
versions are read
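The first bullet's 8-byte padding rule can be sketched as follows (an illustrative helper; the crate's actual implementation may differ):

```rust
/// Round a buffer length up to the next multiple of 8 bytes, as the
/// 1.0.0 IPC format requires (the legacy format padded to 4 bytes).
fn pad_to_8(len: usize) -> usize {
    (len + 7) & !7
}

fn main() {
    assert_eq!(pad_to_8(0), 0);
    assert_eq!(pad_to_8(5), 8);
    assert_eq!(pad_to_8(8), 8);
    assert_eq!(pad_to_8(13), 16);
}
```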

Some of the work already has Jiras open (e.g. body compression); I'll find them 
and mark them as related to this.

I'm tight for spare time, but I'll try to work on this before the next release 
(along with the Parquet writer).





[jira] [Commented] (ARROW-9742) [Rust] Create one standard DataFrame API

2020-08-17 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179274#comment-17179274
 ] 

Neville Dipale commented on ARROW-9742:
---

Hi [~jhorstmann], the scalar functions in the rust-dataframe library mainly 
call the Arrow compute functions. As we have implemented compute functions with 
an array being the smallest unit, I iterate the chunked arrays and call scalar 
functions on the arrays, before grouping them again into a chunk.
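That chunk-wise dispatch can be sketched in plain Rust (toy types for illustration, not rust-dataframe's actual API):

```rust
/// Toy chunked array: a column stored as several contiguous chunks.
struct ChunkedArray {
    chunks: Vec<Vec<i32>>,
}

/// Apply a per-array scalar kernel to each chunk, then regroup the
/// results into a new chunked array, mirroring the approach described.
fn map_chunks(input: &ChunkedArray, kernel: impl Fn(&[i32]) -> Vec<i32>) -> ChunkedArray {
    ChunkedArray {
        chunks: input.chunks.iter().map(|c| kernel(c.as_slice())).collect(),
    }
}

fn main() {
    let ca = ChunkedArray { chunks: vec![vec![1, 2], vec![3]] };
    let out = map_chunks(&ca, |c: &[i32]| c.iter().map(|v| v * 2).collect::<Vec<i32>>());
    assert_eq!(out.chunks, vec![vec![2, 4], vec![6]]);
}
```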

I explored using Rayon for parallelising those compute functions, but it's not a 
priority (the project is really for me to explore ideas, with the goal being to 
create a lazy dataframe à la Spark).

There's scope to add a lot of compute functions to Arrow so that downstream 
users can reuse them, and so we can optimise performance in one place. I 
haven't yet seen interest in functions like trig or temporal functions (I have a 
Jira open for the latter, as I tend to do a lot of datetime conversions), or other 
functions beyond what we have. I think DF has some of these as UDFs; it probably 
makes sense to keep them there for now.

Regarding performance, we've found some patterns that help with 
autovectorisation when writing compute functions; at the least, we could write 
them up so that downstream users can follow them.

One common mistake I've seen is iterating through array values, checking 
whether each slot is valid or null, and computing the function only for valid 
slots. An approach that works better is to ignore nulls during the computation 
and derive the result's nulls from the validity mask.
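The pattern can be sketched in plain Rust, modeling the validity mask as a bool slice (an illustration, not the arrow crate's actual kernel code):

```rust
/// Branch-per-slot version: checks validity inside the hot loop,
/// which tends to block autovectorisation.
fn double_branchy(values: &[i32], validity: &[bool]) -> Vec<Option<i32>> {
    values
        .iter()
        .zip(validity)
        .map(|(v, &ok)| if ok { Some(v * 2) } else { None })
        .collect()
}

/// Validity-mask version: compute over every slot unconditionally
/// (nulls included), then re-apply the mask afterwards. The inner
/// compute loop is branch-free and vectorises well.
fn double_masked(values: &[i32], validity: &[bool]) -> Vec<Option<i32>> {
    let doubled: Vec<i32> = values.iter().map(|v| v * 2).collect();
    doubled
        .into_iter()
        .zip(validity)
        .map(|(v, &ok)| ok.then_some(v))
        .collect()
}

fn main() {
    let vals = [1, 2, 3];
    let valid = [true, false, true];
    // Both give the same result; only the masked version keeps the
    // compute loop free of per-slot branches.
    assert_eq!(double_branchy(&vals, &valid), double_masked(&vals, &valid));
    assert_eq!(double_masked(&vals, &valid), vec![Some(2), None, Some(6)]);
}
```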

> [Rust] Create one standard DataFrame API
> 
>
> Key: ARROW-9742
> URL: https://issues.apache.org/jira/browse/ARROW-9742
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
>  There was a discussion in last Arrow sync call about the fact that there are 
> numerous Rust DataFrame projects and it would be good to have one standard, 
> in the Arrow repo.
> I do think it would be good to have a DataFrame trait in Arrow, with an 
> implementation in DataFusion, and making it possible for other projects to 
> extend/replace the implementation e.g. for distributed compute, or for GPU 
> compute, as two examples. 
> [~jhorstmann] Does this capture what you were suggesting in the call?





[jira] [Commented] (ARROW-9776) [R] read_feather causes segfault in R if file doesn't exist

2020-08-17 Thread Nathan TeBlunthuis (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179260#comment-17179260
 ] 

Nathan TeBlunthuis commented on ARROW-9776:
---

Hi Neal,

Thanks for the help.
{code:java}
read_feather("asdfasdf", mmap = FALSE){code}
also segfaults.

read_parquet, read_json_arrow, and read_ipc_stream also segfault. I didn't try 
the other functions.

I installed the R package from CRAN and then ran
{code:java}
install_arrow{code}
 






[jira] [Comment Edited] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile

2020-08-17 Thread Jeremy Dyer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179252#comment-17179252
 ] 

Jeremy Dyer edited comment on ARROW-9299 at 8/17/20, 10:09 PM:
---

[~calebwin] it is possible, but not currently visible as you mentioned. I think 
the easiest thing to do would be to add a function in `orc/adaptor.cc` that does 
basically the same thing as done here [1]. After that it could be exposed so 
that Python could invoke it, I believe. I'm no expert here, but it seems like 
that would do the trick.

[1] 
[https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/adapters/orc/adapter.cc#L235-L242]


was (Author: jeremy.dyer):
[~calebwin] it is possible but not currently exposed. I think the easiest thing 
to do would be add a function in `orc/adaptor.cc` that did basically the same 
thing done here [1]. After that it would be exposed so that python could invoke 
it I believe? I'm no expert here but seems like that would do the trick.

[1] 
[https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/adapters/orc/adapter.cc#L235-L242]

> [Python] Expose ORC metadata() in Python ORCFile
> 
>
> Key: ARROW-9299
> URL: https://issues.apache.org/jira/browse/ARROW-9299
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.17.1
>Reporter: Jeremy Dyer
>Priority: Major
>
> There is currently no way for a user to directly access the underlying ORC 
> metadata of a given file. It seems the C++ functions and objects already 
> exist; the plumbing is just missing in the Cython/Python layer and 
> potentially a few C++ shims. Giving users the ability to retrieve the 
> metadata without first reading the entire file could help numerous 
> applications to increase their query performance by allowing them to 
> intelligently determine which ORC stripes should be read.  
> This would allow for something like 
> {code:java}
> import pyarrow as pa 
> orc_metadata = pa.orc.ORCFile(filename).metadata()
> {code}





[jira] [Commented] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile

2020-08-17 Thread Jeremy Dyer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179252#comment-17179252
 ] 

Jeremy Dyer commented on ARROW-9299:







[jira] [Assigned] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9744:


Assignee: Apache Arrow JIRA Bot  (was: Kouhei Sutou)






[jira] [Assigned] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9744:


Assignee: Kouhei Sutou  (was: Apache Arrow JIRA Bot)






[jira] [Updated] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9744:
--
Labels: pull-request-available  (was: )

> [Python] Failed to install on aarch64
> -
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is that PyArrow fails to install. `pip install pyarrow` 
> fails to build the wheel because -march isn't correctly resolved:
> {noformat}
>  -- System processor: aarch64
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
>  -- Arrow build warning level: PRODUCTION
>  CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>  Unsupported arch flag: -march=.
> {noformat}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to an 
> architecture such as 'armv8-a', although more elaborate logic is needed to 
> pick up the correct extensions.
> I can see that a number of items have been discussed in the past, both on 
> Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
> more elaborate cross-product arch detection, but I wasn't able to discern 
> how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would 
> advocate picking a direction before an influx of new issues.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179247#comment-17179247
 ] 

Kouhei Sutou commented on ARROW-9744:
-

Ah, I got it.
pyarrow uses only {{SetupCxxFlags.cmake}}. It doesn't use 
{{DefineOptions.cmake}}.

I'll create a pull request to fix this.

Workaround:

{noformat}
PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a" pip install pyarrow
{noformat}
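As the workaround implies, pyarrow's build reads {{PYARROW_CMAKE_OPTIONS}} from the environment and forwards it to the cmake command line. A schematic sketch of that forwarding (the splitting logic here is an illustration, not pyarrow's actual setup.py code):

```python
import os

# Assume the extra options were exported before running pip, e.g.
# PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a"
os.environ["PYARROW_CMAKE_OPTIONS"] = "-DARROW_ARMV8_ARCH=armv8-a"

# Schematic forwarding: split the env var and append it to the cmake argv.
extra_opts = os.environ.get("PYARROW_CMAKE_OPTIONS", "").split()
cmake_argv = ["cmake", *extra_opts, ".."]
print(cmake_argv)  # ['cmake', '-DARROW_ARMV8_ARCH=armv8-a', '..']
```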

> [Python] Failed to install on aarch64
> -
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is that PyArrow fails to install. `pip install pyarrow` 
> fails to build the wheel because -march isn't correctly resolved:
> {noformat}
>  -- System processor: aarch64
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
>  -- Arrow build warning level: PRODUCTION
>  CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>  Unsupported arch flag: -march=.
> {noformat}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to an 
> architecture such as 'armv8-a', although more elaborate logic is needed to 
> pick up the correct extensions.
> I can see that a number of items have been discussed in the past, both on 
> Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
> more elaborate cross-product arch detection, but I wasn't able to discern 
> how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would 
> advocate picking a direction before an influx of new issues.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9745) [Python] Reading Parquet file crashes on windows - python3.8

2020-08-17 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9745:

Summary: [Python] Reading Parquet file crashes on windows - python3.8  
(was: parrow fails to read on windows - python3.8)

> [Python] Reading Parquet file crashes on windows - python3.8
> 
>
> Key: ARROW-9745
> URL: https://issues.apache.org/jira/browse/ARROW-9745
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
> Environment: Installation done with pip:
> pip install pyarrow pandas
> for python3.8 on a windows machine running windows 10 Enterprise (v1809). The 
> resulting wheel is:
> pyarrow-1.0.0-cp38-cp38-win_amd64.whl 
>Reporter: Dylan Modesitt
>Priority: Major
>
> {code:java}
> import pandas as pd
> import numpy as np
> df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), 
> columns=list("1234"))
> df.to_parquet("the.parquet")
> pd.read_parquet("the.parquet")  # fails here
> {code}
> fails with
> {code:java}
> Process finished with exit code -1073741795 (0xC000001D)
> {code}
> {code:java}
> pyarrow.parquet.read_pandas(pyarrow.BufferReader(...)).to_pandas()
> {code}
> also fails with the same exit message. Has this been seen before? Is there a 
> known solution? I experienced the same issue installing the pyarrow nightlies 
> as well.
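As a side note, the exit code -1073741795 quoted above can be decoded into the NTSTATUS value Windows actually returned; 0xC000001D is STATUS_ILLEGAL_INSTRUCTION, which usually means the binary executed a CPU instruction the host doesn't support. A minimal sketch of the conversion:

```python
# Reinterpret the signed 32-bit exit code as the unsigned NTSTATUS value.
exit_code = -1073741795
ntstatus = exit_code & 0xFFFFFFFF
print(hex(ntstatus))  # 0xc000001d (STATUS_ILLEGAL_INSTRUCTION)
```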



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9745) [Python] Reading Parquet file crashes on windows - python3.8

2020-08-17 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179245#comment-17179245
 ] 

Wes McKinney commented on ARROW-9745:
-

Can you provide a reproducible example and any information about your hardware 
(CPU type etc.)?

> [Python] Reading Parquet file crashes on windows - python3.8
> 
>
> Key: ARROW-9745
> URL: https://issues.apache.org/jira/browse/ARROW-9745
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
> Environment: Installation done with pip:
> pip install pyarrow pandas
> for python3.8 on a windows machine running windows 10 Enterprise (v1809). The 
> resulting wheel is:
> pyarrow-1.0.0-cp38-cp38-win_amd64.whl 
>Reporter: Dylan Modesitt
>Priority: Major
>
> {code:java}
> import pandas as pd
> import numpy as np
> df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), 
> columns=list("1234"))
> df.to_parquet("the.parquet")
> pd.read_parquet("the.parquet")  # fails here
> {code}
> fails with
> {code:java}
> Process finished with exit code -1073741795 (0xC000001D)
> {code}
> {code:java}
> pyarrow.parquet.read_pandas(pyarrow.BufferReader(...)).to_pandas()
> {code}
> also fails with the same exit message. Has this been seen before? Is there a 
> known solution? I experienced the same issue installing the pyarrow nightlies 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Eamonn Nugent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179243#comment-17179243
 ] 

Eamonn Nugent edited comment on ARROW-9744 at 8/17/20, 9:48 PM:


Hiya. Just wanted to tune in about `pip install pyarrow` with 1.0 on an ARMv8 
instance (AWS m6g.medium). It seems to have the same error:
{code:java}
CMake Error at cmake_modules/SetupCxxFlags.cmake:368 (message):
 Unsupported arch flag: -march=.
Call Stack (most recent call first):
 CMakeLists.txt:100 (include){code}
 

Is there a good workaround for this? Or should I wait for the next release?


was (Author: space55):
Hiya. Just wanted to tune in about `pip install pyarrow` with 1.0 on an ARMv8 
instance (AWS m6g.medium). It seems to have the same error:


```
CMake Error at cmake_modules/SetupCxxFlags.cmake:368 (message):
 Unsupported arch flag: -march=.
 Call Stack (most recent call first):
 CMakeLists.txt:100 (include)
```

 

Is there a good workaround for this? Or should I wait for the next release

> [Python] Failed to install on aarch64
> -
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is that PyArrow fails to install. `pip install pyarrow` 
> fails to build the wheel because -march isn't correctly resolved:
> {noformat}
>  -- System processor: aarch64
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
>  -- Arrow build warning level: PRODUCTION
>  CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>  Unsupported arch flag: -march=.
> {noformat}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to an 
> architecture such as 'armv8-a', although more elaborate logic is needed to 
> pick up the correct extensions.
> I can see that a number of items have been discussed in the past, both on 
> Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
> more elaborate cross-product arch detection, but I wasn't able to discern 
> how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would 
> advocate picking a direction before an influx of new issues.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Eamonn Nugent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179243#comment-17179243
 ] 

Eamonn Nugent commented on ARROW-9744:
--

Hiya. Just wanted to tune in about `pip install pyarrow` with 1.0 on an ARMv8 
instance (AWS m6g.medium). It seems to have the same error:


{code:java}
CMake Error at cmake_modules/SetupCxxFlags.cmake:368 (message):
 Unsupported arch flag: -march=.
 Call Stack (most recent call first):
 CMakeLists.txt:100 (include)
{code}

 

Is there a good workaround for this? Or should I wait for the next release?

> [Python] Failed to install on aarch64
> -
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is that PyArrow fails to install. `pip install pyarrow` 
> fails to build the wheel because -march isn't correctly resolved:
> {noformat}
>  -- System processor: aarch64
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
>  -- Arrow build warning level: PRODUCTION
>  CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>  Unsupported arch flag: -march=.
> {noformat}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to an 
> architecture such as 'armv8-a', although more elaborate logic is needed to 
> pick up the correct extensions.
> I can see that a number of items have been discussed in the past, both on 
> Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
> more elaborate cross-product arch detection, but I wasn't able to discern 
> how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would 
> advocate picking a direction before an influx of new issues.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9776) [R] read_feather causes segfault in R if file doesn't exist

2020-08-17 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9776:

Summary: [R] read_feather causes segfault in R if file doesn't exist  (was: 
read_feather causes segfault in R if file doesn't exist)

> [R] read_feather causes segfault in R if file doesn't exist
> ---
>
> Key: ARROW-9776
> URL: https://issues.apache.org/jira/browse/ARROW-9776
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.0
> Environment: R 4.0.2
> Centos 7
>Reporter: Nathan TeBlunthuis
>Priority: Major
>
> This is easy to reproduce. 
>  
> {code:java}
> library(arrow)
> read_feather("test")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9729) [Java] Error Prone causes other annotation processors to not work with Eclipse

2020-08-17 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9729:

Summary: [Java] Error Prone causes other annotation processors to not work 
with Eclipse  (was: Error Prone causes other annotation processors to not work 
with Eclipse)

> [Java] Error Prone causes other annotation processors to not work with Eclipse
> --
>
> Key: ARROW-9729
> URL: https://issues.apache.org/jira/browse/ARROW-9729
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Laurent Goujon
>Assignee: Laurent Goujon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> An incompatibility between Eclipse (m2e-apt) and Error Prone prevents other 
> annotation processors from working correctly within Eclipse, which is 
> especially an issue with the Immutables.org annotation processor, as it 
> generates classes needed for the project to compile.
> This is explained in more detail in this bug report for the m2e-apt Eclipse 
> plugin: https://github.com/jbosstools/m2e-apt/issues/62
> There's no easy workaround Eclipse users can apply by themselves, but the 
> Arrow project could avoid including Error Prone as an annotation processor 
> when imported into Eclipse, so that the other annotation processors can 
> work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9776) read_feather causes segfault in R if file doesn't exist

2020-08-17 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179240#comment-17179240
 ] 

Neal Richardson commented on ARROW-9776:


Works on my machine (well, fails gracefully):

{code}
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

timestamp

> read_feather("asdfasdf")
Error in io___MemoryMappedFile__Open(path, mode) : 
  IOError: Failed to open local file 'asdfasdf'. Detail: [errno 2] No such file 
or directory
{code}

Does your file system support memory mapping? Does {{read_feather("test", mmap 
= FALSE)}} also segfault? Do other read_* functions behave the same?

Can you provide details on how you've installed the R package?



> read_feather causes segfault in R if file doesn't exist
> ---
>
> Key: ARROW-9776
> URL: https://issues.apache.org/jira/browse/ARROW-9776
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.0
> Environment: R 4.0.2
> Centos 7
>Reporter: Nathan TeBlunthuis
>Priority: Major
>
> This is easy to reproduce. 
>  
> {code:java}
> library(arrow)
> read_feather("test")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9776) read_feather causes segfault in R if file doesn't exist

2020-08-17 Thread Nathan TeBlunthuis (Jira)
Nathan TeBlunthuis created ARROW-9776:
-

 Summary: read_feather causes segfault in R if file doesn't exist
 Key: ARROW-9776
 URL: https://issues.apache.org/jira/browse/ARROW-9776
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 1.0.0
 Environment: R 4.0.2
Centos 7
Reporter: Nathan TeBlunthuis


This is easy to reproduce. 

 
{code:java}
library(arrow)
read_feather("test")
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179237#comment-17179237
 ] 

Krisztian Szucs commented on ARROW-9744:


It is available now on PyPI: https://pypi.org/project/pyarrow/#files

> [Python] Failed to install on aarch64
> -
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is that PyArrow fails to install. `pip install pyarrow` 
> fails to build the wheel because -march isn't correctly resolved:
> {noformat}
>  -- System processor: aarch64
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
>  -- Arrow build warning level: PRODUCTION
>  CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>  Unsupported arch flag: -march=.
> {noformat}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to an 
> architecture such as 'armv8-a', although more elaborate logic is needed to 
> pick up the correct extensions.
> I can see that a number of items have been discussed in the past, both on 
> Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
> more elaborate cross-product arch detection, but I wasn't able to discern 
> how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would 
> advocate picking a direction before an influx of new issues.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-9744:

Description: 
My team is attempting to migrate some workloads from x86-64 to ARM64; a blocker 
for this is that PyArrow fails to install. `pip install pyarrow` fails to build 
the wheel because -march isn't correctly resolved:

{noformat}
 -- System processor: aarch64
 -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
 -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
 -- Arrow build warning level: PRODUCTION
 CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
 Unsupported arch flag: -march=.
{noformat}

It's possible to get the build to work after editing 
`cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to an 
architecture such as 'armv8-a', although more elaborate logic is needed to 
pick up the correct extensions.

I can see that a number of items have been discussed in the past, both on 
Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
more elaborate cross-product arch detection, but I wasn't able to discern 
how the project wishes to proceed.

With AWS pushing their ARM-based instances heavily at this point, I would 
advocate picking a direction before an influx of new issues.

 

  was:
My team is attempting to migrate some workloads from x86-64 to ARM64, a blocker 
for this is PyArrow failing to install. `pip install pyarrow` fails to build 
the wheel as -march isn't correctly resolved:

{{ -- System processor: aarch64}}
{{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}}
{{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}}
{{ -- Arrow build warning level: PRODUCTION}}
{{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}}
{{ Unsupported arch flag: -march=.}}

It's possible to get the build to work after editing 
`cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up as 
an architecture such as 'armv8-a' - although some more elaborate logic is 
really needed to pick up the correct extensions.

I can see that there  have been a number of items discussed in the past both on 
Jira and in GitHub issues ranging from simple fixes to the cmake script to more 
elaborate fixes cross-product for arch detection - but I wasn't able to discern 
how the project wishes to proceed.

With AWS pushing their ARM-based instances heavily at this point I would 
advocate for picking a direction before an influx of new issues.

 


> [Python] Failed to install on aarch64
> -
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is that PyArrow fails to install. `pip install pyarrow` 
> fails to build the wheel because -march isn't correctly resolved:
> {noformat}
>  -- System processor: aarch64
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
>  -- Arrow build warning level: PRODUCTION
>  CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>  Unsupported arch flag: -march=.
> {noformat}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to an 
> architecture such as 'armv8-a', although more elaborate logic is needed to 
> pick up the correct extensions.
> I can see that a number of items have been discussed in the past, both on 
> Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
> more elaborate cross-product arch detection, but I wasn't able to discern 
> how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would 
> advocate picking a direction before an influx of new issues.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9744) [Python] Failed to install on aarch64

2020-08-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-9744:

Summary: [Python] Failed to install on aarch64  (was: [Python] aarch64 
Installation Error)

> [Python] Failed to install on aarch64
> -
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is that PyArrow fails to install. `pip install pyarrow` 
> fails to build the wheel because -march isn't correctly resolved:
> {{ -- System processor: aarch64}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}}
> {{ -- Arrow build warning level: PRODUCTION}}
> {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}}
> {{ Unsupported arch flag: -march=.}}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to an 
> architecture such as 'armv8-a', although more elaborate logic is needed to 
> pick up the correct extensions.
> I can see that a number of items have been discussed in the past, both on 
> Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
> more elaborate cross-product arch detection, but I wasn't able to discern 
> how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would 
> advocate picking a direction before an influx of new issues.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9744) [Python] aarch64 Installation Error

2020-08-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-9744:
---

Assignee: Kouhei Sutou

> [Python] aarch64 Installation Error
> ---
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is that PyArrow fails to install. `pip install pyarrow` 
> fails to build the wheel because -march isn't correctly resolved:
> {{ -- System processor: aarch64}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}}
> {{ -- Arrow build warning level: PRODUCTION}}
> {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}}
> {{ Unsupported arch flag: -march=.}}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to an 
> architecture such as 'armv8-a', although more elaborate logic is needed to 
> pick up the correct extensions.
> I can see that a number of items have been discussed in the past, both on 
> Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
> more elaborate cross-product arch detection, but I wasn't able to discern 
> how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would 
> advocate picking a direction before an influx of new issues.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile

2020-08-17 Thread Caleb Winston (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179231#comment-17179231
 ] 

Caleb Winston commented on ARROW-9299:
--

[~jeremy.dyer] Is it possible to get the metadata using arrow-cpp, though? I'm 
seeing a private field [1] storing an ORC `Reader` which could be used to get 
the metadata. There isn't a way to access this through the C++ API even though 
the metadata is in there - correct?

 [1] 
[https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/adapters/orc/adapter.cc#L411]

> [Python] Expose ORC metadata() in Python ORCFile
> 
>
> Key: ARROW-9299
> URL: https://issues.apache.org/jira/browse/ARROW-9299
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.17.1
>Reporter: Jeremy Dyer
>Priority: Major
>
> There is currently no way for a user to directly access the underlying ORC 
> metadata of a given file. It seems the C++ functions and objects already 
> exist; the plumbing is just missing in the Cython/Python layer and 
> potentially a few C++ shims. Giving users the ability to retrieve the 
> metadata without first reading the entire file could help numerous 
> applications increase their query performance by allowing them to 
> intelligently determine which ORC stripes should be read.
> This would allow for something like:
> {code:java}
> import pyarrow as pa 
> orc_metadata = pa.orc.ORCFile(filename).metadata()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile

2020-08-17 Thread Caleb Winston (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179230#comment-17179230
 ] 

Caleb Winston edited comment on ARROW-9299 at 8/17/20, 9:16 PM:


This would be very useful for our use-case in cuDF where we want to select 
stripes to read onto GPU based on statistics stored in the ORC metadata.

Edit: Didn't see who was posting this haha.


was (Author: calebwin):
This would be very useful for our use-case in cuDF where we want to select 
stripes to read onto GPU based on statistics stored in the ORC metadata.

> [Python] Expose ORC metadata() in Python ORCFile
> 
>
> Key: ARROW-9299
> URL: https://issues.apache.org/jira/browse/ARROW-9299
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.17.1
>Reporter: Jeremy Dyer
>Priority: Major
>
> There is currently no way for a user to directly access the underlying ORC 
> metadata of a given file. It seems the C++ functions and objects already 
> exist; the plumbing is just missing in the Cython/Python layer and 
> potentially a few C++ shims. Giving users the ability to retrieve the 
> metadata without first reading the entire file could help numerous 
> applications increase their query performance by allowing them to 
> intelligently determine which ORC stripes should be read.
> This would allow for something like:
> {code:java}
> import pyarrow as pa 
> orc_metadata = pa.orc.ORCFile(filename).metadata()
> {code}





[jira] [Commented] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile

2020-08-17 Thread Caleb Winston (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179230#comment-17179230
 ] 

Caleb Winston commented on ARROW-9299:
--

This would be very useful for our use-case in cuDF where we want to select 
stripes to read onto GPU based on statistics stored in the ORC metadata.

> [Python] Expose ORC metadata() in Python ORCFile
> 
>
> Key: ARROW-9299
> URL: https://issues.apache.org/jira/browse/ARROW-9299
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.17.1
>Reporter: Jeremy Dyer
>Priority: Major
>





[jira] [Commented] (ARROW-9775) Automatic S3 region selection

2020-08-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179229#comment-17179229
 ] 

Antoine Pitrou commented on ARROW-9775:
---

Do you want to submit a PR with the desired changes?

> Automatic S3 region selection
> -
>
> Key: ARROW-9775
> URL: https://issues.apache.org/jira/browse/ARROW-9775
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
> Environment: macOS, Linux.
>Reporter: Sahil Gupta
>Priority: Major
>
> Currently, PyArrow and ArrowCpp need to be provided the region of the S3 
> file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow and 
> ArrowCpp can automatically detect the region and get the files, etc. For 
> instance, s3fs and boto3 can read and write files without having to specify 
> the region explicitly. Similar functionality to auto-detect the region would 
> be great to have in PyArrow and ArrowCpp.
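One piece of the detection logic can be sketched in pure Python (an assumption-laden illustration: it presumes the region is looked up the way boto3's GetBucketLocation call works, where buckets in us-east-1 report a null location constraint):

```python
def normalize_location(constraint):
    """S3 GetBucketLocation returns None/"" for buckets in us-east-1."""
    return constraint or "us-east-1"

# e.g. resp = boto3.client("s3").get_bucket_location(Bucket=bucket)
#      region = normalize_location(resp["LocationConstraint"])
assert normalize_location(None) == "us-east-1"
assert normalize_location("eu-west-1") == "eu-west-1"
```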





[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error

2020-08-17 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179227#comment-17179227
 ] 

Kouhei Sutou commented on ARROW-9744:
-

Thanks!

> [Python] aarch64 Installation Error
> ---
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is PyArrow failing to install. `pip install pyarrow` fails 
> to build the wheel because -march isn't correctly resolved:
> {{ -- System processor: aarch64}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}}
> {{ -- Arrow build warning level: PRODUCTION}}
> {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}}
> {{ Unsupported arch flag: -march=.}}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up 
> as an architecture such as 'armv8-a', although more elaborate logic is 
> really needed to pick up the correct extensions.
> I can see that a number of items have been discussed in the past, both on 
> Jira and in GitHub issues, ranging from simple fixes to the CMake script to 
> more elaborate cross-product arch detection - but I wasn't able to discern 
> how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would 
> advocate for picking a direction before an influx of new issues.
>  
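The "more elaborate logic" for picking up extensions could map CPU features to an -march value roughly like this pure-Python sketch (the feature-to-extension mapping is illustrative, not Arrow's actual CMake logic):

```python
# Derive an -march flag from an ARMv8 CPU feature list (as found in
# /proc/cpuinfo "Features"). The extension names chosen here are a
# hypothetical subset for illustration.
def armv8_march(features):
    ext = []
    if "aes" in features and "pmull" in features:
        ext.append("crypto")      # crypto extension: AES + polynomial multiply
    if "asimddp" in features:
        ext.append("dotprod")     # dot-product extension
    return "-march=armv8-a" + "".join("+" + e for e in ext)

feats = "fp asimd aes pmull sha1 sha2 crc32 asimddp".split()
assert armv8_march(feats) == "-march=armv8-a+crypto+dotprod"
```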





[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error

2020-08-17 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179226#comment-17179226
 ] 

Krisztian Szucs commented on ARROW-9744:


Ouch, I'm uploading it.

> [Python] aarch64 Installation Error
> ---
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1





[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error

2020-08-17 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179225#comment-17179225
 ] 

Kouhei Sutou commented on ARROW-9744:
-

[~kszucs] It seems that we forgot to release source package to PyPI. Could you 
upload it? (If you prefer, I can do it.)

> [Python] aarch64 Installation Error
> ---
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1





[jira] [Commented] (ARROW-9733) [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns

2020-08-17 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179224#comment-17179224
 ] 

Andrew Lamb commented on ARROW-9733:


Yes, I would think MAX() on strings would use the same ordering as `A` < `B` (aka
lexicographic ordering).
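In pure Python terms (an illustration of the expected semantics, not DataFusion code), that ordering behaves like:

```python
# min/max over strings compare lexicographically, which is what MIN(b)/MAX(b)
# on the repro table's VARCHAR column would be expected to return.
values = ["One", "Two", "One", "Two", "Two"]
assert min(values) == "One"   # "O" sorts before "T"
assert max(values) == "Two"
```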




> [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns
> -
>
> Key: ARROW-9733
> URL: https://issues.apache.org/jira/browse/ARROW-9733
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
> Attachments: repro.csv
>
>
> h2. Reproducer:
> Create a table with a string column:
> Repro:
> {code}
> CREATE EXTERNAL TABLE repro(a INT, b VARCHAR)
> STORED AS CSV
> WITH HEADER ROW
> LOCATION 'repro.csv';
> {code}
> The contents of repro.csv are as follows (also attached):
> {code}
> a,b
> 1,One
> 1,Two
> 2,One
> 2,Two
> 2,Two
> {code}
> Now, run a query that tries to aggregate that column:
> {code}
> select a, count(b) from repro group by a;
> {code}
> *Actual behavior*:
> {code}
> > select a, count(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for 
> result of aggregate expression")))
> {code}
> *Expected Behavior*:
> The query runs and produces results
> {code}
> a, count(b)
> 1,2
> 2,3
> {code}
> h2. Discussion
> Using Min/Max aggregates on varchar also doesn't work (but should):
> {code}
> > select a, min(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for 
> result of aggregate expression")))
> > select a, max(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for 
> result of aggregate expression")))
> {code}
> Fascinatingly these formulations work fine:
> {code}
> > select a, count(a) from repro group by a;
> +---+--+
> | a | count(a) |
> +---+--+
> | 2 | 3|
> | 1 | 2|
> +---+--+
> 2 row in set. Query took 0 seconds.
> > select a, count(1) from repro group by a;
> +---+-+
> | a | count(UInt8(1)) |
> +---+-+
> | 2 | 3   |
> | 1 | 2   |
> +---+-+
> 2 row in set. Query took 0 seconds.
> {code}





[jira] [Updated] (ARROW-9775) Automatic S3 region selection

2020-08-17 Thread Sahil Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Gupta updated ARROW-9775:
---
Description: Currently, PyArrow and ArrowCpp need to be provided the region 
of the S3 file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow 
and ArrowCpp can automatically detect the region and get the files, etc. For 
instance, s3fs and boto3 can read and write files without having to specify the 
region explicitly. Similar functionality to auto-detect the region would be 
great to have in PyArrow and ArrowCpp.  (was: Currently, PyArrow and ArrowCpp 
need to be provided the region of the S3 file/bucket, else it defaults to using 
'us-east-1'. Ideally, PyArrow and ArrowCpp can automatically detect the region 
and get the files, etc.)

> Automatic S3 region selection
> -
>
> Key: ARROW-9775
> URL: https://issues.apache.org/jira/browse/ARROW-9775
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
> Environment: macOS, Linux.
>Reporter: Sahil Gupta
>Priority: Major
>





[jira] [Created] (ARROW-9775) Automatic S3 region selection

2020-08-17 Thread Sahil Gupta (Jira)
Sahil Gupta created ARROW-9775:
--

 Summary: Automatic S3 region selection
 Key: ARROW-9775
 URL: https://issues.apache.org/jira/browse/ARROW-9775
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++, Python
 Environment: macOS, Linux.
Reporter: Sahil Gupta


Currently, PyArrow and ArrowCpp need to be provided the region of the S3 
file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow and 
ArrowCpp can automatically detect the region and get the files, etc.





[jira] [Created] (ARROW-9774) Document metadata

2020-08-17 Thread Mathieu Dutour Sikiric (Jira)
Mathieu Dutour Sikiric created ARROW-9774:
-

 Summary: Document metadata
 Key: ARROW-9774
 URL: https://issues.apache.org/jira/browse/ARROW-9774
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.0.0
 Environment: Linux
Reporter: Mathieu Dutour Sikiric


I would like to write a dataframe to a Parquet file.

The problem I have is that the output dataframe shows up as

{code}
0 {'field0': 5, 'field1': 8}
1 {'field0': 5, 'field1': 8}
2 {'field0': 4, 'field1': 7}
{code}

while what I want is

{code}
0 {'A': 5, 'B': 8}
1 {'A': 5, 'B': 8}
2 {'A': 4, 'B': 7}
{code}

As I understand it, the discrepancy is because I did not pass the metadata in 
the creation of the table. That is, I did

{code:cpp}
schema_metadata = ::arrow::key_value_metadata({{"pandas", metadata.data()}});

schema = std::make_shared<arrow::Schema>(schema_vector, schema_metadata);

arrow_table = arrow::Table::Make(schema, columns, row_group_size);

status = parquet::arrow::WriteTable(*arrow_table, pool, out_stream, 
row_group_size, writer_properties, ...);
{code}

The problem is that I could not find any documentation on how the metadata is 
to be built. Adding documentation would be very helpful.





[jira] [Created] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array

2020-08-17 Thread David Li (Jira)
David Li created ARROW-9773:
---

 Summary: [C++] Take kernel can't handle ChunkedArrays that don't 
fit in an Array
 Key: ARROW-9773
 URL: https://issues.apache.org/jira/browse/ARROW-9773
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 1.0.0
Reporter: David Li


Take() currently concatenates ChunkedArrays first. However, this breaks down 
when calling Take() from a ChunkedArray or Table where concatenating the arrays 
would result in an array that's too large. While inconvenient to implement, it 
would be useful if this case were handled.

This could be done as a higher-level wrapper around Take(), perhaps.

Example in Python:
{code:python}
>>> import pyarrow as pa
>>> pa.__version__
'1.0.0'
>>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
>>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
>>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
>>> table.take([1, 0])
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
  File 
"/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py",
 line 268, in take
return call_function('take', [data, indices], options)
  File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 122, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
{code}

In this example, it would be useful if Take() or a higher-level wrapper could 
generate multiple record batches as output.
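A higher-level wrapper could avoid the concatenation by resolving flat indices per chunk. A minimal pure-Python sketch of that resolution step (illustrative, not the Arrow implementation):

```python
import bisect

def resolve_indices(chunk_lengths, indices):
    """Map flat take indices to (chunk_number, offset_in_chunk) pairs
    without concatenating the chunks."""
    # prefix sums give each chunk's starting flat offset
    starts = [0]
    for n in chunk_lengths:
        starts.append(starts[-1] + n)
    out = []
    for i in indices:
        chunk = bisect.bisect_right(starts, i) - 1   # chunk containing i
        out.append((chunk, i - starts[chunk]))
    return out

# Two single-row chunks, as in the table above: take([1, 0]) maps to
# (chunk 1, offset 0) then (chunk 0, offset 0).
assert resolve_indices([1, 1], [1, 0]) == [(1, 0), (0, 0)]
```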





[jira] [Commented] (ARROW-9733) [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns

2020-08-17 Thread Jorge (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179194#comment-17179194
 ] 

Jorge commented on ARROW-9733:
--

Just to check: would the max/min of a varchar be based on alphabetical (lexicographic) order?

> [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns
> -
>
> Key: ARROW-9733
> URL: https://issues.apache.org/jira/browse/ARROW-9733
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
> Attachments: repro.csv
>
>





[jira] [Resolved] (ARROW-9710) [C++] Generalize Decimal ToString in preparation for Decimal256

2020-08-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9710.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7945
[https://github.com/apache/arrow/pull/7945]

> [C++] Generalize Decimal ToString in preparation for Decimal256
> ---
>
> Key: ARROW-9710
> URL: https://issues.apache.org/jira/browse/ARROW-9710
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Mingyu Zhong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Generalize Decimal ToString method in preparation for introducing Decimal256 
> bit type (and other bit widths as needed).  
>  
>  





[jira] [Assigned] (ARROW-9670) [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client

2020-08-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9670:


Assignee: David Li  (was: Apache Arrow JIRA Bot)

> [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client
> 
>
> Key: ARROW-9670
> URL: https://issues.apache.org/jira/browse/ARROW-9670
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Affects Versions: 1.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This section accidentally recurses and ends up trying to re-acquire a lock: 
> https://github.com/apache/arrow/blob/9c04867930eae5454dbb1ea4c7bd869b12fc6e9d/cpp/src/arrow/flight/client.cc#L215





[jira] [Assigned] (ARROW-9670) [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client

2020-08-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9670:


Assignee: Apache Arrow JIRA Bot  (was: David Li)

> [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client
> 
>
> Key: ARROW-9670
> URL: https://issues.apache.org/jira/browse/ARROW-9670
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Affects Versions: 1.0.0
>Reporter: David Li
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Updated] (ARROW-9670) [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client

2020-08-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9670:
--
Labels: pull-request-available  (was: )

> [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client
> 
>
> Key: ARROW-9670
> URL: https://issues.apache.org/jira/browse/ARROW-9670
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Affects Versions: 1.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Created] (ARROW-9772) Optionally allow for to_pandas to return writeable pandas objects

2020-08-17 Thread Brandon B. Miller (Jira)
Brandon B. Miller created ARROW-9772:


 Summary: Optionally allow for to_pandas to return writeable pandas 
objects
 Key: ARROW-9772
 URL: https://issues.apache.org/jira/browse/ARROW-9772
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Affects Versions: 0.17.1
Reporter: Brandon B. Miller


In cuDF, I'd like to leverage pyarrow to facilitate the conversion from cuDF 
series and dataframe objects into the equivalent pandas objects. Concretely, I'd 
like something like this to work:

`pandas_object = cudf_object.to_arrow().to_pandas()`

This allows us to stay consistent with the way the rest of the pyarrow 
ecosystem handles nulls, dtype conversions, and the like without having to 
reinvent the wheel. However, I noticed that in some zero-copy scenarios, pyarrow 
doesn't seem to fully release the underlying buffers when converting 
`to_pandas()`. The resulting objects are immutable, and if one tries to mutate 
the data they will encounter

`ValueError: assignment destination is read-only`

This creates a slightly strange situation where a user might encounter issues 
that subtly stem from the fact that arrow was used to construct the offending 
pandas object. It would be nice to be able to toggle this behavior using a 
kwarg or something similar. I suspect this could come up in other situations 
where libraries want to convert back and forth between equivalent Python 
objects through arrow and expect the final object they get to behave as if it 
were constructed via other means. 
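The failure mode can be reproduced with numpy alone (a minimal sketch; the read-only flag here stands in for an array backed by an immutable arrow buffer):

```python
import numpy as np

# A read-only numpy array raises the same ValueError on assignment that a
# zero-copy to_pandas() result does.
a = np.arange(3)
a.setflags(write=False)          # simulate a buffer exported read-only
try:
    a[0] = 99
    raised = False
except ValueError as e:
    raised = "read-only" in str(e)
assert raised
```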





[jira] [Commented] (ARROW-9518) [Python] Deprecate pyarrow serialization

2020-08-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179103#comment-17179103
 ] 

Antoine Pitrou commented on ARROW-9518:
---

I widened the topic for this issue, since PyArrow serialization is being made 
obsolete by pickle protocol 5; also, the main users of pyarrow.serialize (i.e. 
Ray) have stopped using it.

> [Python] Deprecate pyarrow serialization
> 
>
> Key: ARROW-9518
> URL: https://issues.apache.org/jira/browse/ARROW-9518
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> Per mailing list discussion





[jira] [Updated] (ARROW-9518) [Python] Deprecate pyarrow serialization

2020-08-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9518:
--
Summary: [Python] Deprecate pyarrow serialization  (was: [Python] Deprecate 
Union-based serialization implemented by pyarrow.serialization)

> [Python] Deprecate pyarrow serialization
> 
>
> Key: ARROW-9518
> URL: https://issues.apache.org/jira/browse/ARROW-9518
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> Per mailing list discussion





[jira] [Commented] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds

2020-08-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179092#comment-17179092
 ] 

Joris Van den Bossche commented on ARROW-9768:
--

Sorry about the noise, another PR had a typo in the issue number, which led to 
this automatically being closed.

> [Python] Pyarrow allows for unsafe conversions of datetime objects to 
> timestamp nanoseconds
> ---
>
> Key: ARROW-9768
> URL: https://issues.apache.org/jira/browse/ARROW-9768
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1, 1.0.0
> Environment: OS: MacOSX Catalina
> Python Version: 3.7
>Reporter: Joshua Lay
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hi, 
> In parquet, I want to store date values as timestamp format with nanoseconds 
> precision. This works fine with most dates except those past 
> pandas.Timestamp.max: 
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html.|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html]
> I was expecting some exception to be raised (like in Pandas), however this 
> did not happen and the value was processed incorrectly. Note that this is 
> with safe=True. Can this please be looked into? Thanks
> Example code:
> {code:python}
> pa.array([datetime(2262,4,12)], type=pa.timestamp("ns"))
> {code}
> Return:
> {code}
> [
>   1677-09-21 00:25:26.290448384
> ]
> {code}
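The arithmetic behind that wraparound can be checked in plain Python (a sketch of the mechanism, not pyarrow's internal code):

```python
from datetime import datetime

# datetime(2262, 4, 12) overflows a 64-bit nanosecond timestamp: its
# nanoseconds-since-epoch count exceeds int64's maximum (2**63 - 1), so an
# unchecked cast wraps around to a negative offset, which renders as 1677.
epoch = datetime(1970, 1, 1)
ns = int((datetime(2262, 4, 12) - epoch).total_seconds()) * 10**9
assert ns > 2**63 - 1                       # does not fit in int64
wrapped = (ns + 2**63) % 2**64 - 2**63      # two's-complement wraparound
assert wrapped < 0                          # a pre-epoch (negative) offset
```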





[jira] [Reopened] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds

2020-08-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reopened ARROW-9768:
--

> [Python] Pyarrow allows for unsafe conversions of datetime objects to 
> timestamp nanoseconds
> ---
>
> Key: ARROW-9768
> URL: https://issues.apache.org/jira/browse/ARROW-9768
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1, 1.0.0
> Environment: OS: MacOSX Catalina
> Python Version: 3.7
>Reporter: Joshua Lay
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>





[jira] [Updated] (ARROW-9771) [Rust] [DataFusion] Predicate Pushdown Improvement: treat predicates separated by AND separately

2020-08-17 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9771:
---
Priority: Minor  (was: Major)

> [Rust] [DataFusion] Predicate Pushdown Improvement: treat predicates 
> separated by AND separately
> 
>
> Key: ARROW-9771
> URL: https://issues.apache.org/jira/browse/ARROW-9771
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Priority: Minor
>
> As discussed by [~jorgecarleitao] and [~houqp] here: 
> https://github.com/apache/arrow/pull/7880#pullrequestreview-468057624
> If a predicate is a conjunction (i.e. clauses AND'd together), each clause can 
> be treated separately (e.g. a single filter expression {{A > 5 And B < 4}} 
> can be broken up, and each of {{A > 5}} and {{B < 4}} can potentially be 
> pushed down to a different level).
> The filter pushdown logic works for the following case (when {{a}} and {{b}} 
> are in separate selections, the predicate for {{a}} is pushed below the 
> {{Aggregate}} in the optimized plan):
> {code}
> Original plan:
> Selection: #b GtEq Int64(1)
>   Selection: #a LtEq Int64(1)
> Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
>   TableScan: test projection=None
> Optimized plan:
> Selection: #b GtEq Int64(1)
>   Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
> Selection: #a LtEq Int64(1)
>   TableScan: test projection=None
> {code}
> But not for this case, when {{a}} and {{b}} are {{AND}}'d together:
> {code}
> Original plan:
> Selection: #a LtEq Int64(1) And #b GtEq Int64(1)
>   Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
> TableScan: test projection=None
> Optimized plan:
> Selection: #a LtEq Int64(1) And #b GtEq Int64(1)
>   Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
> TableScan: test projection=None
> {code}





[jira] [Created] (ARROW-9771) [Rust] [DataFusion] Predicate Pushdown Improvement: treat predicates separated by AND separately

2020-08-17 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-9771:
--

 Summary: [Rust] [DataFusion] Predicate Pushdown Improvement: treat 
predicates separated by AND separately
 Key: ARROW-9771
 URL: https://issues.apache.org/jira/browse/ARROW-9771
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb


As discussed by [~jorgecarleitao] and [~houqp] here: 
https://github.com/apache/arrow/pull/7880#pullrequestreview-468057624

If a predicate is a conjunction (i.e. clauses AND'd together), each clause can 
be treated separately (e.g. a single filter expression {{A > 5 And B < 4}} can 
be broken up, and each of {{A > 5}} and {{B < 4}} can potentially be pushed 
down to a different level).

The filter pushdown logic works for the following case (when {{a}} and {{b}} 
are in separate selections, the predicate for {{a}} is pushed below the 
{{Aggregate}} in the optimized plan):

{code}
Original plan:
Selection: #b GtEq Int64(1)
  Selection: #a LtEq Int64(1)
Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
  TableScan: test projection=None

Optimized plan:
Selection: #b GtEq Int64(1)
  Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
Selection: #a LtEq Int64(1)
  TableScan: test projection=None
{code}

But not for this case, when {{a}} and {{b}} are {{AND}}'d together:

{code}
Original plan:
Selection: #a LtEq Int64(1) And #b GtEq Int64(1)
  Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
TableScan: test projection=None
Optimized plan:
Selection: #a LtEq Int64(1) And #b GtEq Int64(1)
  Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
TableScan: test projection=None
{code}
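The proposed improvement amounts to splitting a filter on its top-level ANDs and pushing each clause down independently. A small illustrative sketch (tuple-based expressions, not DataFusion's actual `Expr` type):

```python
def split_conjunction(expr):
    """Return the list of clauses AND'd together at the top level of expr."""
    if isinstance(expr, tuple) and expr[0] == "AND":
        return split_conjunction(expr[1]) + split_conjunction(expr[2])
    return [expr]

# ("AND", left, right) models `left AND right`; leaves are comparison tuples.
pred = ("AND", (">", "A", 5), ("<", "B", 4))
print(split_conjunction(pred))  # [('>', 'A', 5), ('<', 'B', 4)]
```

Each clause in the returned list can then be checked against the columns available below an `Aggregate` and pushed past it on its own.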






[jira] [Created] (ARROW-9770) [Rust] [DataFusion] Add constant folding to expressions during logically planning

2020-08-17 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-9770:
--

 Summary: [Rust] [DataFusion] Add constant folding to expressions 
during logically planning
 Key: ARROW-9770
 URL: https://issues.apache.org/jira/browse/ARROW-9770
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb


The high-level idea is that if an expression can be partially evaluated during 
planning time then
# The execution time will be reduced
# There may be additional optimizations possible (like removing entire 
LogicalPlan nodes, for example)

I recently saw the following selection expression created (by the [predicate 
push down|https://github.com/apache/arrow/pull/7880])

{code}
Selection: #a Eq Int64(1) And #b GtEq Int64(1) And #a LtEq Int64(1) And #a Eq 
Int64(1) And #b GtEq Int64(1) And #a LtEq Int64(1)
  TableScan: test projection=None
{code}

This could be simplified significantly:
1. Duplicate clauses could be removed (e.g. `#a Eq Int64(1) And #a Eq Int64(1)` 
--> `#a Eq Int64(1)`)
2. Algebraic simplification (e.g. `A = 5 And A <= 5` is the same as `A = 5`)

Inspiration can be taken from the Postgres code that evaluates constant 
expressions: 
https://doxygen.postgresql.org/clauses_8c.html#ac91c4055a7eb3aa6f1bc104479464b28

(In this case, for example, if you have a predicate A=5 then you can 
substitute A=5 into any expression higher up in the plan.)

Other classic optimizations include things such as `A OR TRUE` --> `TRUE`, 
`A AND TRUE` --> `A`, etc.
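A minimal sketch of duplicate-clause removal plus `A AND TRUE --> A`, working on a flat list of AND'd clauses (illustrative only, not the DataFusion optimizer):

```python
def simplify_conjuncts(clauses):
    """Drop literal TRUEs and duplicate clauses from an AND'd clause list."""
    out, seen = [], set()
    for clause in clauses:
        if clause is True:      # A AND TRUE  -->  A
            continue
        if clause in seen:      # A AND A  -->  A
            continue
        seen.add(clause)
        out.append(clause)
    return out or [True]        # an empty conjunction is trivially TRUE

clauses = [("Eq", "#a", 1), True, ("Eq", "#a", 1), ("GtEq", "#b", 1)]
print(simplify_conjuncts(clauses))  # [('Eq', '#a', 1), ('GtEq', '#b', 1)]
```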






[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error

2020-08-17 Thread Matthew Meen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179029#comment-17179029
 ] 

Matthew Meen commented on ARROW-9744:
-

1.0.0 behaves the same as 0.17.1: the value passed to cmake's -march argument 
ends up blank.

> [Python] aarch64 Installation Error
> ---
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is PyArrow failing to install. `pip install pyarrow` fails 
> to build the wheel because -march isn't correctly resolved:
> {{ -- System processor: aarch64}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}}
> {{ -- Arrow build warning level: PRODUCTION}}
> {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}}
> {{ Unsupported arch flag: -march=.}}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up 
> as an architecture such as 'armv8-a' - although some more elaborate logic is 
> really needed to pick up the correct extensions.
> I can see that there have been a number of items discussed in the past, both 
> on Jira and in GitHub issues, ranging from simple fixes to the cmake script 
> to more elaborate arch-detection logic - but I wasn't able to discern how 
> the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point I would 
> advocate for picking a direction before an influx of new issues.
>  
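The "more elaborate logic" would map the reported CPU feature flags onto -march extensions. A rough Python model of the idea (the real fix belongs in cmake_modules/SetupCxxFlags.cmake; the flag mapping here is illustrative):

```python
def armv8_march_flag(features):
    """Build an -march value from /proc/cpuinfo feature flags (sketch only)."""
    extensions = []
    if "crc32" in features:
        extensions.append("crc")
    if {"aes", "sha1", "sha2", "pmull"} & features:
        extensions.append("crypto")
    return "-march=" + "+".join(["armv8-a"] + extensions)

# Feature set taken from the Graviton2 /proc/cpuinfo in this report:
graviton2 = {"fp", "asimd", "aes", "pmull", "sha1", "sha2", "crc32", "atomics"}
print(armv8_march_flag(graviton2))  # -march=armv8-a+crc+crypto
```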





[jira] [Assigned] (ARROW-9495) [C++] Equality assertions don't handle Inf / -Inf properly

2020-08-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9495:
-

Assignee: Liya Fan

> [C++] Equality assertions don't handle Inf /  -Inf properly
> ---
>
> Key: ARROW-9495
> URL: https://issues.apache.org/jira/browse/ARROW-9495
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I got this error when working on a PR which added unit tests:
> {code}
> ../src/arrow/testing/gtest_util.cc:101: Failure
> Failed
> Expected:
>   [
> 2.5,
> inf,
> -inf
>   ]
> Actual:
>   [
> 2.5,
> inf,
> -inf
>   ]
> {code}
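A plausible cause (a guess, not confirmed by the issue) is approximate comparison via a difference: `inf - inf` is NaN, so a tolerance check fails even when both sides are the same infinity. In Python terms:

```python
def approx_equal(a: float, b: float, eps: float = 1e-9) -> bool:
    """Naive tolerance comparison; breaks on infinities since inf - inf is NaN."""
    return abs(a - b) <= eps

inf = float("inf")
print(approx_equal(inf, inf))  # False, even though the values are equal
print(inf == inf)              # True: exact comparison handles infinities fine
```

This would explain why the "Expected" and "Actual" arrays print identically yet the assertion fails.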





[jira] [Resolved] (ARROW-9495) [C++] Equality assertions don't handle Inf / -Inf properly

2020-08-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9495.
---
Resolution: Fixed

Issue resolved by pull request 7826
[https://github.com/apache/arrow/pull/7826]

> [C++] Equality assertions don't handle Inf /  -Inf properly
> ---
>
> Key: ARROW-9495
> URL: https://issues.apache.org/jira/browse/ARROW-9495
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I got this error when working on a PR which added unit tests:
> {code}
> ../src/arrow/testing/gtest_util.cc:101: Failure
> Failed
> Expected:
>   [
> 2.5,
> inf,
> -inf
>   ]
> Actual:
>   [
> 2.5,
> inf,
> -inf
>   ]
> {code}





[jira] [Resolved] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds

2020-08-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9768.
---
Resolution: Fixed

Issue resolved by pull request 7980
[https://github.com/apache/arrow/pull/7980]

> [Python] Pyarrow allows for unsafe conversions of datetime objects to 
> timestamp nanoseconds
> ---
>
> Key: ARROW-9768
> URL: https://issues.apache.org/jira/browse/ARROW-9768
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1, 1.0.0
> Environment: OS: MacOSX Catalina
> Python Version: 3.7
>Reporter: Joshua Lay
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hi, 
> In Parquet, I want to store date values in timestamp format with nanosecond 
> precision. This works fine for most dates except those past 
> pandas.Timestamp.max: 
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html
> I was expecting an exception to be raised (as in Pandas); however, this 
> did not happen and the value was converted incorrectly. Note that this is 
> with safe=True. Can this please be looked into? Thanks
> Example code:
> {code}
> pa.array([datetime(2262, 4, 12)], type=pa.timestamp("ns"))
> {code}
> Return:
> {code}
> [
>   1677-09-21 00:25:26.290448384
> ]
> {code}





[jira] [Created] (ARROW-9769) [Python] Remove skip for in-memory fsspec in test_move_file

2020-08-17 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9769:
--

 Summary: [Python] Remove skip for in-memory fsspec in 
test_move_file
 Key: ARROW-9769
 URL: https://issues.apache.org/jira/browse/ARROW-9769
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs
 Fix For: 2.0.0


Follow-up of https://issues.apache.org/jira/browse/ARROW-9621, to be applied 
once a new version of fsspec is available.





[jira] [Updated] (ARROW-9621) [Python] test_move_file() is failed with fsspec 0.8.0

2020-08-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9621:
---
Issue Type: Bug  (was: Improvement)

> [Python] test_move_file() is failed with fsspec 0.8.0
> -
>
> Key: ARROW-9621
> URL: https://issues.apache.org/jira/browse/ARROW-9621
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Kouhei Sutou
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It works with fsspec 0.7.4: 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34414340/job/os9t8kj9t4afgym9
> Failed with fsspec 0.8.0: 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34422556/job/abedu9it26qvfxkm
> {noformat}
> == FAILURES 
> ===
> __ test_move_file[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))] 
> ___
> fs = 
> pathfn = . at 0x003D04F70B58>
> def test_move_file(fs, pathfn):
> s = pathfn('test-move-source-file')
> t = pathfn('test-move-target-file')
> 
> with fs.open_output_stream(s):
> pass
> 
> >   fs.move(s, t)
> pyarrow\tests\test_fs.py:798: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> pyarrow\_fs.pyx:519: in pyarrow._fs.FileSystem.move
> check_status(self.fs.Move(source, destination))
> pyarrow\_fs.pyx:1024: in pyarrow._fs._cb_move
> handler.move(frombytes(src), frombytes(dest))
> pyarrow\fs.py:199: in move
> self.fs.mv(src, dest, recursive=True)
> C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:744: in mv
> self.copy(path1, path2, recursive=recursive, maxdepth=maxdepth)
> C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:719: in copy
> self.cp_file(p1, p2, **kwargs)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> self =  0x003D01096A78>
> path1 = 'test-move-source-file/', path2 = 'test-move-target-file/'
> kwargs = {'maxdepth': None}
> def cp_file(self, path1, path2, **kwargs):
> if self.isfile(path1):
> >   self.store[path2] = MemoryFile(self, path2, 
> > self.store[path1].getbuffer())
> E   KeyError: 'test-move-source-file/'
> C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:134:
>  KeyError
> {noformat}
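Note the trailing slash in the traceback: cp_file looks up 'test-move-source-file/' while the file was stored without one. A dict standing in for fsspec's in-memory store shows the mismatch (a simplified model, not fsspec's actual code):

```python
# Key as written by open_output_stream on the memory filesystem:
store = {"test-move-source-file": b""}

# Key as looked up by fsspec 0.8.0's cp_file during mv(recursive=True):
lookup = "test-move-source-file/"

print(lookup in store)               # False -> the KeyError in memory.py
print(lookup.rstrip("/") in store)   # True: stripping the slash would find it
```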





[jira] [Updated] (ARROW-9621) [Python] test_move_file() is failed with fsspec 0.8.0

2020-08-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9621:
---
Fix Version/s: 2.0.0
   1.0.1

> [Python] test_move_file() is failed with fsspec 0.8.0
> -
>
> Key: ARROW-9621
> URL: https://issues.apache.org/jira/browse/ARROW-9621
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Kouhei Sutou
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It works with fsspec 0.7.4: 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34414340/job/os9t8kj9t4afgym9
> Failed with fsspec 0.8.0: 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34422556/job/abedu9it26qvfxkm
> {noformat}
> == FAILURES 
> ===
> __ test_move_file[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))] 
> ___
> fs = 
> pathfn = . at 0x003D04F70B58>
> def test_move_file(fs, pathfn):
> s = pathfn('test-move-source-file')
> t = pathfn('test-move-target-file')
> 
> with fs.open_output_stream(s):
> pass
> 
> >   fs.move(s, t)
> pyarrow\tests\test_fs.py:798: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> pyarrow\_fs.pyx:519: in pyarrow._fs.FileSystem.move
> check_status(self.fs.Move(source, destination))
> pyarrow\_fs.pyx:1024: in pyarrow._fs._cb_move
> handler.move(frombytes(src), frombytes(dest))
> pyarrow\fs.py:199: in move
> self.fs.mv(src, dest, recursive=True)
> C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:744: in mv
> self.copy(path1, path2, recursive=recursive, maxdepth=maxdepth)
> C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:719: in copy
> self.cp_file(p1, p2, **kwargs)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> self =  0x003D01096A78>
> path1 = 'test-move-source-file/', path2 = 'test-move-target-file/'
> kwargs = {'maxdepth': None}
> def cp_file(self, path1, path2, **kwargs):
> if self.isfile(path1):
> >   self.store[path2] = MemoryFile(self, path2, 
> > self.store[path1].getbuffer())
> E   KeyError: 'test-move-source-file/'
> C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:134:
>  KeyError
> {noformat}





[jira] [Assigned] (ARROW-9517) [C++][Python] Allow session_token argument when initializing S3FileSystem

2020-08-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9517:
-

Assignee: Matthew Corley

> [C++][Python] Allow session_token argument when initializing S3FileSystem
> -
>
> Key: ARROW-9517
> URL: https://issues.apache.org/jira/browse/ARROW-9517
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.17.1
>Reporter: Matthew Corley
>Assignee: Matthew Corley
>Priority: Major
>  Labels: AWS, filesystem, pull-request-available, s3
> Fix For: 2.0.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> In order to access S3 using temporary credentials (from STS), users must 
> supply a session token in addition to the usual access key and secret key. 
> However, currently, the S3FileSystem class only accepts access_key and 
> secret_key arguments.  The only workaround is to provide the session token as 
> an environment variable, but this is not ideal for a variety of reasons.
> This is a request to allow an optional session_token argument when 
> initializing the S3FileSystem.





[jira] [Resolved] (ARROW-9517) [C++][Python] Allow session_token argument when initializing S3FileSystem

2020-08-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9517.
---
Resolution: Fixed

Issue resolved by pull request 7803
[https://github.com/apache/arrow/pull/7803]

> [C++][Python] Allow session_token argument when initializing S3FileSystem
> -
>
> Key: ARROW-9517
> URL: https://issues.apache.org/jira/browse/ARROW-9517
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.17.1
>Reporter: Matthew Corley
>Priority: Major
>  Labels: AWS, filesystem, pull-request-available, s3
> Fix For: 2.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> In order to access S3 using temporary credentials (from STS), users must 
> supply a session token in addition to the usual access key and secret key. 
> However, currently, the S3FileSystem class only accepts access_key and 
> secret_key arguments.  The only workaround is to provide the session token as 
> an environment variable, but this is not ideal for a variety of reasons.
> This is a request to allow an optional session_token argument when 
> initializing the S3FileSystem.





[jira] [Comment Edited] (ARROW-9633) [C++] Do not toggle memory mapping globally in LocalFileSystem

2020-08-17 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178980#comment-17178980
 ] 

Wes McKinney edited comment on ARROW-9633 at 8/17/20, 1:37 PM:
---

I mostly want to be sure that file formats that are sensitive to a file 
handle's performance characteristics (for example, Parquet files are highly 
sensitive to the latency of reads) are able to understand what they are getting 
so that they can choose to set other options to improve performance. For 
example:

* Will read buffering (or pre-buffering) improve performance?
* Is it OK to make blocking IO calls or should an IO call allow a CPU core to 
be made available to other threads for execution? 
* Do Read calls allocate memory? 

I'm all for abstraction/encapsulation where it makes sense but these issues can 
result in meaningful changes to the wall clock time of accessing data.

I'm fine to take no action right now but if we want Arrow to be the gold 
standard for data access and the platform that people choose to build on we 
should be vigilant. 


was (Author: wesmckinn):
I mostly want to be sure that file formats that are sensitive to a file 
handle's performance characteristics (for example, Parquet files are highly 
sensitive to the latency of reads) are able to understand what they are getting 
so that they can choose to set other options to improve performance. For 
example:

* Will read buffering (or pre-buffering) to improve performance?
* Is it OK to make blocking IO calls or should an IO call allow a CPU core to 
be made available to other threads for execution? 
* Do Read calls allocate memory? 

I'm all for abstraction/encapsulation where it makes sense but these issues can 
result in meaningful changes to the wall clock time of accessing data.

I'm fine to take no action right now but if we want Arrow to be the gold 
standard for data access and the platform that people choose to build on we 
should be vigilant. 

> [C++] Do not toggle memory mapping globally in LocalFileSystem
> --
>
> Key: ARROW-9633
> URL: https://issues.apache.org/jira/browse/ARROW-9633
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> In the context of the Datasets API, some file formats benefit greatly from 
> memory mapping (like Arrow IPC files) while others less so. Additionally, in 
> some scenarios, memory mapping could fail when used on network-attached 
> storage devices. Since a filesystem may be used to read different kinds of 
> files and use both memory mapping and non-memory mapping, and additionally 
> the Datasets API should be able to fall back on non-memory mapping if the 
> attempt to memory map fails, it would make sense to have a non-global option 
> for this:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h
> I would suggest adding a new filesystem API with something like 
> {{OpenMappedInputFile}} with some options to control the behavior when memory 
> mapping is not possible. These options might include:
> * Falling back on a normal RandomAccessFile
> * Reading the entire file into memory (or even tmpfs?) and then wrapping it 
> in a BufferReader
> * Failing
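The three fallback options above could be modeled roughly as follows (hypothetical API name and option values; the real implementation would live in the C++ filesystem layer):

```python
import io
import mmap

def open_mapped_input_file(path, on_failure="random_access"):
    """Try to memory-map path; apply the chosen fallback if mapping fails."""
    f = open(path, "rb")
    try:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except (OSError, ValueError):  # e.g. network storage, or an empty file
        if on_failure == "random_access":
            return f                     # fall back on a normal file handle
        if on_failure == "read_all":
            data = f.read()
            f.close()
            return io.BytesIO(data)      # BufferReader-style in-memory wrapper
        f.close()
        raise                            # on_failure == "fail"
```

An empty file, for instance, cannot be mapped and exercises the fallback path.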





[jira] [Commented] (ARROW-9633) [C++] Do not toggle memory mapping globally in LocalFileSystem

2020-08-17 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178980#comment-17178980
 ] 

Wes McKinney commented on ARROW-9633:
-

I mostly want to be sure that file formats that are sensitive to a file 
handle's performance characteristics (for example, Parquet files are highly 
sensitive to the latency of reads) are able to understand what they are getting 
so that they can choose to set other options to improve performance. For 
example:

* Will read buffering (or pre-buffering) to improve performance?
* Is it OK to make blocking IO calls or should an IO call allow a CPU core to 
be made available to other threads for execution? 
* Do Read calls allocate memory? 

I'm all for abstraction/encapsulation where it makes sense but these issues can 
result in meaningful changes to the wall clock time of accessing data.

I'm fine to take no action right now but if we want Arrow to be the gold 
standard for data access and the platform that people choose to build on we 
should be vigilant. 

> [C++] Do not toggle memory mapping globally in LocalFileSystem
> --
>
> Key: ARROW-9633
> URL: https://issues.apache.org/jira/browse/ARROW-9633
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> In the context of the Datasets API, some file formats benefit greatly from 
> memory mapping (like Arrow IPC files) while others less so. Additionally, in 
> some scenarios, memory mapping could fail when used on network-attached 
> storage devices. Since a filesystem may be used to read different kinds of 
> files and use both memory mapping and non-memory mapping, and additionally 
> the Datasets API should be able to fall back on non-memory mapping if the 
> attempt to memory map fails, it would make sense to have a non-global option 
> for this:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h
> I would suggest adding a new filesystem API with something like 
> {{OpenMappedInputFile}} with some options to control the behavior when memory 
> mapping is not possible. These options might include:
> * Falling back on a normal RandomAccessFile
> * Reading the entire file into memory (or even tmpfs?) and then wrapping it 
> in a BufferReader
> * Failing





[jira] [Updated] (ARROW-9402) [C++] Add portable wrappers for __builtin_add_overflow and friends

2020-08-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9402:
---
Fix Version/s: 1.0.1

> [C++] Add portable wrappers for __builtin_add_overflow and friends
> --
>
> Key: ARROW-9402
> URL: https://issues.apache.org/jira/browse/ARROW-9402
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
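For readers unfamiliar with the builtin: `__builtin_add_overflow(a, b, &r)` stores the wrapped result in `r` and returns whether the mathematical sum left the type's range. A Python model of the int64 case (illustrative; the Arrow wrappers themselves are C++):

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def add_overflow_i64(a: int, b: int):
    """Return (wrapped_result, overflowed), mimicking __builtin_add_overflow."""
    total = a + b
    wrapped = (total + 2**63) % 2**64 - 2**63  # two's-complement wrap
    return wrapped, not (INT64_MIN <= total <= INT64_MAX)

print(add_overflow_i64(INT64_MAX, 1))  # (-9223372036854775808, True)
print(add_overflow_i64(1, 2))          # (3, False)
```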






[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error

2020-08-17 Thread Matthew Meen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178962#comment-17178962
 ] 

Matthew Meen commented on ARROW-9744:
-

There isn't a 1.0.0 .tar.gz on https://pypi.org/simple/pyarrow/, so 
this fails to find the package:

(test_env) ubuntu@ip-10-143-19-162:/usr/local/test_env$ sudo pip install 
pyarrow==1.0.0
ERROR: Could not find a version that satisfies the requirement pyarrow==1.0.0 
(from versions: 0.9.0, 0.10.0, 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.13.0, 0.14.0, 
0.15.1, 0.16.0, 0.17.0, 0.17.1)
ERROR: No matching distribution found for pyarrow==1.0.0

This is the full output for pip install pyarrow, which finds the latest as 
0.17.1: [^pyarrow_017.txt]

I'll try cloning and building 1.0.0 directly shortly.

> [Python] aarch64 Installation Error
> ---
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is PyArrow failing to install. `pip install pyarrow` fails 
> to build the wheel because -march isn't correctly resolved:
> {{ -- System processor: aarch64}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}}
> {{ -- Arrow build warning level: PRODUCTION}}
> {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}}
> {{ Unsupported arch flag: -march=.}}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up 
> as an architecture such as 'armv8-a' - although some more elaborate logic is 
> really needed to pick up the correct extensions.
> I can see that there have been a number of items discussed in the past, both 
> on Jira and in GitHub issues, ranging from simple fixes to the cmake script 
> to more elaborate arch-detection logic - but I wasn't able to discern how 
> the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point I would 
> advocate for picking a direction before an influx of new issues.
>  





[jira] [Updated] (ARROW-9744) [Python] aarch64 Installation Error

2020-08-17 Thread Matthew Meen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Meen updated ARROW-9744:

Attachment: pyarrow_017.txt

> [Python] aarch64 Installation Error
> ---
>
> Key: ARROW-9744
> URL: https://issues.apache.org/jira/browse/ARROW-9744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
> Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp 
> cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part: 0xd0c
> CPU revision: 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 
> (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 
> 10:03:03 UTC 2020
>Reporter: Matthew Meen
>Priority: Major
> Attachments: cmake-info.txt, pyarrow_017.txt
>
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a 
> blocker for this is PyArrow failing to install. `pip install pyarrow` fails 
> to build the wheel because -march isn't correctly resolved:
> {{ -- System processor: aarch64}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}}
> {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}}
> {{ -- Arrow build warning level: PRODUCTION}}
> {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}}
> {{ Unsupported arch flag: -march=.}}
> It's possible to get the build to work after editing 
> `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up 
> as an architecture such as 'armv8-a' - although some more elaborate logic is 
> really needed to pick up the correct extensions.
> I can see that there have been a number of items discussed in the past, both 
> on Jira and in GitHub issues, ranging from simple fixes to the cmake script 
> to more elaborate arch-detection logic - but I wasn't able to discern how 
> the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point I would 
> advocate for picking a direction before an influx of new issues.
>  





[jira] [Commented] (ARROW-9672) [Python][Parquet] Expose _filters_to_expression

2020-08-17 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178948#comment-17178948
 ] 

Krisztian Szucs commented on ARROW-9672:


Since it's mostly an API change, which should be discouraged in patch 
releases, I'm excluding it from 1.0.1.

> [Python][Parquet] Expose _filters_to_expression
> ---
>
> Key: ARROW-9672
> URL: https://issues.apache.org/jira/browse/ARROW-9672
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Caleb Winston
>Priority: Trivial
> Fix For: 1.0.1
>
>
> `_filters_to_expression` converts filters expressed in disjunctive normal 
> form (DNF) to `dataset.Expression`. Can `_filters_to_expression` be added to 
> the public API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9672) [Python][Parquet] Expose _filters_to_expression

2020-08-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9672:
---
Fix Version/s: (was: 1.0.1)
   2.0.0

> [Python][Parquet] Expose _filters_to_expression
> ---
>
> Key: ARROW-9672
> URL: https://issues.apache.org/jira/browse/ARROW-9672
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Caleb Winston
>Priority: Trivial
> Fix For: 2.0.0
>
>
> `_filters_to_expression` converts filters expressed in disjunctive normal 
> form (DNF) to `dataset.Expression`. Can `_filters_to_expression` be added to 
> the public API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds

2020-08-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9768:
--
Labels: pull-request-available  (was: )

> [Python] Pyarrow allows for unsafe conversions of datetime objects to 
> timestamp nanoseconds
> ---
>
> Key: ARROW-9768
> URL: https://issues.apache.org/jira/browse/ARROW-9768
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1, 1.0.0
> Environment: OS: MacOSX Catalina
> Python Version: 3.7
>Reporter: Joshua Lay
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi, 
> In parquet, I want to store date values as timestamp format with nanoseconds 
> precision. This works fine with most dates except those past 
> pandas.Timestamp.max: 
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html.|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html]
> I was expecting some exception to be raised (like in Pandas); however, this 
> did not happen and the value was processed incorrectly. Note that this is 
> with safe=True. Can this please be looked into? Thanks
> Example Code:
> {code}
> pa.array([datetime(2262,4,12)], type=pa.timestamp("ns"))
> {code}
> Return:
> {code}
> [
>   1677-09-21 00:25:26.290448384
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9716) [Rust] [DataFusion] MergeExec should have concurrency limit

2020-08-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9716:
---
Fix Version/s: (was: 1.0.1)

> [Rust] [DataFusion] MergeExec  should have concurrency limit
> 
>
> Key: ARROW-9716
> URL: https://issues.apache.org/jira/browse/ARROW-9716
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> MergeExec currently spins up one thread per input partition which causes apps 
> to effectively hang if there are substantially more partitions than available 
> cores.
> We can implement a configurable limit here pretty easily.
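The fix can be sketched outside of DataFusion. Below is an illustrative Python stand-in (not the Rust MergeExec code); `read_partition` and `merge_partitions` are hypothetical names, and the point is simply that a bounded pool caps how many partitions execute at once instead of spawning one thread per partition.

```python
from concurrent.futures import ThreadPoolExecutor

def read_partition(p):
    # Stand-in for executing one input partition.
    return [p * 10, p * 10 + 1]

def merge_partitions(partitions, max_concurrency=4):
    # A bounded pool runs at most max_concurrency partitions at once,
    # instead of one thread per partition; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        batches = pool.map(read_partition, partitions)
        return [row for batch in batches for row in batch]

print(merge_partitions(range(8), max_concurrency=2))
```

With `max_concurrency` configurable, oversubscription is bounded no matter how many input partitions there are.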



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9716) [Rust] [DataFusion] MergeExec should have concurrency limit

2020-08-17 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178941#comment-17178941
 ] 

Krisztian Szucs edited comment on ARROW-9716 at 8/17/20, 12:06 PM:
---

Depends on the backward-incompatible improvement 
https://github.com/apache/arrow/pull/7958 and on 
https://github.com/apache/arrow/pull/7951, which in turn depends on the former, 
so I'm removing this from the 1.0.1 patch release.



was (Author: kszucs):
Depends on backward incompatible improvement 
https://github.com/apache/arrow/pull/7958 and 
https://github.com/apache/arrow/pull/7951 which also depends on the previous 
dependency.

> [Rust] [DataFusion] MergeExec  should have concurrency limit
> 
>
> Key: ARROW-9716
> URL: https://issues.apache.org/jira/browse/ARROW-9716
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> MergeExec currently spins up one thread per input partition which causes apps 
> to effectively hang if there are substantially more partitions than available 
> cores.
> We can implement a configurable limit here pretty easily.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9716) [Rust] [DataFusion] MergeExec should have concurrency limit

2020-08-17 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178941#comment-17178941
 ] 

Krisztian Szucs commented on ARROW-9716:


Depends on the backward-incompatible improvement 
https://github.com/apache/arrow/pull/7958 and on 
https://github.com/apache/arrow/pull/7951, which in turn depends on the former.

> [Rust] [DataFusion] MergeExec  should have concurrency limit
> 
>
> Key: ARROW-9716
> URL: https://issues.apache.org/jira/browse/ARROW-9716
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> MergeExec currently spins up one thread per input partition which causes apps 
> to effectively hang if there are substantially more partitions than available 
> cores.
> We can implement a configurable limit here pretty easily.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9732) [Rust] [DataFusion] Add "Physical Planner" type thing which can do optimizations

2020-08-17 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-9732.

Resolution: Fixed

Dupe of ARROW-9758, fixed by [~andygrove]

> [Rust] [DataFusion] Add "Physical Planner" type thing which can do 
> optimizations
> 
>
> Key: ARROW-9732
> URL: https://issues.apache.org/jira/browse/ARROW-9732
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Priority: Major
>
> [~andygrove] implemented what I would describe as a "limit pushdown" 
> optimization within Limit here: 
> https://github.com/apache/arrow/pull/7958#discussion_r470175966
> However, it was implemented by directly instantiating Partition objects 
> during plan execution. This "pick the top N from each partition and then pick 
> the top N from the merged result" is an example of operator pushdown that 
> could be done at planning time.
> This ticket tracks the work to add some way to represent this pushdown in the 
> planning stage, rather than at execution, in order to open up more 
> optimization opportunities.
> One example of pushdown that could potentially happen at planning time would 
> be pushing the limit down past Projections.
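The "pick the top N from each partition and then pick the top N from the merged result" strategy described above can be sketched in a few lines. This is an illustrative Python stand-in with a hypothetical `limit_pushdown` name, not DataFusion code:

```python
def limit_pushdown(partitions, n):
    # Push the limit into each partition: every partition yields at most
    # n rows, so the merge step sees at most n * len(partitions) rows.
    trimmed = [part[:n] for part in partitions]
    # Apply the final limit over the merged result.
    merged = [row for part in trimmed for row in part]
    return merged[:n]

print(limit_pushdown([[1, 2, 3], [4, 5], [6, 7, 8, 9]], 2))  # [1, 2]
```

Doing this rewrite at planning time, instead of inside the Limit operator at execution time, is what opens the door to further rewrites such as moving the limit past a Projection.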



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9714) [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort

2020-08-17 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178933#comment-17178933
 ] 

Krisztian Szucs commented on ARROW-9714:


It heavily depends on https://github.com/apache/arrow/pull/7833, which was not 
part of the 1.0 release, so I'm removing this from the 1.0.1 patch release.

> [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort
> --
>
> Key: ARROW-9714
> URL: https://issues.apache.org/jira/browse/ARROW-9714
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TypeCoercionRule not implemented for Limit or Sort, causing TPC-H query 1 to 
> fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9714) [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort

2020-08-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9714:
---
Fix Version/s: (was: 1.0.1)

> [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort
> --
>
> Key: ARROW-9714
> URL: https://issues.apache.org/jira/browse/ARROW-9714
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TypeCoercionRule not implemented for Limit or Sort, causing TPC-H query 1 to 
> fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9556) [Python][C++] Segfaults in UnionArray with null values

2020-08-17 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-9556.

Resolution: Fixed

Issue resolved by pull request 7952
[https://github.com/apache/arrow/pull/7952]

> [Python][C++] Segfaults in UnionArray with null values
> --
>
> Key: ARROW-9556
> URL: https://issues.apache.org/jira/browse/ARROW-9556
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
> Environment: Conda, but pyarrow was installed using pip (in the conda 
> environment)
>Reporter: Jim Pivarski
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0, 1.0.1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Extracting null values from a UnionArray containing nulls and constructing a 
> UnionArray with a bitmask in pyarrow.Array.from_buffers causes segfaults in 
> pyarrow 1.0.0. I have an environment with pyarrow 0.17.0 and all of the 
> following run correctly without segfaults in the older version.
> Here's a UnionArray that works (because there are no nulls):
>  
> {code:java}
> # GOOD
> a = pyarrow.UnionArray.from_sparse(
>  pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
>  [
>  pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]),
>  pyarrow.array([True, True, False, True, False]),
>  ],
> )
> a.to_pylist(){code}
>  
> Here's one that fails when you try a.to_pylist() or even just a[2], because 
> one of the children has a null at 2:
>  
> {code:java}
> # SEGFAULT
> a = pyarrow.UnionArray.from_sparse(
>  pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
>  [
>  pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
>  pyarrow.array([True, True, False, True, False]),
>  ],
> )
> a.to_pylist() # also just a[2] causes a segfault{code}
>  
> Here's another that fails because both children have nulls; the segfault 
> occurs at both positions with nulls:
>  
> {code:java}
> # SEGFAULT
> a = pyarrow.UnionArray.from_sparse(
>  pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
>  [
>  pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
>  pyarrow.array([True, None, False, True, False]),
>  ],
> )
> a.to_pylist() # also a[1] and a[2] cause segfaults{code}
>  
> Here's one that succeeds, but it's dense, rather than sparse:
>  
> {code:java}
> # GOOD
> a = pyarrow.UnionArray.from_dense(
>  pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
>  pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
>  [pyarrow.array([0.0, 1.1, 2.2, 3.3]), pyarrow.array([True, True, False])],
> )
> a.to_pylist(){code}
>  
> Here's a dense that fails because one child has a null:
>  
> {code:java}
> # SEGFAULT
> a = pyarrow.UnionArray.from_dense(
>  pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
>  pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
>  [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, True, False])],
> )
> a.to_pylist() # also just a[3] causes a segfault{code}
>  
> Here's a dense that fails in two positions because both children have a null:
>  
> {code:java}
> # SEGFAULT
> a = pyarrow.UnionArray.from_dense(
>  pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
>  pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
>  [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, None, False])],
> )
> a.to_pylist() # also a[3] and a[5] cause segfaults{code}
>  
> In all of the above, we created the UnionArray using its from_dense method. 
> We could instead create it with pyarrow.Array.from_buffers. If created with 
> content0 and content1 that have no nulls, it's fine, but if created with 
> nulls in the content, it segfaults as soon as you view the null value.
>  
> {code:java}
> # GOOD
> content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4])
> content1 = pyarrow.array([True, True, False, True, False])
> # SEGFAULT
> content0 = pyarrow.array([0.0, 1.1, 2.2, None, 4.4])
> content1 = pyarrow.array([True, True, False, True, False])
> types = pyarrow.union(
>  [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
>  "sparse",
>  [0, 1],
> )
> a = pyarrow.Array.from_buffers(
>  types,
>  5,
>  [
>  None,
>  pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)),
>  ],
>  children=[content0, content1],
> )
> a.to_pylist() # also just a[3] causes a segfault{code}
>  
> Similarly for a dense union.
>  
> {code:java}
> # GOOD
> content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3])
> content1 = pyarrow.array([True, True, False])
> # SEGFAULT
> content0 = pyarrow.array([0.0, 1.1, None, 3.3])
> content1 = pyarrow.array([True, True, False])
> types = pyarrow.union(
>  [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
>  "dense",
>  [0, 1],
> )
> a = pyarrow.Array.from_buffers(
>  types,
>  
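For reference, the sparse-union lookup these examples exercise can be sketched in pure Python (an illustrative stand-in with a hypothetical `sparse_union_get` name, not pyarrow's C++ code): the value at position i comes from the child selected by type_ids[i], and a null there must surface as a null union value rather than being dereferenced blindly, which is the failure mode of the segfaults above.

```python
def sparse_union_get(type_ids, children, i):
    # Every child of a sparse union has the full array length; the value
    # at position i comes from the child selected by type_ids[i].  A None
    # there must be returned as a null, not dereferenced blindly.
    return children[type_ids[i]][i]

type_ids = [0, 1, 0, 0, 1]
children = [[0.0, 1.1, None, 3.3, 4.4], [True, True, False, True, False]]
print([sparse_union_get(type_ids, children, i) for i in range(5)])
# [0.0, True, None, 3.3, False]
```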

[jira] [Commented] (ARROW-9633) [C++] Do not toggle memory mapping globally in LocalFileSystem

2020-08-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178875#comment-17178875
 ] 

Antoine Pitrou commented on ARROW-9633:
---

My concern is that memory-mapping is an optimization specific to local 
filesystem files, and it would burden the generic API with those optimization 
details.

Did you encounter a use case where the current API produces detrimental results? 
Or where the proposed change (attempt to memory-map and then fall back to 
regular reading) would?

> [C++] Do not toggle memory mapping globally in LocalFileSystem
> --
>
> Key: ARROW-9633
> URL: https://issues.apache.org/jira/browse/ARROW-9633
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> In the context of the Datasets API, some file formats benefit greatly from 
> memory mapping (like Arrow IPC files) while others benefit less. Additionally, in 
> some scenarios, memory mapping could fail when used on network-attached 
> storage devices. Since a filesystem may be used to read different kinds of 
> files and use both memory mapping and non-memory mapping, and additionally 
> the Datasets API should be able to fall back on non-memory mapping if the 
> attempt to memory map fails, it would make sense to have a non-global option 
> for this:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h
> I would suggest adding a new filesystem API with something like 
> {{OpenMappedInputFile}} with some options to control the behavior when memory 
> mapping is not possible. These options may be among:
> * Falling back on a normal RandomAccessFile
> * Reading the entire file into memory (or even tmpfs?) and then wrapping it 
> in a BufferReader
> * Failing
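The first fallback option above (attempt to memory-map, else read normally) can be sketched with the Python stdlib. `open_mapped` is a hypothetical name, and the proposed OpenMappedInputFile API does not exist yet; this only illustrates the fallback shape.

```python
import mmap
import os
import tempfile

def open_mapped(path):
    # Try to memory-map the file; fall back to an in-memory buffer when
    # mapping fails (e.g. some network filesystems, or an empty file,
    # which raises ValueError).
    with open(path, "rb") as f:
        try:
            # A successful mmap stays valid after the file object closes.
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except (OSError, ValueError):
            return memoryview(f.read())

# Demo on a throwaway file.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)
buf = open_mapped(path)
print(bytes(buf[:5]))  # b'hello'
```

Either branch returns a readable buffer, so callers don't need to know whether mapping succeeded.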



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-08-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-1231:
-

Assignee: (was: Antoine Pitrou)

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds

2020-08-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178785#comment-17178785
 ] 

Joris Van den Bossche commented on ARROW-9768:
--

[~Joshual] thanks for the report! We should indeed ensure that this raises. 

On casting we already check for this and raise appropriately:

{code}
In [13]: pa.array(np.array([datetime(2262,4,12)])).cast(pa.timestamp('ns'))
...
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out 
of bounds timestamp: 92233728
{code}

but this should also be done in the typed array converter.
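The wrong value is ordinary two's-complement wrap-around: datetime(2262, 4, 12) is just past the largest instant representable in signed 64-bit nanoseconds, and keeping only the low 64 bits reproduces the bogus 1677-09-21 timestamp from the report. A stdlib-only check (no pyarrow needed; `to_ns` is a helper defined here, not a pyarrow API):

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1
epoch = datetime(1970, 1, 1)

def to_ns(dt):
    # Nanoseconds since the Unix epoch as an arbitrary-precision int.
    delta = dt - epoch
    return (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000

ns = to_ns(datetime(2262, 4, 12))
print(ns > INT64_MAX)  # True: out of range for int64 nanoseconds
# Keeping only the low 64 bits (what an unchecked conversion does)
# lands in 1677, matching the bogus value in the report:
wrapped = ns - 2**64
print(epoch + timedelta(microseconds=wrapped // 1000))  # 1677-09-21 00:25:26.290448
```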

> [Python] Pyarrow allows for unsafe conversions of datetime objects to 
> timestamp nanoseconds
> ---
>
> Key: ARROW-9768
> URL: https://issues.apache.org/jira/browse/ARROW-9768
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1, 1.0.0
> Environment: OS: MacOSX Catalina
> Python Version: 3.7
>Reporter: Joshua Lay
>Priority: Minor
> Fix For: 2.0.0
>
>
> Hi, 
> In parquet, I want to store date values as timestamp format with nanoseconds 
> precision. This works fine with most dates except those past 
> pandas.Timestamp.max: 
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html.|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html]
> I was expecting some exception to be raised (like in Pandas); however, this 
> did not happen and the value was processed incorrectly. Note that this is 
> with safe=True. Can this please be looked into? Thanks
> Example Code:
> {code}
> pa.array([datetime(2262,4,12)], type=pa.timestamp("ns"))
> {code}
> Return:
> {code}
> [
>   1677-09-21 00:25:26.290448384
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds

2020-08-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9768:
-
Fix Version/s: 2.0.0

> [Python] Pyarrow allows for unsafe conversions of datetime objects to 
> timestamp nanoseconds
> ---
>
> Key: ARROW-9768
> URL: https://issues.apache.org/jira/browse/ARROW-9768
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1, 1.0.0
> Environment: OS: MacOSX Catalina
> Python Version: 3.7
>Reporter: Joshua Lay
>Priority: Minor
> Fix For: 2.0.0
>
>
> Hi, 
> In parquet, I want to store date values as timestamp format with nanoseconds 
> precision. This works fine with most dates except those past 
> pandas.Timestamp.max: 
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html.|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html]
> I was expecting some exception to be raised (like in Pandas); however, this 
> did not happen and the value was processed incorrectly. Note that this is 
> with safe=True. Can this please be looked into? Thanks
> Example Code:
> {code}
> pa.array([datetime(2262,4,12)], type=pa.timestamp("ns"))
> {code}
> Return:
> {code}
> [
>   1677-09-21 00:25:26.290448384
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9672) [Python][Parquet] Expose _filters_to_expression

2020-08-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9672:
-
Fix Version/s: 1.0.1

> [Python][Parquet] Expose _filters_to_expression
> ---
>
> Key: ARROW-9672
> URL: https://issues.apache.org/jira/browse/ARROW-9672
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Caleb Winston
>Priority: Trivial
> Fix For: 1.0.1
>
>
> `_filters_to_expression` converts filters expressed in disjunctive normal 
> form (DNF) to `dataset.Expression`. Can `_filters_to_expression` be added to 
> the public API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9672) [Python][Parquet] Expose _filters_to_expression

2020-08-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178775#comment-17178775
 ] 

Joris Van den Bossche commented on ARROW-9672:
--

I think the clear disclaimer of such a function would be that the returned 
expression is _only_ to be used to pass to one of the pyarrow.dataset functions 
as the {{filter}} argument. And so when we have a more general expression API, this 
function should also be updated to return this new expression type, so that it 
keeps working for pyarrow.dataset. 
_If_ that is the only case for which the function would be used, I don't think 
there is any risk in increasing the surface area.

Alternatively, we could also accept the DNF-like lists of tuples in the 
pyarrow.dataset functions and methods, so that external projects like dask and 
cudf don't have to convert this to a pyarrow Expression themselves. 
We decided against it (not wanting to expand support for DNF-like nested 
lists), but doing this would actually decrease the exposure of the current 
dataset-specific expressions, as external projects would not need to create them 
to be able to use the filtering functionality.
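For readers unfamiliar with the DNF form: the outer list is an OR of conjunctions, and each inner list is an AND of comparisons. A pure-Python stand-in (hypothetical `filters_to_predicate` name, not pyarrow's private helper, and evaluating rows directly instead of building a dataset Expression):

```python
import operator

# Comparison operators supported in the DNF tuples; "in" is included as
# an example of a set-membership test.
_OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "in": lambda value, seq: value in seq,
}

def filters_to_predicate(dnf):
    # Outer list: OR of conjunctions; inner lists: AND of comparisons.
    def predicate(row):
        return any(
            all(_OPS[op](row[col], val) for col, op, val in conjunction)
            for conjunction in dnf
        )
    return predicate

pred = filters_to_predicate([[("a", "=", 1)], [("a", ">", 1), ("b", "=", "Two")]])
rows = [{"a": 1, "b": "One"}, {"a": 2, "b": "One"}, {"a": 2, "b": "Two"}]
print([r for r in rows if pred(r)])
```

The real `_filters_to_expression` produces a `dataset.Expression` from the same nested-tuple shape rather than evaluating rows itself.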

> [Python][Parquet] Expose _filters_to_expression
> ---
>
> Key: ARROW-9672
> URL: https://issues.apache.org/jira/browse/ARROW-9672
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Caleb Winston
>Priority: Trivial
>
> `_filters_to_expression` converts filters expressed in disjunctive normal 
> form (DNF) to `dataset.Expression`. Can `_filters_to_expression` be added to 
> the public API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9755) pyarrow deserialize return datetime.datetime

2020-08-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178770#comment-17178770
 ] 

Joris Van den Bossche commented on ARROW-9755:
--

Can you show a code example that reproduces your issue? Also, what did it 
return in 0.17.1?

> pyarrow deserialize return datetime.datetime
> 
>
> Key: ARROW-9755
> URL: https://issues.apache.org/jira/browse/ARROW-9755
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Ruotian Luo
>Priority: Major
>
> With the latest pyarrow 1.0, pyarrow deserialize returns datetime.datetime. It 
> was fine with 0.17.1.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9755) [Python] pyarrow deserialize return datetime.datetime

2020-08-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9755:
-
Summary: [Python] pyarrow deserialize return datetime.datetime  (was: 
pyarrow deserialize return datetime.datetime)

> [Python] pyarrow deserialize return datetime.datetime
> -
>
> Key: ARROW-9755
> URL: https://issues.apache.org/jira/browse/ARROW-9755
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Ruotian Luo
>Priority: Major
>
> With the latest pyarrow 1.0, pyarrow deserialize returns datetime.datetime. It 
> was fine with 0.17.1.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9766) [C++][Parquet] Add EngineVersion to properties to allow for toggling new vs old logic

2020-08-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178766#comment-17178766
 ] 

Joris Van den Bossche commented on ARROW-9766:
--

Should this be added to the 1.0.1 milestone?

> [C++][Parquet] Add EngineVersion to properties to allow for toggling new vs 
> old logic
> -
>
> Key: ARROW-9766
> URL: https://issues.apache.org/jira/browse/ARROW-9766
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This will provide an escape hatch in case the new logic somehow has 
> unusable bugs in it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)