[jira] [Updated] (ARROW-9733) [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns
[ https://issues.apache.org/jira/browse/ARROW-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-9733:
----------------------------------
    Labels: pull-request-available  (was: )

> [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns
> -------------------------------------------------------------------------
>
>                 Key: ARROW-9733
>                 URL: https://issues.apache.org/jira/browse/ARROW-9733
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Andrew Lamb
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: repro.csv
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h2. Reproducer
>
> Create a table with a string column:
> {code}
> CREATE EXTERNAL TABLE repro(a INT, b VARCHAR)
> STORED AS CSV
> WITH HEADER ROW
> LOCATION 'repro.csv';
> {code}
> The contents of repro.csv are as follows (also attached):
> {code}
> a,b
> 1,One
> 1,Two
> 2,One
> 2,Two
> 2,Two
> {code}
> Now, run a query that tries to aggregate that column:
> {code}
> select a, count(b) from repro group by a;
> {code}
> *Actual behavior*:
> {code}
> > select a, count(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for result of aggregate expression")))
> {code}
> *Expected behavior*: the query runs and produces results:
> {code}
> a, count(b)
> 1,2
> 2,3
> {code}
>
> h2. Discussion
>
> Using MIN/MAX aggregates on VARCHAR also doesn't work (but should):
> {code}
> > select a, min(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for result of aggregate expression")))
> > select a, max(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for result of aggregate expression")))
> {code}
> Fascinatingly, these formulations work fine:
> {code}
> > select a, count(a) from repro group by a;
> +---+----------+
> | a | count(a) |
> +---+----------+
> | 2 | 3        |
> | 1 | 2        |
> +---+----------+
> 2 rows in set. Query took 0 seconds.
> > select a, count(1) from repro group by a;
> +---+-----------------+
> | a | count(UInt8(1)) |
> +---+-----------------+
> | 2 | 3               |
> | 1 | 2               |
> +---+-----------------+
> 2 rows in set. Query took 0 seconds.
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
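As a sanity check on the expected output in the report above, here is a hypothetical, standalone Rust sketch (not DataFusion code; the `grouped_count` helper and the inlined rows are illustrative) of what `count(b)` grouped by `a` should produce for repro.csv. Note that COUNT over a column counts only non-null values, which is why the null check matters:

```rust
use std::collections::HashMap;

// Count non-null values of `b` per group key `a`, mirroring
// `select a, count(b) from repro group by a`.
fn grouped_count(rows: &[(i32, Option<&str>)]) -> HashMap<i32, usize> {
    let mut counts: HashMap<i32, usize> = HashMap::new();
    for (a, b) in rows {
        // COUNT(b) skips null slots; COUNT(1) would count every row.
        if b.is_some() {
            *counts.entry(*a).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    // The rows of repro.csv from the report.
    let rows = [
        (1, Some("One")),
        (1, Some("Two")),
        (2, Some("One")),
        (2, Some("Two")),
        (2, Some("Two")),
    ];
    let counts = grouped_count(&rows);
    assert_eq!(counts[&1], 2);
    assert_eq!(counts[&2], 3);
    println!("{:?}", counts);
}
```

This matches the expected `a, count(b)` result of `1,2` and `2,3`, and also explains why `count(a)` and `count(1)` succeed: they never touch the Utf8 column at all.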
[jira] [Commented] (ARROW-9778) [Rust] [DataFusion] Logical and physical schemas' nullability does not match in 8 out of 20 end-to-end tests
[ https://issues.apache.org/jira/browse/ARROW-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179356#comment-17179356 ]

Andy Grove commented on ARROW-9778:
-----------------------------------

Thanks [~jorgecarleitao]. When we construct the logical plan, we do open the source data files and infer the schema (unless a schema is provided), so I would consider this a bug in the logical plan.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-9776) [R] read_feather causes segfault in R if file doesn't exist
[ https://issues.apache.org/jira/browse/ARROW-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179355#comment-17179355 ]

Nathan TeBlunthuis commented on ARROW-9776:
-------------------------------------------

I think my OS supports memory mapping, but I'm not 100% sure how to check.

> [R] read_feather causes segfault in R if file doesn't exist
> -----------------------------------------------------------
>
>                 Key: ARROW-9776
>                 URL: https://issues.apache.org/jira/browse/ARROW-9776
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.0.0
>         Environment: R 4.0.2
> Centos 7
>            Reporter: Nathan TeBlunthuis
>            Priority: Major
>
> This is easy to reproduce.
> {code:java}
> library(arrow)
> read_feather("test")
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9778) [Rust] [DataFusion] Logical and physical schemas' nullability does not match in 8 out of 20 end-to-end tests
Jorge created ARROW-9778:
----------------------------

             Summary: [Rust] [DataFusion] Logical and physical schemas' nullability does not match in 8 out of 20 end-to-end tests
                 Key: ARROW-9778
                 URL: https://issues.apache.org/jira/browse/ARROW-9778
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust, Rust - DataFusion
            Reporter: Jorge

In `tests/sql.rs`, if we re-write the `execute` function to test the end schemas, as

```
/// Execute query and return result set as tab-delimited string
fn execute(ctx: &mut ExecutionContext, sql: &str) -> Vec<Vec<String>> {
    let plan = ctx.create_logical_plan(sql).unwrap();
    let plan = ctx.optimize(&plan).unwrap();
    let physical_plan = ctx.create_physical_plan(&plan).unwrap();
    let results = ctx.collect(physical_plan.as_ref()).unwrap();
    if results.len() > 0 {
        // results must match the logical schema
        assert_eq!(plan.schema().as_ref(), results[0].schema().as_ref());
    }
    result_str(&results)
}
```

we end up with 8 tests failing, which indicates that our physical and logical plans are not aligned. In all cases, the issue is nullability: our logical plan assumes nullability = true, while our physical plan may change the nullability field.

If we do not plan to track nullability at the logical level, we could consider replacing Schema with a type that does not track nullability.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
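The mismatch is easy to state with stand-in types (the `Field` struct and `same_ignoring_nullability` helper below are illustrative, not the real arrow crate API): two schemas that agree on field names and types but differ only in the `nullable` flag fail strict equality, while a comparison that ignores nullability passes.

```rust
// Minimal stand-in for a schema field, enough to show the failure mode.
#[derive(Debug, PartialEq, Clone)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

// Compare two schemas on names and types only, ignoring nullability.
fn same_ignoring_nullability(a: &[Field], b: &[Field]) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(x, y)| x.name == y.name && x.data_type == y.data_type)
}

fn main() {
    // Logical plan: assumes nullable = true; physical plan: tightened to false.
    let logical = vec![Field { name: "c1".into(), data_type: "Utf8".into(), nullable: true }];
    let physical = vec![Field { name: "c1".into(), data_type: "Utf8".into(), nullable: false }];

    assert_ne!(logical, physical); // strict equality fails, as in the 8 tests
    assert!(same_ignoring_nullability(&logical, &physical));
}
```

This is essentially the choice the report poses: either make the planner propagate nullability faithfully so strict equality holds, or compare (or store) schemas without the nullability bit.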
[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179335#comment-17179335 ]

Kouhei Sutou commented on ARROW-9744:
-------------------------------------

You need to install Apache Arrow C++ separately or build it while building pyarrow. You can do the latter with {{PYARROW_BUNDLE_ARROW_CPP=1 PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a" pip install pyarrow}}.

> [Python] Failed to install on aarch64
> -------------------------------------
>
>                 Key: ARROW-9744
>                 URL: https://issues.apache.org/jira/browse/ARROW-9744
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.17.1
>         Environment: AWS m6g (ARM64 'Graviton2' CPU):
> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant     : 0x3
> CPU part        : 0xd0c
> CPU revision    : 1
> OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 10:03:03 UTC 2020
>            Reporter: Matthew Meen
>            Assignee: Kouhei Sutou
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: cmake-info.txt, pyarrow_017.txt
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> My team is attempting to migrate some workloads from x86-64 to ARM64; a blocker for this is PyArrow failing to install. `pip install pyarrow` fails to build the wheel because -march isn't correctly resolved:
> {noformat}
> -- System processor: aarch64
> -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
> -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>   Unsupported arch flag: -march=.
> {noformat}
> It's possible to get the build to work after editing `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up as an architecture such as 'armv8-a', although some more elaborate logic is really needed to pick up the correct extensions.
> I can see that there have been a number of items discussed in the past, both on Jira and in GitHub issues, ranging from simple fixes to the cmake script to more elaborate cross-product fixes for arch detection, but I wasn't able to discern how the project wishes to proceed.
> With AWS pushing their ARM-based instances heavily at this point, I would advocate for picking a direction before an influx of new issues.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179311#comment-17179311 ]

Eamonn Nugent commented on ARROW-9744:
--------------------------------------

Just tuning back in. Tried out the workaround, and received this:
{code:java}
-- Looking for python3.8
-- Found Python lib /usr/local/lib/libpython3.8.so
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29")
-- Could NOT find Arrow (missing: Arrow_DIR)
-- Checking for module 'arrow'
--   No package 'arrow' found
CMake Error at /usr/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find Arrow (missing: ARROW_INCLUDE_DIR ARROW_LIB_DIR ARROW_FULL_SO_VERSION ARROW_SO_VERSION)
Call Stack (most recent call first):
  /usr/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  cmake_modules/FindArrow.cmake:412 (find_package_handle_standard_args)
  cmake_modules/FindArrowPython.cmake:46 (find_package)
  CMakeLists.txt:210 (find_package)
-- Configuring incomplete, errors occurred!
See also "/tmp/pip-install-av0q_7o5/pyarrow/build/temp.linux-aarch64-3.8/CMakeFiles/CMakeOutput.log".
error: command 'cmake' failed with exit status 1
ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly
The command '/bin/sh -c PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a" pip install pyarrow' returned a non-zero code: 1
{code}
Failing Dockerfile on an ARMv8 system:
{code:java}
FROM python:3.8-buster

RUN apt update
RUN apt -y install gcc g++ cmake
RUN PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a" pip install pyarrow
{code}
If there's anything I can do to help debug, please feel free to let me know.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-8992) [CI][C++] march not passing correctly for docker-compose run
[ https://issues.apache.org/jira/browse/ARROW-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179305#comment-17179305 ]

Yibo Cai commented on ARROW-8992:
---------------------------------

I guess the PR below may address this issue:
[https://github.com/apache/arrow/pull/7982]

> [CI][C++] march not passing correctly for docker-compose run
> ------------------------------------------------------------
>
>                 Key: ARROW-8992
>                 URL: https://issues.apache.org/jira/browse/ARROW-8992
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Packaging, Python
>    Affects Versions: 0.17.0, 0.17.1
>         Environment: Mendel Linux 4.0
>            Reporter: Elliott Kipp
>            Assignee: Krisztian Szucs
>            Priority: Critical
>         Attachments: CMakeError.log, CMakeOutput.log
>
> [https://github.com/apache/arrow/issues/7307]
> Building on the new ASUS Tinker Edge T, running Mendel Linux 4.0 (Day).
> The docker-compose build commands work fine with no errors:
> {noformat}
> DEBIAN=10 ARCH=arm64v8 docker-compose build debian-cpp && DEBIAN=10 ARCH=arm64v8 docker-compose build debian-python
> {noformat}
> but `DEBIAN=10 ARCH=arm64v8 docker-compose run debian-python` fails with the following:
> {noformat}
> -- Running cmake for pyarrow
> cmake -DPYTHON_EXECUTABLE=/usr/local/bin/python -G Ninja -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_GANDIVA=on -DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=on -DPYARROW_BUILD_PARQUET=on -DPYARROW_BUILD_PLASMA=on -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=debug /arrow/python
> -- The C compiler identification is GNU 8.3.0
> -- The CXX compiler identification is GNU 8.3.0
> -- Check for working C compiler: /usr/lib/ccache/gcc
> -- Check for working C compiler: /usr/lib/ccache/gcc -- works
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Check for working CXX compiler: /usr/lib/ccache/g++
> -- Check for working CXX compiler: /usr/lib/ccache/g++ -- works
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- System processor: aarch64
> -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
> -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>   Unsupported arch flag: -march=.
> Call Stack (most recent call first):
>   CMakeLists.txt:100 (include)
> -- Configuring incomplete, errors occurred!
> See also "/build/python/temp.linux-aarch64-3.7/CMakeFiles/CMakeOutput.log".
> See also "/build/python/temp.linux-aarch64-3.7/CMakeFiles/CMakeError.log".
> error: command 'cmake' failed with exit status 1
> {noformat}
> Tried the tarball release for both 0.17.0 and 0.17.1, same result. Also tried compiling manually (following these instructions: [https://dzone.com/articles/building-pyarrow-with-cuda-support]) with the same result.
> The only modifications I made to the source are edits to the docker-compose volumes, as described here: [https://github.com/apache/arrow/pull/6907]
> Jira opened per request at: [https://github.com/apache/arrow/issues/7307]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-9777) [Rust] Implement IPC changes to catch up to 1.0.0 format
Neville Dipale created ARROW-9777:
-------------------------------------

             Summary: [Rust] Implement IPC changes to catch up to 1.0.0 format
                 Key: ARROW-9777
                 URL: https://issues.apache.org/jira/browse/ARROW-9777
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust
    Affects Versions: 1.0.0
            Reporter: Neville Dipale

There are a number of IPC changes and features which the Rust implementation has fallen behind on. It's effectively using the legacy format that was released in 0.14.x. Some that I encountered are:

* change padding from 4 bytes to 8 bytes (along with the padding algorithm)
* add an IPC writer option to support both the legacy format and the updated format
* add error handling for the different metadata versions; we should support v4+, so it's an oversight not to explicitly return errors if unsupported versions are read

Some of the work already has Jiras open (e.g. body compression); I'll find them and mark them as related to this. I'm tight for spare time, but I'll try to work on this before the next release (along with the Parquet writer).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
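The first bullet in the issue above (moving from 4-byte to 8-byte padding) boils down to rounding buffer lengths up to the next multiple of 8. A minimal sketch of that alignment rule (the `pad_to_8` helper name is illustrative, not the crate's actual API):

```rust
// Round a buffer length up to the next multiple of 8, the alignment the
// 1.0.0 IPC format requires (the legacy format used 4-byte padding).
fn pad_to_8(len: usize) -> usize {
    (len + 7) & !7
}

fn main() {
    assert_eq!(pad_to_8(0), 0);
    assert_eq!(pad_to_8(1), 8);
    assert_eq!(pad_to_8(8), 8);   // already aligned: unchanged
    assert_eq!(pad_to_8(13), 16);
}
```

The bitmask form works because 8 is a power of two; clearing the low three bits of `len + 7` rounds up without a division.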
[jira] [Commented] (ARROW-9742) [Rust] Create one standard DataFrame API
[ https://issues.apache.org/jira/browse/ARROW-9742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179274#comment-17179274 ]

Neville Dipale commented on ARROW-9742:
---------------------------------------

Hi [~jhorstmann], the scalar functions in the rust-dataframe library mainly call the Arrow compute functions. As we have implemented compute functions with an array being the smallest unit, I iterate over the chunked arrays and call scalar functions on the arrays, before grouping them again into a chunk. I explored using Rayon for parallelising those compute functions, but it's not a priority (the project is really for me to explore ideas, with the goal being to create a lazy dataframe à la Spark).

There's scope to add a lot of compute functions to Arrow so that downstream users can reuse them, and so we can optimise performance from one place. I haven't yet seen interest in functions like trig, temporal functions (I have a Jira open for this, as I tend to do a lot of datetime conversions), and other functions beyond what we have. I think DataFusion has some of these as UDFs, which probably makes sense to keep there for now.

Regarding performance, we've found some patterns that help with autovectorisation when writing compute functions; at the least we could write them up so that downstream users can follow them. One common mistake I've seen is iterating through array values, checking whether each slot is valid or null, and computing the function only if valid. An approach that works better is to ignore nulls during the computation and derive them from the validity mask afterwards.
> [Rust] Create one standard DataFrame API
> ----------------------------------------
>
>                 Key: ARROW-9742
>                 URL: https://issues.apache.org/jira/browse/ARROW-9742
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> There was a discussion in the last Arrow sync call about the fact that there are numerous Rust DataFrame projects, and that it would be good to have one standard, in the Arrow repo.
> I do think it would be good to have a DataFrame trait in Arrow, with an implementation in DataFusion, making it possible for other projects to extend/replace the implementation, e.g. for distributed compute or for GPU compute, as two examples.
> [~jhorstmann] Does this capture what you were suggesting in the call?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
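The null-handling pattern Neville describes in his comment above (compute every slot unconditionally, then derive nulls from the validity mask instead of branching per element) can be sketched as follows. The types and the `negate_all` kernel are illustrative stand-ins, not Arrow's actual compute API:

```rust
// Branch-free kernel pattern: apply the operation to every slot, valid or
// not, which keeps the inner loop free of data-dependent branches and lets
// the compiler auto-vectorise. Null-ness is carried by the validity mask.
fn negate_all(values: &[i32], validity: &[bool]) -> (Vec<i32>, Vec<bool>) {
    let result = values.iter().map(|v| -v).collect(); // no per-slot null check
    (result, validity.to_vec()) // nulls come straight from the input mask
}

fn main() {
    let (vals, valid) = negate_all(&[1, 2, 3], &[true, false, true]);
    assert_eq!(vals, vec![-1, -2, -3]); // slot 1 is garbage-but-harmless
    assert_eq!(valid, vec![true, false, true]); // ...because the mask hides it
    println!("{:?} {:?}", vals, valid);
}
```

The key point is that computing a value for a null slot is wasted but cheap work, while a per-slot `if valid` check defeats vectorisation; for unary kernels the output mask is simply the input mask.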
[jira] [Commented] (ARROW-9776) [R] read_feather causes segfault in R if file doesn't exist
[ https://issues.apache.org/jira/browse/ARROW-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179260#comment-17179260 ]

Nathan TeBlunthuis commented on ARROW-9776:
-------------------------------------------

Hi Neal, thanks for the help.
{code:java}
read_feather("asdfasdf", mmap = FALSE){code}
also segfaults. read_parquet, read_json_arrow, and read_ipc_stream also segfault. I didn't try the other functions.

I installed the R package from CRAN and then ran
{code:java}
install_arrow{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Comment Edited] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile
[ https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179252#comment-17179252 ]

Jeremy Dyer edited comment on ARROW-9299 at 8/17/20, 10:09 PM:
---------------------------------------------------------------

[~calebwin] it is possible, but not currently visible as you mentioned. I think the easiest thing to do would be to add a function in `orc/adaptor.cc` that does basically the same thing done here [1]. After that it would be exposed so that python could invoke it, I believe? I'm no expert here, but it seems like that would do the trick.

[1] [https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/adapters/orc/adapter.cc#L235-L242]

was (Author: jeremy.dyer):
[~calebwin] it is possible but not currently exposed. I think the easiest thing to do would be to add a function in `orc/adaptor.cc` that does basically the same thing done here [1]. After that it would be exposed so that python could invoke it, I believe? I'm no expert here, but it seems like that would do the trick.

[1] [https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/adapters/orc/adapter.cc#L235-L242]

> [Python] Expose ORC metadata() in Python ORCFile
> ------------------------------------------------
>
>                 Key: ARROW-9299
>                 URL: https://issues.apache.org/jira/browse/ARROW-9299
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 0.17.1
>            Reporter: Jeremy Dyer
>            Priority: Major
>
> There is currently no way for a user to directly access the underlying ORC metadata of a given file. The C++ functions and objects already exist; the plumbing is just missing in the cython/python layer, plus potentially a few C++ shims. Giving users the ability to retrieve the metadata without first reading the entire file could help numerous applications increase their query performance by allowing them to intelligently determine which ORC stripes should be read.
> This would allow for something like
> {code:java}
> import pyarrow as pa
> orc_metadata = pa.orc.ORCFile(filename).metadata()
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile
[ https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179252#comment-17179252 ]

Jeremy Dyer commented on ARROW-9299:
------------------------------------

[~calebwin] it is possible but not currently exposed. I think the easiest thing to do would be to add a function in `orc/adaptor.cc` that does basically the same thing done here [1]. After that it would be exposed so that python could invoke it, I believe? I'm no expert here, but it seems like that would do the trick.

[1] [https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/adapters/orc/adapter.cc#L235-L242]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Assigned] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Arrow JIRA Bot reassigned ARROW-9744:
--------------------------------------------

    Assignee: Apache Arrow JIRA Bot  (was: Kouhei Sutou)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Assigned] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Arrow JIRA Bot reassigned ARROW-9744:
--------------------------------------------

    Assignee: Kouhei Sutou  (was: Apache Arrow JIRA Bot)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-9744:
--------------------------------------
    Labels: pull-request-available  (was: )

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179247#comment-17179247 ] Kouhei Sutou commented on ARROW-9744: - Ah, I got it. pyarrow uses only {{SetupCxxFlags.cmake}}. It doesn't use {{DefineOptions.cmake}}. I'll create a pull request to fix this. Workaround: {{PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a" pip install pyarrow}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
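The workaround Kou gives can be applied as below when installing from source on aarch64. Note that `armv8-a` is a conservative baseline; the best `-march` value depends on which extensions your CPU supports:

```shell
# Workaround for the aarch64 build failure: pass an explicit ARMv8 arch
# through to the pyarrow source build so SetupCxxFlags.cmake does not
# end up with an empty "-march=" flag.
PYARROW_CMAKE_OPTIONS="-DARROW_ARMV8_ARCH=armv8-a" pip install pyarrow
```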
[jira] [Updated] (ARROW-9745) [Python] Reading Parquet file crashes on windows - python3.8
[ https://issues.apache.org/jira/browse/ARROW-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-9745: Summary: [Python] Reading Parquet file crashes on windows - python3.8 (was: parrow fails to read on windows - python3.8) > [Python] Reading Parquet file crashes on windows - python3.8 > > > Key: ARROW-9745 > URL: https://issues.apache.org/jira/browse/ARROW-9745 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0 > Environment: Installation done with pip: > pip install pyarrow pandas > for python3.8 on a Windows machine running Windows 10 Enterprise (v1809). The > resulting wheel is: > pyarrow-1.0.0-cp38-cp38-win_amd64.whl >Reporter: Dylan Modesitt >Priority: Major > > {code:python} > import pandas as pd > import numpy as np > df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), > columns=list("1234")) > df.to_parquet("the.parquet") > pd.read_parquet("the.parquet") # fails here > {code} > fails with > {code} > Process finished with exit code -1073741795 (0xC000001D) > {code} > {code:python} > pyarrow.parquet.read_pandas(pyarrow.BufferReader(...)).to_pandas() > {code} > also fails with the same exit message. Has this been seen before? Is there a > known solution? I experienced the same issue installing the pyarrow nightlies > as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9745) [Python] Reading Parquet file crashes on windows - python3.8
[ https://issues.apache.org/jira/browse/ARROW-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179245#comment-17179245 ] Wes McKinney commented on ARROW-9745: - Can you provide a reproducible example and any information about your hardware (CPU type etc.)? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179243#comment-17179243 ] Eamonn Nugent edited comment on ARROW-9744 at 8/17/20, 9:48 PM: Hiya. Just wanted to chime in about `pip install pyarrow` with 1.0 on an ARMv8 instance (AWS m6g.medium). It seems to have the same error: {code:java} CMake Error at cmake_modules/SetupCxxFlags.cmake:368 (message): Unsupported arch flag: -march=. Call Stack (most recent call first): CMakeLists.txt:100 (include){code} Is there a good workaround for this? Or should I wait for the next release? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179243#comment-17179243 ] Eamonn Nugent commented on ARROW-9744: -- Hiya. Just wanted to chime in about `pip install pyarrow` with 1.0 on an ARMv8 instance (AWS m6g.medium). It seems to have the same error: {code:java} CMake Error at cmake_modules/SetupCxxFlags.cmake:368 (message): Unsupported arch flag: -march=. Call Stack (most recent call first): CMakeLists.txt:100 (include){code} Is there a good workaround for this? Or should I wait for the next release? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9776) [R] read_feather causes segfault in R if file doesn't exist
[ https://issues.apache.org/jira/browse/ARROW-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-9776: Summary: [R] read_feather causes segfault in R if file doesn't exist (was: read_feather causes segfault in R if file doesn't exist) > [R] read_feather causes segfault in R if file doesn't exist > --- > > Key: ARROW-9776 > URL: https://issues.apache.org/jira/browse/ARROW-9776 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.0 > Environment: R 4.0.2 > Centos 7 >Reporter: Nathan TeBlunthuis >Priority: Major > > This is easy to reproduce. > > {code:java} > library(arrow) > read_feather("test") > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9729) [Java] Error Prone causes other annotation processors to not work with Eclipse
[ https://issues.apache.org/jira/browse/ARROW-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-9729: Summary: [Java] Error Prone causes other annotation processors to not work with Eclipse (was: Error Prone causes other annotation processors to not work with Eclipse) > [Java] Error Prone causes other annotation processors to not work with Eclipse > -- > > Key: ARROW-9729 > URL: https://issues.apache.org/jira/browse/ARROW-9729 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Laurent Goujon >Assignee: Laurent Goujon >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > An incompatibility between Eclipse (m2e-apt) and Error Prone prevents other > annotation processors from working correctly within Eclipse, which is especially > an issue with the Immutables.org annotation processor, as it generates classes > needed for the project to compile. > This is explained in more detail in this bug report for the m2e-apt Eclipse > plugin: https://github.com/jbosstools/m2e-apt/issues/62 > There's no easy workaround Eclipse users can apply by themselves, but the > Arrow project could skip Error Prone as an annotation processor when > being imported into Eclipse, so that the other annotation processors > work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9776) read_feather causes segfault in R if file doesn't exist
[ https://issues.apache.org/jira/browse/ARROW-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179240#comment-17179240 ] Neal Richardson commented on ARROW-9776: Works on my machine (well, fails gracefully): {code} > library(arrow) Attaching package: ‘arrow’ The following object is masked from ‘package:utils’: timestamp > read_feather("asdfasdf") Error in io___MemoryMappedFile__Open(path, mode) : IOError: Failed to open local file 'asdfasdf'. Detail: [errno 2] No such file or directory {code} Does your file system support memory mapping? Does {{read_feather("test", mmap = FALSE)}} also segfault? Do other read_* functions behave the same? Can you provide details on how you've installed the R package? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9776) read_feather causes segfault in R if file doesn't exist
Nathan TeBlunthuis created ARROW-9776: - Summary: read_feather causes segfault in R if file doesn't exist Key: ARROW-9776 URL: https://issues.apache.org/jira/browse/ARROW-9776 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 1.0.0 Environment: R 4.0.2 Centos 7 Reporter: Nathan TeBlunthuis This is easy to reproduce. {code:java} library(arrow) read_feather("test") {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179237#comment-17179237 ] Krisztian Szucs commented on ARROW-9744: It is available now on PyPI: https://pypi.org/project/pyarrow/#files -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-9744: Description: My team is attempting to migrate some workloads from x86-64 to ARM64, a blocker for this is PyArrow failing to install. `pip install pyarrow` fails to build the wheel as -march isn't correctly resolved: {noformat} -- System processor: aarch64 -- Performing Test CXX_SUPPORTS_ARMV8_ARCH -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed -- Arrow build warning level: PRODUCTION CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message): Unsupported arch flag: -march=. {noformat} It's possible to get the build to work after editing `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up as an architecture such as 'armv8-a' - although some more elaborate logic is really needed to pick up the correct extensions. I can see that there have been a number of items discussed in the past both on Jira and in GitHub issues ranging from simple fixes to the cmake script to more elaborate fixes cross-product for arch detection - but I wasn't able to discern how the project wishes to proceed. With AWS pushing their ARM-based instances heavily at this point I would advocate for picking a direction before an influx of new issues. was: My team is attempting to migrate some workloads from x86-64 to ARM64, a blocker for this is PyArrow failing to install. 
`pip install pyarrow` fails to build the wheel as -march isn't correctly resolved: {{ -- System processor: aarch64}} {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}} {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}} {{ -- Arrow build warning level: PRODUCTION}} {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}} {{ Unsupported arch flag: -march=.}} It's possible to get the build to work after editing `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up as an architecture such as 'armv8-a' - although some more elaborate logic is really needed to pick up the correct extensions. I can see that there have been a number of items discussed in the past both on Jira and in GitHub issues ranging from simple fixes to the cmake script to more elaborate fixes cross-product for arch detection - but I wasn't able to discern how the project wishes to proceed. With AWS pushing their ARM-based instances heavily at this point I would advocate for picking a direction before an influx of new issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9744) [Python] Failed to install on aarch64
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-9744: Summary: [Python] Failed to install on aarch64 (was: [Python] aarch64 Installation Error) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9744) [Python] aarch64 Installation Error
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-9744: --- Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile
[ https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179231#comment-17179231 ] Caleb Winston commented on ARROW-9299: -- [~jeremy.dyer] Is it possible to get the metadata using arrow-cpp, though? I'm seeing a private field [1] storing an ORC `Reader` which could be used to get the metadata. There isn't a way to access this through the C++ API even though the metadata is in there - correct? [1] [https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/adapters/orc/adapter.cc#L411] > [Python] Expose ORC metadata() in Python ORCFile > > > Key: ARROW-9299 > URL: https://issues.apache.org/jira/browse/ARROW-9299 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.17.1 >Reporter: Jeremy Dyer >Priority: Major > > There is currently no way for a user to directly access the underlying ORC > metadata of a given file. It seems the C++ functions and objects already > exist, and the plumbing is just missing in the Cython/Python layer and > potentially a few C++ shims. Giving users the ability to retrieve the > metadata without first reading the entire file could help numerous > applications to increase their query performance by allowing them to > intelligently determine which ORC stripes should be read. > This would allow for something like > {code:python} > import pyarrow as pa > orc_metadata = pa.orc.ORCFile(filename).metadata() > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile
[ https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179230#comment-17179230 ] Caleb Winston edited comment on ARROW-9299 at 8/17/20, 9:16 PM: This would be very useful for our use-case in cuDF where we want to select stripes to read onto GPU based on statistics stored in the ORC metadata. Edit: Didn't see who was posting this haha. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9299) [Python] Expose ORC metadata() in Python ORCFile
[ https://issues.apache.org/jira/browse/ARROW-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179230#comment-17179230 ] Caleb Winston commented on ARROW-9299: -- This would be very useful for our use-case in cuDF where we want to select stripes to read onto GPU based on statistics stored in the ORC metadata. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9775) Automatic S3 region selection
[ https://issues.apache.org/jira/browse/ARROW-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179229#comment-17179229 ] Antoine Pitrou commented on ARROW-9775: --- Do you want to submit a PR with the desired changes? > Automatic S3 region selection > - > > Key: ARROW-9775 > URL: https://issues.apache.org/jira/browse/ARROW-9775 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python > Environment: macOS, Linux. >Reporter: Sahil Gupta >Priority: Major > > Currently, PyArrow and ArrowCpp need to be provided the region of the S3 > file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow and > ArrowCpp can automatically detect the region and get the files, etc. For > instance, s3fs and boto3 can read and write files without having to specify > the region explicitly. Similar functionality to auto-detect the region would > be great to have in PyArrow and ArrowCpp. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179227#comment-17179227 ] Kouhei Sutou commented on ARROW-9744: - Thanks! > [Python] aarch64 Installation Error > --- > > Key: ARROW-9744 > URL: https://issues.apache.org/jira/browse/ARROW-9744 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1 > Environment: AWS m6g (ARM64 'Graviton2' CPU): > Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp > cpuid asimdrdm lrcpc dcpop asimddp ssbs > CPU implementer : 0x41 > CPU architecture: 8 > CPU variant : 0x3 > CPU part: 0xd0c > CPU revision: 1 > OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 > (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 > 10:03:03 UTC 2020 >Reporter: Matthew Meen >Priority: Major > Attachments: cmake-info.txt, pyarrow_017.txt > > > My team is attempting to migrate some workloads from x86-64 to ARM64; a > blocker for this is PyArrow failing to install. `pip install pyarrow` fails > to build the wheel as -march isn't correctly resolved: > {{ -- System processor: aarch64}} > {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}} > {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}} > {{ -- Arrow build warning level: PRODUCTION}} > {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}} > {{ Unsupported arch flag: -march=.}} > It's possible to get the build to work after editing > `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up > as an architecture such as 'armv8-a' - although some more elaborate logic is > really needed to pick up the correct extensions. 
> With AWS pushing their ARM-based instances heavily at this point I would > advocate for picking a direction before an influx of new issues. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179226#comment-17179226 ] Krisztian Szucs commented on ARROW-9744: Ouch, I'm uploading it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179225#comment-17179225 ] Kouhei Sutou commented on ARROW-9744: - [~kszucs] It seems that we forgot to release the source package to PyPI. Could you upload it? (If you prefer, I can do it.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9733) [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns
[ https://issues.apache.org/jira/browse/ARROW-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179224#comment-17179224 ] Andrew Lamb commented on ARROW-9733: yes, I would think MAX() on strings would be defined by `A` < `B` (aka lexicographical ordering) -- This message was sent by Atlassian Jira (v8.3.4#803005)
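A plain-Python sketch of the semantics discussed in this thread (a hypothetical helper, not DataFusion code): COUNT over the repro's string column just counts values, and MIN/MAX compare strings lexicographically, the same ordering `A` < `B` gives in Rust and SQL's default collation.

```python
def group_aggregate(rows, agg):
    """Group (a, b) rows by `a` and reduce each group's b-values with `agg`."""
    groups = {}
    for a, b in rows:
        groups.setdefault(a, []).append(b)
    return {a: agg(bs) for a, bs in groups.items()}

# The repro.csv data from the issue description.
rows = [(1, "One"), (1, "Two"), (2, "One"), (2, "Two"), (2, "Two")]
```

With `agg=len` this reproduces the expected `count(b)` output (`1 -> 2`, `2 -> 3`); with `agg=min`/`agg=max` the strings compare lexicographically, so both groups yield "One" and "Two" respectively.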
[jira] [Updated] (ARROW-9775) Automatic S3 region selection
[ https://issues.apache.org/jira/browse/ARROW-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sahil Gupta updated ARROW-9775: --- Description: Currently, PyArrow and ArrowCpp need to be provided the region of the S3 file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow and ArrowCpp can automatically detect the region and get the files, etc. For instance, s3fs and boto3 can read and write files without having to specify the region explicitly. Similar functionality to auto-detect the region would be great to have in PyArrow and ArrowCpp. (was: Currently, PyArrow and ArrowCpp need to be provided the region of the S3 file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow and ArrowCpp can automatically detect the region and get the files, etc.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9775) Automatic S3 region selection
Sahil Gupta created ARROW-9775: -- Summary: Automatic S3 region selection Key: ARROW-9775 URL: https://issues.apache.org/jira/browse/ARROW-9775 Project: Apache Arrow Issue Type: Wish Components: C++, Python Environment: macOS, Linux. Reporter: Sahil Gupta Currently, PyArrow and ArrowCpp need to be provided the region of the S3 file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow and ArrowCpp can automatically detect the region and get the files, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
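A hedged sketch of the auto-detection being requested: the S3 GetBucketLocation API (exposed by boto3 as `get_bucket_location`) returns a `LocationConstraint` with two well-known quirks that any implementation would need to normalize. The helper below is pure so it can be exercised without AWS credentials; the boto3 wiring in the comment is illustrative only.

```python
def region_from_location_constraint(constraint):
    """Normalize an S3 GetBucketLocation LocationConstraint to a region name."""
    if not constraint:          # us-east-1 is reported as None/empty
        return "us-east-1"
    if constraint == "EU":      # legacy value returned by very old buckets
        return "eu-west-1"
    return constraint

# Illustrative wiring, assuming boto3 is available:
#   s3 = boto3.client("s3")
#   resp = s3.get_bucket_location(Bucket=bucket)
#   region = region_from_location_constraint(resp.get("LocationConstraint"))
```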
[jira] [Created] (ARROW-9774) Document metadata
Mathieu Dutour Sikiric created ARROW-9774: - Summary: Document metadata Key: ARROW-9774 URL: https://issues.apache.org/jira/browse/ARROW-9774 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Environment: Linux Reporter: Mathieu Dutour Sikiric I would like to write a dataframe into a Parquet file. The problem I have is that the output dataframe shows up as {code} 0 {'field0': 5, 'field1': 8} 1 {'field0': 5, 'field1': 8} 2 {'field0': 4, 'field1': 7} {code} while what I want is {code} 0 {'A': 5, 'B': 8} 1 {'A': 5, 'B': 8} 2 {'A': 4, 'B': 7} {code} As I understand it, the discrepancy is because I did not pass the metadata when creating the table. That is, I did {code} schema_metadata = ::arrow::key_value_metadata({{"pandas", metadata.data()}}); schema = std::make_shared<arrow::Schema>(schema_vector, schema_metadata); arrow_table = arrow::Table::Make(schema, columns, row_group_size); status = parquet::arrow::WriteTable(*arrow_table, pool, out_stream, row_group_size, writer_properties, ...); {code} The problem is that I could not find any documentation on how the metadata is to be built. Adding documentation would be very helpful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
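For orientation, the value stored under the "pandas" schema-metadata key is a JSON blob describing column names and dtypes. The sketch below shows the rough shape of that blob for the two-column case in the issue; the exact schema is an internal pandas/pyarrow convention, so the field names here should be treated as illustrative, not authoritative.

```python
import json

# Hedged sketch of the JSON typically stored under the "pandas" key of the
# Arrow schema metadata; field names are illustrative.
pandas_metadata = {
    "index_columns": [],
    "columns": [
        {"name": "A", "field_name": "A", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
        {"name": "B", "field_name": "B", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
    ],
}
blob = json.dumps(pandas_metadata)  # this string is what key_value_metadata carries
```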
[jira] [Created] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
David Li created ARROW-9773: --- Summary: [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array Key: ARROW-9773 URL: https://issues.apache.org/jira/browse/ARROW-9773 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 1.0.0 Reporter: David Li Take() currently concatenates ChunkedArrays first. However, this breaks down when calling Take() from a ChunkedArray or Table where concatenating the arrays would result in an array that's too large. While inconvenient to implement, it would be useful if this case were handled. This could be done as a higher-level wrapper around Take(), perhaps. Example in Python: {code:python} >>> import pyarrow as pa >>> pa.__version__ '1.0.0' >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"]) >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"]) >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema) >>> table.take([1, 0]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take File "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py", line 268, in take return call_function('take', [data, indices], options) File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays {code} In this example, it would be useful if Take() or a higher-level wrapper could generate multiple record batches as output. -- This message was sent by Atlassian Jira (v8.3.4#803005)
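The concatenation-free approach the issue hints at can be sketched in plain Python (a toy over lists, not the pyarrow API): resolve each flat index to a (chunk, offset) pair via a prefix-sum of chunk lengths, so no chunk is ever materialized into one oversized array.

```python
import bisect

def chunked_take(chunks, indices):
    """Take rows by flat index from a list of chunks without concatenating."""
    # Prefix sums of chunk lengths let us binary-search the owning chunk.
    offsets = [0]
    for c in chunks:
        offsets.append(offsets[-1] + len(c))
    out = []
    for i in indices:
        chunk_idx = bisect.bisect_right(offsets, i) - 1
        out.append(chunks[chunk_idx][i - offsets[chunk_idx]])
    return out
```

For the record-batch example above, `chunked_take([["a"], ["b"]], [1, 0])` yields `["b", "a"]` without ever building the 2 GiB concatenation.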
[jira] [Commented] (ARROW-9733) [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns
[ https://issues.apache.org/jira/browse/ARROW-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179194#comment-17179194 ] Jorge commented on ARROW-9733: -- Just to check, the max/min of varchar would be their alphabetical order? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9710) [C++] Generalize Decimal ToString in preparation for Decimal256
[ https://issues.apache.org/jira/browse/ARROW-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9710. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 7945 [https://github.com/apache/arrow/pull/7945] > [C++] Generalize Decimal ToString in preparation for Decimal256 > --- > > Key: ARROW-9710 > URL: https://issues.apache.org/jira/browse/ARROW-9710 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Micah Kornfield >Assignee: Mingyu Zhong >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Generalize Decimal ToString method in preparation for introducing Decimal256 > bit type (and other bit widths as needed). > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9670) [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client
[ https://issues.apache.org/jira/browse/ARROW-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9670: Assignee: David Li (was: Apache Arrow JIRA Bot) > [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client > > > Key: ARROW-9670 > URL: https://issues.apache.org/jira/browse/ARROW-9670 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Affects Versions: 1.0.0 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This section accidentally recurses and ends up trying to re-acquire a lock: > https://github.com/apache/arrow/blob/9c04867930eae5454dbb1ea4c7bd869b12fc6e9d/cpp/src/arrow/flight/client.cc#L215 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9670) [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client
[ https://issues.apache.org/jira/browse/ARROW-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9670: Assignee: Apache Arrow JIRA Bot (was: David Li) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9670) [C++][FlightRPC] Close()ing a DoPut with an ongoing read locks up the client
[ https://issues.apache.org/jira/browse/ARROW-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9670: -- Labels: pull-request-available (was: ) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9772) Optionally allow for to_pandas to return writeable pandas objects
Brandon B. Miller created ARROW-9772: Summary: Optionally allow for to_pandas to return writeable pandas objects Key: ARROW-9772 URL: https://issues.apache.org/jira/browse/ARROW-9772 Project: Apache Arrow Issue Type: New Feature Components: Python Affects Versions: 0.17.1 Reporter: Brandon B. Miller In cuDF, I'd like to leverage pyarrow to facilitate the conversion from cuDF series and dataframe objects into the equivalent pandas objects. Concretely I'd like something like this to work: `pandas_object = cudf_object.to_arrow().to_pandas()`. This allows us to stay consistent with the way the rest of the pyarrow ecosystem handles nulls, dtype conversions and the like without having to reinvent the wheel. However I noticed that in some zero copy scenarios, pyarrow doesn't seem to fully release the underlying buffers when converting `to_pandas()`. The resulting objects are immutable and if one tries to mutate the data they will encounter `ValueError: assignment destination is read-only` This creates a slightly strange situation where a user might encounter issues that subtly stem from the fact that arrow was used to construct the offending pandas object. It would be nice to be able to toggle this behavior using a kwarg or something similar. I suspect this could come up in other situations where libraries want to convert back and forth between equivalent python objects through arrow and expect the final object they get to behave as if it were constructed via other means. -- This message was sent by Atlassian Jira (v8.3.4#803005)
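The read-only-buffer behavior described above can be demonstrated with the stdlib alone (no pyarrow): a zero-copy view over immutable memory rejects writes, while an explicit copy is mutable. This is the kind of copy-vs-view toggle the issue asks `to_pandas()` to expose; the snippet is an analogy, not pyarrow's actual mechanism.

```python
# Zero-copy, read-only view over immutable bytes -- analogous to the
# read-only arrays a zero-copy to_pandas() can produce.
frozen = memoryview(b"\x01\x02\x03")
assert frozen.readonly  # writing through this view raises TypeError

# One explicit copy makes the data writeable again.
writable = bytearray(frozen)
writable[0] = 42
```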
[jira] [Commented] (ARROW-9518) [Python] Deprecate pyarrow serialization
[ https://issues.apache.org/jira/browse/ARROW-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179103#comment-17179103 ] Antoine Pitrou commented on ARROW-9518: --- I widened the topic for this issue, since PyArrow serialization is being obsoleted by pickle protocol 5; also, the main users of pyarrow.serialize (i.e. Ray) have stopped using it. > [Python] Deprecate pyarrow serialization > > > Key: ARROW-9518 > URL: https://issues.apache.org/jira/browse/ARROW-9518 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 2.0.0 > > > Per mailing list discussion -- This message was sent by Atlassian Jira (v8.3.4#803005)
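A minimal stdlib sketch of the pickle protocol 5 out-of-band buffer mechanism (PEP 574) that the comment says supersedes pyarrow.serialize: large buffers wrapped in `PickleBuffer` are handed to a `buffer_callback` instead of being copied into the pickle stream, and are reattached at load time.

```python
import pickle

# Wrap a large payload so protocol 5 can serialize it out-of-band.
payload = pickle.PickleBuffer(b"large binary column data")

buffers = []
data = pickle.dumps(payload, protocol=5, buffer_callback=buffers.append)

# The big buffer traveled separately in `buffers`; reattach it on load
# for a zero-copy round trip.
restored = pickle.loads(data, buffers=buffers)
```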
[jira] [Updated] (ARROW-9518) [Python] Deprecate pyarrow serialization
[ https://issues.apache.org/jira/browse/ARROW-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-9518: -- Summary: [Python] Deprecate pyarrow serialization (was: [Python] Deprecate Union-based serialization implemented by pyarrow.serialization) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds
[ https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179092#comment-17179092 ] Joris Van den Bossche commented on ARROW-9768: -- Sorry about the noise, another PR had a typo in the issue number, which led to this automatically being closed. > [Python] Pyarrow allows for unsafe conversions of datetime objects to > timestamp nanoseconds > --- > > Key: ARROW-9768 > URL: https://issues.apache.org/jira/browse/ARROW-9768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1, 1.0.0 > Environment: OS: MacOSX Catalina > Python Version: 3.7 >Reporter: Joshua Lay >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Hi, > In parquet, I want to store date values as timestamp format with nanoseconds > precision. This works fine with most dates except those past > pandas.Timestamp.max: > https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html > I was expecting some exception to be raised (like in Pandas), however this > did not happen and the value was processed incorrectly. Note that this is > with safe=True. Can this please be looked into? Thanks > Example code: > {{pa.array([datetime(2262,4,12)], type=pa.timestamp("ns"))}} > Return: > {{[ 1677-09-21 00:25:26.290448384 ]}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds
[ https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reopened ARROW-9768: -- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9771) [Rust] [DataFusion] Predicate Pushdown Improvement: treat predicates separated by AND separately
[ https://issues.apache.org/jira/browse/ARROW-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb updated ARROW-9771: --- Priority: Minor (was: Major) > [Rust] [DataFusion] Predicate Pushdown Improvement: treat predicates > separated by AND separately > > > Key: ARROW-9771 > URL: https://issues.apache.org/jira/browse/ARROW-9771 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Andrew Lamb >Priority: Minor > > As discussed by [~jorgecarleitao] and [~houqp] here: > https://github.com/apache/arrow/pull/7880#pullrequestreview-468057624 > If a predicate is a conjunction (aka AND'd together), each of the clauses can > be treated separately (e.g. a single filter expression {{A > 5 And B < 4}} > can be broken up, and each of {{A > 5}} and {{B < 4}} can potentially be > pushed down to different levels). > The filter pushdown logic works for the following case (when {{a}} and {{b}} > are separate selections, the predicate for {{a}} is pushed below the > {{Aggregate}} in the optimized plan): > {code} > Original plan: > Selection: #b GtEq Int64(1) > Selection: #a LtEq Int64(1) > Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]] > TableScan: test projection=None > Optimized plan: > Selection: #b GtEq Int64(1) > Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]] > Selection: #a LtEq Int64(1) > TableScan: test projection=None > {code} > But not for this, when {{a}} and {{b}} are {{AND}}'d together: > {code} > Original plan: > Selection: #a LtEq Int64(1) And #b GtEq Int64(1) > Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]] > TableScan: test projection=None > Optimized plan: > Selection: #a LtEq Int64(1) And #b GtEq Int64(1) > Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]] > TableScan: test projection=None > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9771) [Rust] [DataFusion] Predicate Pushdown Improvement: treat predicates separated by AND separately
Andrew Lamb created ARROW-9771: -- Summary: [Rust] [DataFusion] Predicate Pushdown Improvement: treat predicates separated by AND separately Key: ARROW-9771 URL: https://issues.apache.org/jira/browse/ARROW-9771 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb As discussed by [~jorgecarleitao] and [~houqp] here: https://github.com/apache/arrow/pull/7880#pullrequestreview-468057624 If a predicate is a conjunction (aka AND'd together), each of the clauses can be treated separately (e.g. a single filter expression {{A > 5 And B < 4}} can be broken up, and each of {{A > 5}} and {{B < 4}} can potentially be pushed down to different levels). The filter pushdown logic works for the following case (when {{a}} and {{b}} are separate selections, the predicate for {{a}} is pushed below the {{Aggregate}} in the optimized plan): {code} Original plan: Selection: #b GtEq Int64(1) Selection: #a LtEq Int64(1) Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]] TableScan: test projection=None Optimized plan: Selection: #b GtEq Int64(1) Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]] Selection: #a LtEq Int64(1) TableScan: test projection=None {code} But not for this, when {{a}} and {{b}} are {{AND}}'d together: {code} Original plan: Selection: #a LtEq Int64(1) And #b GtEq Int64(1) Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]] TableScan: test projection=None Optimized plan: Selection: #a LtEq Int64(1) And #b GtEq Int64(1) Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]] TableScan: test projection=None {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
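The proposed splitting step can be sketched over a toy expression tree (hypothetical tuples, not DataFusion's actual `Expr` type): recursively flatten nested AND nodes into a list of clauses, each of which can then be pushed down independently.

```python
def split_conjunction(expr):
    """Flatten nested ("AND", left, right) tuples into a list of clauses."""
    if isinstance(expr, tuple) and expr[0] == "AND":
        return split_conjunction(expr[1]) + split_conjunction(expr[2])
    return [expr]

# The issue's example predicate: #a LtEq Int64(1) And #b GtEq Int64(1)
pred = ("AND", ("<=", "a", 1), (">=", "b", 1))
```

`split_conjunction(pred)` returns the two clauses separately, so the optimizer could push the `a` clause below the Aggregate while keeping the `b` clause above it.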
[jira] [Created] (ARROW-9770) [Rust] [DataFusion] Add constant folding to expressions during logical planning
Andrew Lamb created ARROW-9770: -- Summary: [Rust] [DataFusion] Add constant folding to expressions during logical planning Key: ARROW-9770 URL: https://issues.apache.org/jira/browse/ARROW-9770 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb The high-level idea is that if an expression can be partially evaluated at planning time, then: 1. the execution time will be decreased, and 2. there may be additional optimizations possible (like removing entire LogicalPlan nodes, for example). I recently saw the following selection expression created (by the [predicate push down|https://github.com/apache/arrow/pull/7880]): {code} Selection: #a Eq Int64(1) And #b GtEq Int64(1) And #a LtEq Int64(1) And #a Eq Int64(1) And #b GtEq Int64(1) And #a LtEq Int64(1) TableScan: test projection=None {code} This could be simplified significantly: 1. Duplicate clauses could be removed (e.g. `#a Eq Int64(1) And #a Eq Int64(1)` --> `#a Eq Int64(1)`) 2. Algebraic simplification (e.g. `A <= 1 And A = 1` is the same as `A = 1`) Inspiration can be taken from the postgres code that evaluates constant expressions https://doxygen.postgresql.org/clauses_8c.html#ac91c4055a7eb3aa6f1bc104479464b28 (in this case, for example, if you have a predicate A=5 then you can basically substitute A=5 into any expression higher up in the plan). Other classic optimizations include things such as `A OR TRUE` --> `TRUE` and `A AND TRUE` --> `A`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
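Two of the simplifications listed above, sketched over the same toy tuple representation (hypothetical, not DataFusion's expression type): duplicate-clause removal (`A AND A` --> `A`) and boolean identity folding (`A AND TRUE` --> `A`).

```python
def simplify(clauses):
    """Deduplicate AND-ed clauses and drop literal TRUE terms."""
    seen, out = set(), []
    for c in clauses:
        if c == ("LIT", True):   # A AND TRUE --> A
            continue
        if c not in seen:        # A AND A --> A
            seen.add(c)
            out.append(c)
    return out
```

Applied to the issue's example, the six AND-ed clauses (each duplicated once) would collapse to three.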
[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179029#comment-17179029 ] Matthew Meen commented on ARROW-9744: - 1.0.0 behaves the same as 0.17.1: the value passed to cmake's -march argument ends up blank. > [Python] aarch64 Installation Error > --- > > Key: ARROW-9744 > URL: https://issues.apache.org/jira/browse/ARROW-9744 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1 > Environment: AWS m6g (ARM64 'Graviton2' CPU): > Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp > cpuid asimdrdm lrcpc dcpop asimddp ssbs > CPU implementer : 0x41 > CPU architecture: 8 > CPU variant : 0x3 > CPU part: 0xd0c > CPU revision: 1 > OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 > (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 > 10:03:03 UTC 2020 >Reporter: Matthew Meen >Priority: Major > Attachments: cmake-info.txt, pyarrow_017.txt > > > My team is attempting to migrate some workloads from x86-64 to ARM64; a > blocker for this is PyArrow failing to install. `pip install pyarrow` fails > to build the wheel as -march isn't correctly resolved: > {{ -- System processor: aarch64}} > {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}} > {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}} > {{ -- Arrow build warning level: PRODUCTION}} > {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}} > {{ Unsupported arch flag: -march=.}} > It's possible to get the build to work after editing > `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up > as an architecture such as 'armv8-a' - although some more elaborate logic is > really needed to pick up the correct extensions. 
> I can see that there have been a number of items discussed in the past both > on Jira and in GitHub issues ranging from simple fixes to the cmake script to > more elaborate fixes cross-product for arch detection - but I wasn't able to > discern how the project wishes to proceed. > With AWS pushing their ARM-based instances heavily at this point I would > advocate for picking a direction before an influx of new issues. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9495) [C++] Equality assertions don't handle Inf / -Inf properly
[ https://issues.apache.org/jira/browse/ARROW-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-9495: - Assignee: Liya Fan > [C++] Equality assertions don't handle Inf / -Inf properly > --- > > Key: ARROW-9495 > URL: https://issues.apache.org/jira/browse/ARROW-9495 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > I got this error when working on a PR which added unit tests: > {code} > ../src/arrow/testing/gtest_util.cc:101: Failure > Failed > Expected: > [ > 2.5, > inf, > -inf > ] > Actual: > [ > 2.5, > inf, > -inf > ] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9495) [C++] Equality assertions don't handle Inf / -Inf properly
[ https://issues.apache.org/jira/browse/ARROW-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9495. --- Resolution: Fixed Issue resolved by pull request 7826 [https://github.com/apache/arrow/pull/7826] > [C++] Equality assertions don't handle Inf / -Inf properly > --- > > Key: ARROW-9495 > URL: https://issues.apache.org/jira/browse/ARROW-9495 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > I got this error when working on a PR which added unit tests: > {code} > ../src/arrow/testing/gtest_util.cc:101: Failure > Failed > Expected: > [ > 2.5, > inf, > -inf > ] > Actual: > [ > 2.5, > inf, > -inf > ] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds
[ https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-9768. --- Resolution: Fixed Issue resolved by pull request 7980 [https://github.com/apache/arrow/pull/7980] > [Python] Pyarrow allows for unsafe conversions of datetime objects to > timestamp nanoseconds > --- > > Key: ARROW-9768 > URL: https://issues.apache.org/jira/browse/ARROW-9768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1, 1.0.0 > Environment: OS: MacOSX Catalina > Python Version: 3.7 >Reporter: Joshua Lay >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Hi, > In parquet, I want to store date values as timestamp format with nanoseconds > precision. This works fine with most dates except those past > pandas.Timestamp.max: > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html.|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html] > I was expecting some exception to be raised (like in Pandas), however this > did not happen and the value was processed incorrectly. Note that this is > with safe=True. Can this please be looked into? Thanks > {{Example Code:}} > {{pa.array([datetime(2262,4,12)], type=pa.timestamp("ns"))}} > \{{}} > {{Return:}} > {{[}} > \{{ 1677-09-21 00:25:26.290448384}} > {{]}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
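The overflow behind this bug is easy to demonstrate with the standard library: a nanosecond-precision timestamp is stored as a signed 64-bit count of nanoseconds since the epoch, so any datetime outside that range cannot be represented. The following is a hedged sketch of the bounds check that the reporter expected (it is the arithmetic behind the expected exception, not pyarrow's actual code):

```python
from datetime import datetime

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
EPOCH = datetime(1970, 1, 1)

def ns_since_epoch_checked(dt):
    """Convert a naive datetime to int64 nanoseconds, raising on overflow."""
    delta = dt - EPOCH
    ns = (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000
    if not (INT64_MIN <= ns <= INT64_MAX):
        raise OverflowError(f"{dt} does not fit in 64-bit nanoseconds")
    return ns

# 2262-04-11 fits (the int64 range ends at 2262-04-11 23:47:16.854775807,
# i.e. pandas.Timestamp.max), but 2262-04-12 does not -- which is why the
# silently wrapped 1677-09-21 value in the report is wrong.
ns_since_epoch_checked(datetime(2262, 4, 11))
```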
[jira] [Created] (ARROW-9769) [Python] Remove skip for in-memory fsspec in test_move_file
Krisztian Szucs created ARROW-9769: -- Summary: [Python] Remove skip for in-memory fsspec in test_move_file Key: ARROW-9769 URL: https://issues.apache.org/jira/browse/ARROW-9769 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs Fix For: 2.0.0 Follow-up of https://issues.apache.org/jira/browse/ARROW-9621 which should be applied once a new version of fsspec is going to be available. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9621) [Python] test_move_file() is failed with fsspec 0.8.0
[ https://issues.apache.org/jira/browse/ARROW-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9621: --- Issue Type: Bug (was: Improvement) > [Python] test_move_file() is failed with fsspec 0.8.0 > - > > Key: ARROW-9621 > URL: https://issues.apache.org/jira/browse/ARROW-9621 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Kouhei Sutou >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.1, 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > It works with fsspec 0.7.4: > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34414340/job/os9t8kj9t4afgym9 > Failed with fsspec 0.8.0: > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34422556/job/abedu9it26qvfxkm > {noformat} > == FAILURES > === > __ test_move_file[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))] > ___ > fs = > pathfn = . at 0x003D04F70B58> > def test_move_file(fs, pathfn): > s = pathfn('test-move-source-file') > t = pathfn('test-move-target-file') > > with fs.open_output_stream(s): > pass > > > fs.move(s, t) > pyarrow\tests\test_fs.py:798: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > pyarrow\_fs.pyx:519: in pyarrow._fs.FileSystem.move > check_status(self.fs.Move(source, destination)) > pyarrow\_fs.pyx:1024: in pyarrow._fs._cb_move > handler.move(frombytes(src), frombytes(dest)) > pyarrow\fs.py:199: in move > self.fs.mv(src, dest, recursive=True) > C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:744: in mv > self.copy(path1, path2, recursive=recursive, maxdepth=maxdepth) > C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:719: in copy > self.cp_file(p1, p2, **kwargs) > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > self = 0x003D01096A78> > path1 = 'test-move-source-file/', path2 = 'test-move-target-file/' > kwargs = 
{'maxdepth': None} > def cp_file(self, path1, path2, **kwargs): > if self.isfile(path1): > > self.store[path2] = MemoryFile(self, path2, > > self.store[path1].getbuffer()) > E KeyError: 'test-move-source-file/' > C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:134: > KeyError > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9621) [Python] test_move_file() is failed with fsspec 0.8.0
[ https://issues.apache.org/jira/browse/ARROW-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9621: --- Fix Version/s: 2.0.0 1.0.1 > [Python] test_move_file() is failed with fsspec 0.8.0 > - > > Key: ARROW-9621 > URL: https://issues.apache.org/jira/browse/ARROW-9621 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Kouhei Sutou >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.1, 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > It works with fsspec 0.7.4: > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34414340/job/os9t8kj9t4afgym9 > Failed with fsspec 0.8.0: > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34422556/job/abedu9it26qvfxkm > {noformat} > == FAILURES > === > __ test_move_file[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))] > ___ > fs = > pathfn = . at 0x003D04F70B58> > def test_move_file(fs, pathfn): > s = pathfn('test-move-source-file') > t = pathfn('test-move-target-file') > > with fs.open_output_stream(s): > pass > > > fs.move(s, t) > pyarrow\tests\test_fs.py:798: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > pyarrow\_fs.pyx:519: in pyarrow._fs.FileSystem.move > check_status(self.fs.Move(source, destination)) > pyarrow\_fs.pyx:1024: in pyarrow._fs._cb_move > handler.move(frombytes(src), frombytes(dest)) > pyarrow\fs.py:199: in move > self.fs.mv(src, dest, recursive=True) > C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:744: in mv > self.copy(path1, path2, recursive=recursive, maxdepth=maxdepth) > C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:719: in copy > self.cp_file(p1, p2, **kwargs) > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > self = 0x003D01096A78> > path1 = 'test-move-source-file/', path2 = 'test-move-target-file/' > kwargs = 
{'maxdepth': None} > def cp_file(self, path1, path2, **kwargs): > if self.isfile(path1): > > self.store[path2] = MemoryFile(self, path2, > > self.store[path1].getbuffer()) > E KeyError: 'test-move-source-file/' > C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:134: > KeyError > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9517) [C++][Python] Allow session_token argument when initializing S3FileSystem
[ https://issues.apache.org/jira/browse/ARROW-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-9517: - Assignee: Matthew Corley > [C++][Python] Allow session_token argument when initializing S3FileSystem > - > > Key: ARROW-9517 > URL: https://issues.apache.org/jira/browse/ARROW-9517 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.17.1 >Reporter: Matthew Corley >Assignee: Matthew Corley >Priority: Major > Labels: AWS, filesystem, pull-request-available, s3 > Fix For: 2.0.0 > > Time Spent: 6h 20m > Remaining Estimate: 0h > > In order to access S3 using temporary credentials (from STS), users must > supply a session token in addition to the usual access key and secret key. > However, currently, the S3FileSystem class only accepts access_key and > secret_key arguments. The only workaround is to provide the session token as > an environment variable, but this is not ideal for a variety of reasons. > This is a request to allow an optional session_token argument when > initializing the S3FileSystem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9517) [C++][Python] Allow session_token argument when initializing S3FileSystem
[ https://issues.apache.org/jira/browse/ARROW-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9517. --- Resolution: Fixed Issue resolved by pull request 7803 [https://github.com/apache/arrow/pull/7803] > [C++][Python] Allow session_token argument when initializing S3FileSystem > - > > Key: ARROW-9517 > URL: https://issues.apache.org/jira/browse/ARROW-9517 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.17.1 >Reporter: Matthew Corley >Priority: Major > Labels: AWS, filesystem, pull-request-available, s3 > Fix For: 2.0.0 > > Time Spent: 6h 10m > Remaining Estimate: 0h > > In order to access S3 using temporary credentials (from STS), users must > supply a session token in addition to the usual access key and secret key. > However, currently, the S3FileSystem class only accepts access_key and > secret_key arguments. The only workaround is to provide the session token as > an environment variable, but this is not ideal for a variety of reasons. > This is a request to allow an optional session_token argument when > initializing the S3FileSystem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-9633) [C++] Do not toggle memory mapping globally in LocalFileSystem
[ https://issues.apache.org/jira/browse/ARROW-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178980#comment-17178980 ] Wes McKinney edited comment on ARROW-9633 at 8/17/20, 1:37 PM: --- I mostly want to be sure that file formats that are sensitive to a file handle's performance characteristics (for example, Parquet files are highly sensitive to the latency of reads) are able to understand what they are getting so that they can choose to set other options to improve performance. For example: * Will read buffering (or pre-buffering) improve performance? * Is it OK to make blocking IO calls or should an IO call allow a CPU core to be made available to other threads for execution? * Do Read calls allocate memory? I'm all for abstraction/encapsulation where it makes sense but these issues can result in meaningful changes to the wall clock time of accessing data. I'm fine to take no action right now but if we want Arrow to be the gold standard for data access and the platform that people choose to build on we should be vigilant. was (Author: wesmckinn): I mostly want to be sure that file formats that are sensitive to a file handle's performance characteristics (for example, Parquet files are highly sensitive to the latency of reads) are able to understand what they are getting so that they can choose to set other options to improve performance. For example: * Will read buffering (or pre-buffering) to improve performance? * Is it OK to make blocking IO calls or should an IO call allow a CPU core to be made available to other threads for execution? * Do Read calls allocate memory? I'm all for abstraction/encapsulation where it makes sense but these issues can result in meaningful changes to the wall clock time of accessing data. I'm fine to take no action right now but if we want Arrow to be the gold standard for data access and the platform that people choose to build on we should be vigilant. 
> [C++] Do not toggle memory mapping globally in LocalFileSystem > -- > > Key: ARROW-9633 > URL: https://issues.apache.org/jira/browse/ARROW-9633 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 2.0.0 > > > In the context of the Datasets API, some file formats benefit greatly from > memory mapping (like Arrow IPC files) while other less so. Additionally, in > some scenarios, memory mapping could fail when used on network-attached > storage devices. Since a filesystem may be used to read different kinds of > files and use both memory mapping and non-memory mapping, and additionally > the Datasets API should be able to fall back on non-memory mapping if the > attempt to memory map fails, it would make sense to have a non-global option > for this: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h > I would suggest adding a new filesystem API with something like > {{OpenMappedInputFile}} with some options to control the behavior when memory > mapping is not possible. These options may be among: > * Falling back on a normal RandomAccessFile > * Reading the entire file into memory (or even tmpfs?) and then wrapping it > in a BufferReader > * Failing -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9633) [C++] Do not toggle memory mapping globally in LocalFileSystem
[ https://issues.apache.org/jira/browse/ARROW-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178980#comment-17178980 ] Wes McKinney commented on ARROW-9633: - I mostly want to be sure that file formats that are sensitive to a file handle's performance characteristics (for example, Parquet files are highly sensitive to the latency of reads) are able to understand what they are getting so that they can choose to set other options to improve performance. For example: * Will read buffering (or pre-buffering) to improve performance? * Is it OK to make blocking IO calls or should an IO call allow a CPU core to be made available to other threads for execution? * Do Read calls allocate memory? I'm all for abstraction/encapsulation where it makes sense but these issues can result in meaningful changes to the wall clock time of accessing data. I'm fine to take no action right now but if we want Arrow to be the gold standard for data access and the platform that people choose to build on we should be vigilant. > [C++] Do not toggle memory mapping globally in LocalFileSystem > -- > > Key: ARROW-9633 > URL: https://issues.apache.org/jira/browse/ARROW-9633 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 2.0.0 > > > In the context of the Datasets API, some file formats benefit greatly from > memory mapping (like Arrow IPC files) while other less so. Additionally, in > some scenarios, memory mapping could fail when used on network-attached > storage devices. 
Since a filesystem may be used to read different kinds of > files and use both memory mapping and non-memory mapping, and additionally > the Datasets API should be able to fall back on non-memory mapping if the > attempt to memory map fails, it would make sense to have a non-global option > for this: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h > I would suggest adding a new filesystem API with something like > {{OpenMappedInputFile}} with some options to control the behavior when memory > mapping is not possible. These options may be among: > * Falling back on a normal RandomAccessFile > * Reading the entire file into memory (or even tmpfs?) and then wrapping it > in a BufferReader > * Failing -- This message was sent by Atlassian Jira (v8.3.4#803005)
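The first fallback option suggested above (fall back on a normal file when mapping fails) can be illustrated with the standard library. The function name and behavior here are hypothetical, a sketch of the idea rather than Arrow's C++ API:

```python
import io
import mmap

def open_mapped_input_file(path):
    """Try to memory-map `path`; fall back to an in-memory buffer on failure."""
    f = open(path, "rb")
    try:
        # Mapping can fail on some filesystems (and on empty files); treat any
        # such failure as "mapping not possible" and fall back instead of erroring.
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        f.close()  # CPython's mmap duplicates the descriptor, so this is safe
        return m
    except (OSError, ValueError):
        data = f.read()
        f.close()
        return io.BytesIO(data)  # stand-in for wrapping in a BufferReader
```

Both return types support `read()`, so callers see a uniform interface regardless of which path was taken, which is the point of making the fallback a per-call option rather than a global toggle.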
[jira] [Updated] (ARROW-9402) [C++] Add portable wrappers for __builtin_add_overflow and friends
[ https://issues.apache.org/jira/browse/ARROW-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9402: --- Fix Version/s: 1.0.1 > [C++] Add portable wrappers for __builtin_add_overflow and friends > -- > > Key: ARROW-9402 > URL: https://issues.apache.org/jira/browse/ARROW-9402 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.1, 2.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9744) [Python] aarch64 Installation Error
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178962#comment-17178962 ] Matthew Meen commented on ARROW-9744: - There isn't a 1.0.0 .tar.gz on [https://pypi.org/simple/pyarrow/], so this fails to find the package: (test_env) ubuntu@ip-10-143-19-162:/usr/local/test_env$ sudo pip install pyarrow==1.0.0 ERROR: Could not find a version that satisfies the requirement pyarrow==1.0.0 (from versions: 0.9.0, 0.10.0, 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.17.1) ERROR: No matching distribution found for pyarrow==1.0.0 This is the full output for pip install pyarrow, which finds the latest as 0.17.1: [^pyarrow_017.txt] I'll try cloning and building 1.0.0 directly shortly. > [Python] aarch64 Installation Error > --- > > Key: ARROW-9744 > URL: https://issues.apache.org/jira/browse/ARROW-9744 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1 > Environment: AWS m6g (ARM64 'Graviton2' CPU): > Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp > cpuid asimdrdm lrcpc dcpop asimddp ssbs > CPU implementer : 0x41 > CPU architecture: 8 > CPU variant : 0x3 > CPU part: 0xd0c > CPU revision: 1 > OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 > (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 > 10:03:03 UTC 2020 >Reporter: Matthew Meen >Priority: Major > Attachments: cmake-info.txt, pyarrow_017.txt > > > My team is attempting to migrate some workloads from x86-64 to ARM64, a > blocker for this is PyArrow failing to install. 
`pip install pyarrow` fails > to build the wheel as -march isn't correctly resolved: > {{ -- System processor: aarch64}} > {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}} > {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}} > {{ -- Arrow build warning level: PRODUCTION}} > {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}} > {{ Unsupported arch flag: -march=.}} > It's possible to get the build to work after editing > `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up > as an architecture such as 'armv8-a' - although some more elaborate logic is > really needed to pick up the correct extensions. > I can see that there have been a number of items discussed in the past both > on Jira and in GitHub issues ranging from simple fixes to the cmake script to > more elaborate fixes cross-product for arch detection - but I wasn't able to > discern how the project wishes to proceed. > With AWS pushing their ARM-based instances heavily at this point I would > advocate for picking a direction before an influx of new issues. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9744) [Python] aarch64 Installation Error
[ https://issues.apache.org/jira/browse/ARROW-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Meen updated ARROW-9744: Attachment: pyarrow_017.txt > [Python] aarch64 Installation Error > --- > > Key: ARROW-9744 > URL: https://issues.apache.org/jira/browse/ARROW-9744 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1 > Environment: AWS m6g (ARM64 'Graviton2' CPU): > Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp > cpuid asimdrdm lrcpc dcpop asimddp ssbs > CPU implementer : 0x41 > CPU architecture: 8 > CPU variant : 0x3 > CPU part: 0xd0c > CPU revision: 1 > OS: Linux version 5.3.0-1032-aws (buildd@bos02-arm64-053) (gcc version 7.5.0 > (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #34~18.04.2-Ubuntu SMP Fri Jul 24 > 10:03:03 UTC 2020 >Reporter: Matthew Meen >Priority: Major > Attachments: cmake-info.txt, pyarrow_017.txt > > > My team is attempting to migrate some workloads from x86-64 to ARM64, a > blocker for this is PyArrow failing to install. `pip install pyarrow` fails > to build the wheel as -march isn't correctly resolved: > {{ -- System processor: aarch64}} > {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH}} > {{ -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed}} > {{ -- Arrow build warning level: PRODUCTION}} > {{ CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):}} > {{ Unsupported arch flag: -march=.}} > It's possible to get the build to work after editing > `cmake_modules/SetupCxxFlags.cmake` to force ARROW_ARMV8_ARCH_FLAG to end up > as an architecture such as 'armv8-a' - although some more elaborate logic is > really needed to pick up the correct extensions. > I can see that there have been a number of items discussed in the past both > on Jira and in GitHub issues ranging from simple fixes to the cmake script to > more elaborate fixes cross-product for arch detection - but I wasn't able to > discern how the project wishes to proceed. 
> With AWS pushing their ARM-based instances heavily at this point I would > advocate for picking a direction before an influx of new issues. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9672) [Python][Parquet] Expose _filters_to_expression
[ https://issues.apache.org/jira/browse/ARROW-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178948#comment-17178948 ] Krisztian Szucs commented on ARROW-9672: Since it's mostly about an API change which should be discouraged for patch releases I'm excluding it from 1.0.1. > [Python][Parquet] Expose _filters_to_expression > --- > > Key: ARROW-9672 > URL: https://issues.apache.org/jira/browse/ARROW-9672 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Reporter: Caleb Winston >Priority: Trivial > Fix For: 1.0.1 > > > `_filters_to_expression` converts filters expressed in disjunctive normal > form (DNF) to `dataset.Expression`. Can `_filters_to_expression` be added to > the public API? -- This message was sent by Atlassian Jira (v8.3.4#803005)
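For readers unfamiliar with the convention the helper uses: DNF filters are a list of OR-ed groups, each group a list of AND-ed `(column, op, value)` triples. A hedged sketch of the conversion idea follows; it renders a string rather than a real `dataset.Expression`, and it is not pyarrow's actual `_filters_to_expression` implementation:

```python
# Illustrative only: turn DNF filters into an "OR of ANDs" expression string.

def filters_to_expression(filters):
    """filters: [[(col, op, value), ...], ...] -- outer list OR-ed,
    inner lists AND-ed."""
    def conjunction(group):
        return " and ".join(f"({col} {op} {value!r})" for col, op, value in group)
    return " or ".join(f"({conjunction(group)})" for group in filters)

# (a == 1 AND b > 2) OR (a == 5):
expr = filters_to_expression([[("a", "==", 1), ("b", ">", 2)], [("a", "==", 5)]])
```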
[jira] [Updated] (ARROW-9672) [Python][Parquet] Expose _filters_to_expression
[ https://issues.apache.org/jira/browse/ARROW-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9672: --- Fix Version/s: (was: 1.0.1) 2.0.0 > [Python][Parquet] Expose _filters_to_expression > --- > > Key: ARROW-9672 > URL: https://issues.apache.org/jira/browse/ARROW-9672 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Reporter: Caleb Winston >Priority: Trivial > Fix For: 2.0.0 > > > `_filters_to_expression` converts filters expressed in disjunctive normal > form (DNF) to `dataset.Expression`. Can `_filters_to_expression` be added to > the public API? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds
[ https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9768: -- Labels: pull-request-available (was: ) > [Python] Pyarrow allows for unsafe conversions of datetime objects to > timestamp nanoseconds > --- > > Key: ARROW-9768 > URL: https://issues.apache.org/jira/browse/ARROW-9768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1, 1.0.0 > Environment: OS: MacOSX Catalina > Python Version: 3.7 >Reporter: Joshua Lay >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Hi, > In parquet, I want to store date values as timestamp format with nanoseconds > precision. This works fine with most dates except those past > pandas.Timestamp.max: > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html.|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html] > I was expecting some exception to be raised (like in Pandas), however this > did not happen and the value was processed incorrectly. Note that this is > with safe=True. Can this please be looked into? Thanks > {{Example Code:}} > {{pa.array([datetime(2262,4,12)], type=pa.timestamp("ns"))}} > \{{}} > {{Return:}} > {{[}} > \{{ 1677-09-21 00:25:26.290448384}} > {{]}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9716) [Rust] [DataFusion] MergeExec should have concurrency limit
[ https://issues.apache.org/jira/browse/ARROW-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9716: --- Fix Version/s: (was: 1.0.1) > [Rust] [DataFusion] MergeExec should have concurrency limit > > > Key: ARROW-9716 > URL: https://issues.apache.org/jira/browse/ARROW-9716 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > MergeExec currently spins up one thread per input partition which causes apps > to effectively hang if there are substantially more partitions than available > cores. > We can implement a configurable limit here pretty easily. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-9716) [Rust] [DataFusion] MergeExec should have concurrency limit
[ https://issues.apache.org/jira/browse/ARROW-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178941#comment-17178941 ] Krisztian Szucs edited comment on ARROW-9716 at 8/17/20, 12:06 PM: --- Depends on backward incompatible improvement https://github.com/apache/arrow/pull/7958 and https://github.com/apache/arrow/pull/7951 which also depends on the previous dependency, so I'm removing it from 1.0.1 patch release. was (Author: kszucs): Depends on backward incompatible improvement https://github.com/apache/arrow/pull/7958 and https://github.com/apache/arrow/pull/7951 which also depends on the previous dependency. > [Rust] [DataFusion] MergeExec should have concurrency limit > > > Key: ARROW-9716 > URL: https://issues.apache.org/jira/browse/ARROW-9716 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.1, 2.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > MergeExec currently spins up one thread per input partition which causes apps > to effectively hang if there are substantially more partitions than available > cores. > We can implement a configurable limit here pretty easily. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9716) [Rust] [DataFusion] MergeExec should have concurrency limit
[ https://issues.apache.org/jira/browse/ARROW-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178941#comment-17178941 ] Krisztian Szucs commented on ARROW-9716: Depends on backward incompatible improvement https://github.com/apache/arrow/pull/7958 and https://github.com/apache/arrow/pull/7951 which also depends on the previous dependency. > [Rust] [DataFusion] MergeExec should have concurrency limit > > > Key: ARROW-9716 > URL: https://issues.apache.org/jira/browse/ARROW-9716 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.1, 2.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > MergeExec currently spins up one thread per input partition which causes apps > to effectively hang if there are substantially more partitions than available > cores. > We can implement a configurable limit here pretty easily. -- This message was sent by Atlassian Jira (v8.3.4#803005)
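The fix described for ARROW-9716 is a configurable cap on how many partition readers run at once, which can be illustrated with a semaphore. This is a hedged Python sketch of the idea, not the Rust MergeExec code:

```python
import threading

def merge_partitions(partitions, read_partition, max_concurrency=4):
    """Read all partitions on threads, allowing at most `max_concurrency`
    reads to execute at the same time, then merge results in order."""
    limit = threading.Semaphore(max_concurrency)
    results = [None] * len(partitions)

    def worker(index, part):
        with limit:  # blocks once `max_concurrency` reads are in flight
            results[index] = read_partition(part)

    threads = [threading.Thread(target=worker, args=(i, p))
               for i, p in enumerate(partitions)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [row for part in results for row in part]
```

Note this sketch still spawns one thread per partition and only bounds how many do work concurrently; a fuller fix would hand partitions to a fixed-size thread pool so thread count itself stays bounded.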
[jira] [Resolved] (ARROW-9732) [Rust] [DataFusion] Add "Physical Planner" type thing which can do optimizations
[ https://issues.apache.org/jira/browse/ARROW-9732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-9732. Resolution: Fixed Dupe of ARROW-9758, fixed by [~andygrove] > [Rust] [DataFusion] Add "Physical Planner" type thing which can do > optimizations > > > Key: ARROW-9732 > URL: https://issues.apache.org/jira/browse/ARROW-9732 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Andrew Lamb >Priority: Major > > [~andygrove] implemented what I would describe as a "limit pushdown" > optimization within Limit here: > https://github.com/apache/arrow/pull/7958#discussion_r470175966 > However, it was implemented by directly instantiating Partition objects > during plan execution. This "pick the top N from each partition and then pick > the top N from the merged result" is an example of operator pushdown that > could be done at planning time. > This ticket tracks the work to add some way to represent this in the planning > stage, rather than at execution, in order to open up more optimization > opportunities. > One example of pushdown that could potentially happen at planning time would > be pushing the limit down past Projections.
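The "pick the top N from each partition and then pick the top N from the merged result" rewrite described above can be sketched with plain Python lists standing in for partitions (hypothetical helper name, not DataFusion code):

```python
def pushed_down_limit(partitions, n):
    # Planning-time rewrite: apply the limit inside each partition first,
    # so at most n rows per partition cross the merge boundary...
    pruned = [rows[:n] for rows in partitions]
    # ...then apply the same limit once more after merging.
    merged = []
    for rows in pruned:
        merged.extend(rows)
    return merged[:n]
```

The result is identical to limiting after a full merge, but each partition ships at most n rows to the merge step instead of its entire output.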
[jira] [Commented] (ARROW-9714) [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort
[ https://issues.apache.org/jira/browse/ARROW-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178933#comment-17178933 ] Krisztian Szucs commented on ARROW-9714: It heavily depends on https://github.com/apache/arrow/pull/7833 which was not part of 1.0 release, so removing it from 1.0.1 patch release. > [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort > -- > > Key: ARROW-9714 > URL: https://issues.apache.org/jira/browse/ARROW-9714 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.1, 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > TypeCoercionRule not implemented for Limit or Sort, causing TPC-H query 1 to > fail. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9714) [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort
[ https://issues.apache.org/jira/browse/ARROW-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9714: --- Fix Version/s: (was: 1.0.1) > [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort > -- > > Key: ARROW-9714 > URL: https://issues.apache.org/jira/browse/ARROW-9714 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > TypeCoercionRule not implemented for Limit or Sort, causing TPC-H query 1 to > fail. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9556) [Python][C++] Segfaults in UnionArray with null values
[ https://issues.apache.org/jira/browse/ARROW-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-9556. Resolution: Fixed Issue resolved by pull request 7952 [https://github.com/apache/arrow/pull/7952] > [Python][C++] Segfaults in UnionArray with null values > -- > > Key: ARROW-9556 > URL: https://issues.apache.org/jira/browse/ARROW-9556 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0 > Environment: Conda, but pyarrow was installed using pip (in the conda > environment) >Reporter: Jim Pivarski >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0, 1.0.1 > > Time Spent: 50m > Remaining Estimate: 0h > > Extracting null values from a UnionArray containing nulls and constructing a > UnionArray with a bitmask in pyarrow.Array.from_buffers causes segfaults in > pyarrow 1.0.0. I have an environment with pyarrow 0.17.0 and all of the > following run correctly without segfaults in the older version. 
> Here's a UnionArray that works (because there are no nulls): > > {code:java} > # GOOD > a = pyarrow.UnionArray.from_sparse( > pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()), > [ > pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]), > pyarrow.array([True, True, False, True, False]), > ], > ) > a.to_pylist(){code} > > Here's one that fails when you try a.to_pylist() or even just a[2], because > one of the children has a null at 2: > > {code:java} > # SEGFAULT > a = pyarrow.UnionArray.from_sparse( > pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()), > [ > pyarrow.array([0.0, 1.1, None, 3.3, 4.4]), > pyarrow.array([True, True, False, True, False]), > ], > ) > a.to_pylist() # also just a[2] causes a segfault{code} > > Here's another that fails because both children have nulls; the segfault > occurs at both positions with nulls: > > {code:java} > # SEGFAULT > a = pyarrow.UnionArray.from_sparse( > pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()), > [ > pyarrow.array([0.0, 1.1, None, 3.3, 4.4]), > pyarrow.array([True, None, False, True, False]), > ], > ) > a.to_pylist() # also a[1] and a[2] cause segfaults{code} > > Here's one that succeeds, but it's dense, rather than sparse: > > {code:java} > # GOOD > a = pyarrow.UnionArray.from_dense( > pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()), > pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()), > [pyarrow.array([0.0, 1.1, 2.2, 3.3]), pyarrow.array([True, True, False])], > ) > a.to_pylist(){code} > > Here's a dense one that fails because one child has a null: > > {code:java} > # SEGFAULT > a = pyarrow.UnionArray.from_dense( > pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()), > pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()), > [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, True, False])], > ) > a.to_pylist() # also just a[3] causes a segfault{code} > > Here's a dense one that fails in two positions because both children have a null: > > {code:java} > # SEGFAULT > a = 
pyarrow.UnionArray.from_dense( > pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()), > pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()), > [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, None, False])], > ) > a.to_pylist() # also a[3] and a[5] cause segfaults{code} > > In all of the above, we created the UnionArray using its from_dense method. > We could instead create it with pyarrow.Array.from_buffers. If created with > content0 and content1 that have no nulls, it's fine, but if created with > nulls in the content, it segfaults as soon as you view the null value. > > {code:java} > # GOOD > content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]) > content1 = pyarrow.array([True, True, False, True, False]) > # SEGFAULT > content0 = pyarrow.array([0.0, 1.1, 2.2, None, 4.4]) > content1 = pyarrow.array([True, True, False, True, False]) > types = pyarrow.union( > [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)], > "sparse", > [0, 1], > ) > a = pyarrow.Array.from_buffers( > types, > 5, > [ > None, > pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)), > ], > children=[content0, content1], > ) > a.to_pylist() # also just a[3] causes a segfault{code} > > Similarly for a dense union. > > {code:java} > # GOOD > content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3]) > content1 = pyarrow.array([True, True, False]) > # SEGFAULT > content0 = pyarrow.array([0.0, 1.1, None, 3.3]) > content1 = pyarrow.array([True, True, False]) > types = pyarrow.union( > [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)], > "dense", > [0, 1], > ) > a = pyarrow.Array.from_buffers( > types, >
[jira] [Commented] (ARROW-9633) [C++] Do not toggle memory mapping globally in LocalFileSystem
[ https://issues.apache.org/jira/browse/ARROW-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178875#comment-17178875 ] Antoine Pitrou commented on ARROW-9633: --- My concern is that memory-mapping is an optimization specific to local filesystem files, and it would burden the generic API with those optimization details. Did you encounter a use case where the current API produces detrimental results? Or where the proposed change (attempt to memory-map and then fall back to regular reading) would? > [C++] Do not toggle memory mapping globally in LocalFileSystem > -- > > Key: ARROW-9633 > URL: https://issues.apache.org/jira/browse/ARROW-9633 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 2.0.0 > > > In the context of the Datasets API, some file formats benefit greatly from > memory mapping (like Arrow IPC files) while others less so. Additionally, in > some scenarios, memory mapping could fail when used on network-attached > storage devices. Since a filesystem may be used to read different kinds of > files and use both memory mapping and non-memory mapping, and additionally > the Datasets API should be able to fall back on non-memory mapping if the > attempt to memory map fails, it would make sense to have a non-global option > for this: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h > I would suggest adding a new filesystem API with something like > {{OpenMappedInputFile}} with some options to control the behavior when memory > mapping is not possible. These options may be among: > * Falling back on a normal RandomAccessFile > * Reading the entire file into memory (or even tmpfs?) and then wrapping it > in a BufferReader > * Failing
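The "attempt to memory-map and then fall back to regular reading" behavior under discussion can be illustrated with Python's stdlib (a sketch only; the proposed {{OpenMappedInputFile}} C++ API and its option names are still hypothetical):

```python
import mmap
import os

def open_mapped_input_file(path, fallback_to_regular=True):
    # Try to memory-map; fall back to an ordinary file handle when the
    # mapping fails (e.g. some network-attached filesystems, empty files).
    f = open(path, "rb")
    if os.path.getsize(path) > 0:
        try:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            f.close()  # the mapping stays valid after the fd is closed
            return m
        except (OSError, ValueError):
            if not fallback_to_regular:
                f.close()
                raise
    return f
```

Both return types are file-like for reading, so callers need not care whether the mapping succeeded, which mirrors the non-global, per-open choice the issue argues for.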
[jira] [Assigned] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-1231: - Assignee: (was: Antoine Pitrou) > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds
[ https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178785#comment-17178785 ] Joris Van den Bossche commented on ARROW-9768: -- [~Joshual] thanks for the report! We should indeed ensure that this raises. On casting we already check for this and raise appropriately: {code} In [13]: pa.array(np.array([datetime(2262,4,12)])).cast(pa.timestamp('ns')) ... ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 92233728 {code} but this should also be done in the typed array converter. > [Python] Pyarrow allows for unsafe conversions of datetime objects to > timestamp nanoseconds > --- > > Key: ARROW-9768 > URL: https://issues.apache.org/jira/browse/ARROW-9768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1, 1.0.0 > Environment: OS: MacOSX Catalina > Python Version: 3.7 >Reporter: Joshua Lay >Priority: Minor > Fix For: 2.0.0 > > > Hi, > In parquet, I want to store date values as timestamp format with nanoseconds > precision. This works fine with most dates except those past > pandas.Timestamp.max: > https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html > I was expecting some exception to be raised (like in Pandas), however this > did not happen and the value was processed incorrectly. Note that this is > with safe=True. Can this please be looked into? Thanks > Example Code: > {{pa.array([datetime(2262,4,12)], type=pa.timestamp("ns"))}} > Return: > {{[ 1677-09-21 00:25:26.290448384 ]}}
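The overflow behind this report can be checked with stdlib integer arithmetic alone ({{epoch_ns}} is a hypothetical helper, not pyarrow API): timestamp[ns] stores int64 nanoseconds since the epoch, the largest representable instant is 2262-04-11 23:47:16.854775807, and an unchecked conversion of 2262-04-12 wraps around into 1677.

```python
from datetime import datetime

INT64_MAX = 2**63 - 1  # timestamp[ns] is an int64 nanosecond count

def epoch_ns(dt):
    # Exact integer nanoseconds since 1970-01-01 (whole seconds only)
    days = dt.toordinal() - datetime(1970, 1, 1).toordinal()
    seconds = days * 86400 + dt.hour * 3600 + dt.minute * 60 + dt.second
    return seconds * 10**9

# 2262-04-11 still fits in int64 nanoseconds; 2262-04-12 does not,
# which is why a safe conversion must raise instead of wrapping.
assert epoch_ns(datetime(2262, 4, 11)) <= INT64_MAX
assert epoch_ns(datetime(2262, 4, 12)) > INT64_MAX
```

Interpreting the overflowed value modulo 2**64 as a signed int64 lands in September 1677, consistent with the 1677-09-21 result the reporter observed.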
[jira] [Updated] (ARROW-9768) [Python] Pyarrow allows for unsafe conversions of datetime objects to timestamp nanoseconds
[ https://issues.apache.org/jira/browse/ARROW-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-9768: - Fix Version/s: 2.0.0 > [Python] Pyarrow allows for unsafe conversions of datetime objects to > timestamp nanoseconds > --- > > Key: ARROW-9768 > URL: https://issues.apache.org/jira/browse/ARROW-9768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1, 1.0.0 > Environment: OS: MacOSX Catalina > Python Version: 3.7 >Reporter: Joshua Lay >Priority: Minor > Fix For: 2.0.0 > > > Hi, > In parquet, I want to store date values as timestamp format with nanoseconds > precision. This works fine with most dates except those past > pandas.Timestamp.max: > https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.max.html > I was expecting some exception to be raised (like in Pandas), however this > did not happen and the value was processed incorrectly. Note that this is > with safe=True. Can this please be looked into? Thanks > Example Code: > {{pa.array([datetime(2262,4,12)], type=pa.timestamp("ns"))}} > Return: > {{[ 1677-09-21 00:25:26.290448384 ]}}
[jira] [Updated] (ARROW-9672) [Python][Parquet] Expose _filters_to_expression
[ https://issues.apache.org/jira/browse/ARROW-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-9672: - Fix Version/s: 1.0.1 > [Python][Parquet] Expose _filters_to_expression > --- > > Key: ARROW-9672 > URL: https://issues.apache.org/jira/browse/ARROW-9672 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Reporter: Caleb Winston >Priority: Trivial > Fix For: 1.0.1 > > > `_filters_to_expression` converts filters expressed in disjunctive normal > form (DNF) to `dataset.Expression`. Can `_filters_to_expression` be added to > the public API? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9672) [Python][Parquet] Expose _filters_to_expression
[ https://issues.apache.org/jira/browse/ARROW-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178775#comment-17178775 ] Joris Van den Bossche commented on ARROW-9672: -- I think the clear disclaimer of such a function would be that the returned expression is _only_ to be used to pass to one of the pyarrow.dataset functions as {{filter}} argument. And so when we have a more general expression API, this function should also be updated to return this new expression type, so that it keeps working for pyarrow.dataset. _If_ that is the only case for which the function would be used, I don't think there is any risk in increasing the surface area. Alternatively, we could also accept the DNF-like lists of tuples in the pyarrow.dataset functions and methods, so that external projects like dask and cudf don't have to convert this to a pyarrow Expression themselves. We decided against it (not wanting to expand support for DNF-like nested lists), but doing this would actually decrease the exposure of the current dataset-specific expressions, as external project would not need to create them to be able to use the filtering functionality. > [Python][Parquet] Expose _filters_to_expression > --- > > Key: ARROW-9672 > URL: https://issues.apache.org/jira/browse/ARROW-9672 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Reporter: Caleb Winston >Priority: Trivial > > `_filters_to_expression` converts filters expressed in disjunctive normal > form (DNF) to `dataset.Expression`. Can `_filters_to_expression` be added to > the public API? -- This message was sent by Atlassian Jira (v8.3.4#803005)
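For reference, the DNF structure that {{_filters_to_expression}} consumes is a list of conjunctions of {{(column, op, value)}} tuples, OR-ed together. A minimal stdlib evaluator of those semantics (illustration only, not pyarrow or dask code):

```python
import operator

OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "in": lambda value, collection: value in collection,
}

def matches(row, dnf_filters):
    # Outer list: OR of conjunctions; inner list: AND of predicates.
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in dnf_filters
    )
```

For example, [[("a", ">", 1), ("b", "==", "x")], [("c", "in", {5, 7})]] reads as (a > 1 AND b == 'x') OR c IN (5, 7), which is what external projects like dask currently hand to the private helper for conversion into a dataset Expression.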
[jira] [Commented] (ARROW-9755) pyarrow deserialize return datetime.datetime
[ https://issues.apache.org/jira/browse/ARROW-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178770#comment-17178770 ] Joris Van den Bossche commented on ARROW-9755: -- Can you show a code example that reproduces your issue? Also, what did it return in 0.17.1? > pyarrow deserialize return datetime.datetime > > > Key: ARROW-9755 > URL: https://issues.apache.org/jira/browse/ARROW-9755 > Project: Apache Arrow > Issue Type: Bug >Reporter: Ruotian Luo >Priority: Major > > With latest pyarrow 1.0, pyarrow deserialize return datetime.datetime. Was > fine with 0.17.1 >
[jira] [Updated] (ARROW-9755) [Python] pyarrow deserialize return datetime.datetime
[ https://issues.apache.org/jira/browse/ARROW-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-9755: - Summary: [Python] pyarrow deserialize return datetime.datetime (was: pyarrow deserialize return datetime.datetime) > [Python] pyarrow deserialize return datetime.datetime > - > > Key: ARROW-9755 > URL: https://issues.apache.org/jira/browse/ARROW-9755 > Project: Apache Arrow > Issue Type: Bug >Reporter: Ruotian Luo >Priority: Major > > With latest pyarrow 1.0, pyarrow deserialize return datetime.datetime. Was > fine with 0.17.1 >
[jira] [Commented] (ARROW-9766) [C++][Parquet] Add EngineVersion to properties to allow for toggling new vs old logic
[ https://issues.apache.org/jira/browse/ARROW-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178766#comment-17178766 ] Joris Van den Bossche commented on ARROW-9766: -- Should this be added to the 1.0.1 milestone? > [C++][Parquet] Add EngineVersion to properties to allow for toggling new vs > old logic > - > > Key: ARROW-9766 > URL: https://issues.apache.org/jira/browse/ARROW-9766 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > This will provide an escape hatch in case the new logic some how has > unuseable bugs in it. -- This message was sent by Atlassian Jira (v8.3.4#803005)