[jira] [Created] (ARROW-16623) [GLib] Add support for QuantileOptions
Hirokazu SUZUKI created ARROW-16623:
------------------------------------

             Summary: [GLib] Add support for QuantileOptions
                 Key: ARROW-16623
                 URL: https://issues.apache.org/jira/browse/ARROW-16623
             Project: Apache Arrow
          Issue Type: Improvement
          Components: GLib
            Reporter: Hirokazu SUZUKI

No options are available for quantile in Ruby; this requires a GLib implementation.
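For reference, the options this ticket would expose are already reachable from the Python bindings; a minimal sketch using {{pyarrow.compute.quantile}} (not the proposed GLib/Ruby API) showing the knobs that would need to be mirrored:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1.0, 2.0, 3.0, 4.0])

# QuantileOptions carries the probability levels (q) and the interpolation
# rule; these are the settings currently unreachable from Ruby without
# GLib support.
result = pc.quantile(arr, q=[0.25, 0.75], interpolation="linear")
print(result)  # -> [1.75, 3.25]
{code}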
[jira] [Created] (ARROW-16622) [R] Detect compression by magic number where possible
Sam Albers created ARROW-16622:
-------------------------------

             Summary: [R] Detect compression by magic number where possible
                 Key: ARROW-16622
                 URL: https://issues.apache.org/jira/browse/ARROW-16622
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 8.0.0
            Reporter: Sam Albers

readr does this like so: [https://github.com/tidyverse/readr/commit/3e1195762a204fd053ba0d7d88ed3e80a1810510]

I think we could largely take that same approach, at least as a first pass at detection, perhaps falling back on file extensions. That would be a little safer. This would apply to the read functions, though, if it is not too expensive, we could also check outputs after a successful write.
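A hedged sketch of the readr-style approach, in Python for illustration: sniff the leading bytes against known signatures and fall back on the file extension. The {{detect_compression}} helper and its signature table are illustrative, not an existing Arrow API:

{code:python}
import os

# Well-known magic numbers for the codecs Arrow commonly reads.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"\xfd7zXZ\x00": "xz",
    b"\x28\xb5\x2f\xfd": "zstd",
}

def detect_compression(path):
    # First pass: compare the file header against known signatures.
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, codec in MAGIC.items():
        if head.startswith(magic):
            return codec
    # Fallback: trust the extension when no signature matches.
    ext = os.path.splitext(path)[1].lstrip(".")
    return {"gz": "gzip", "bz2": "bz2", "xz": "xz", "zst": "zstd"}.get(ext)
{code}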
[jira] [Created] (ARROW-16621) [C++][Python] Read ACID Hive tables
Ian Alexander Joiner created ARROW-16621:
-----------------------------------------

             Summary: [C++][Python] Read ACID Hive tables
                 Key: ARROW-16621
                 URL: https://issues.apache.org/jira/browse/ARROW-16621
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Ian Alexander Joiner

Since Hive 3 produces ACID tables by default, it would be great if Arrow could read them directly from HDFS.
[jira] [Created] (ARROW-16620) open_dataset fails to open single compressed csv
Carl Boettiger created ARROW-16620:
-----------------------------------

             Summary: open_dataset fails to open single compressed csv
                 Key: ARROW-16620
                 URL: https://issues.apache.org/jira/browse/ARROW-16620
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Carl Boettiger

The following fails:

{code:java}
bucket <- s3_bucket("targets/aquatics", endpoint_override="data.ecoforecast.org")
x <- open_dataset(bucket$path("aquatics-targets.csv.gz"), format="csv")
{code}

This is surprising, since pointing to an individual parquet file path works fine:

{code:java}
bucket <- s3_bucket("scores/parquet/aquatics/2022", endpoint_override="data.ecoforecast.org")
x <- open_dataset(bucket$path("aquatics-2022-05-18-climatology.parquet"))
{code}

Maybe related to the discussion in https://issues.apache.org/jira/browse/ARROW-15060, or maybe not? In this context I'm thinking only about read. The above examples use public buckets, so this should be reproducible with no credentials.
[jira] [Created] (ARROW-16619) read_csv_arrow / open_dataset over https connection?
Carl Boettiger created ARROW-16619:
-----------------------------------

             Summary: read_csv_arrow / open_dataset over https connection?
                 Key: ARROW-16619
                 URL: https://issues.apache.org/jira/browse/ARROW-16619
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Carl Boettiger

Currently, remote access to data (particularly lazy read, an immensely powerful Arrow ability) only works for data in an S3-compliant object store (though I know Azure support is in the works). It would be really fantastic if we could have remote access over HTTPS (I think this already works on the Python side thanks to fsspec). For example, this fails in arrow but works in readr:

arrow::read_csv_arrow("https://data.ecoforecast.org/targets/aquatics/aquatics-targets.csv.gz")
readr::read_csv("https://data.ecoforecast.org/targets/aquatics/aquatics-targets.csv.gz")

I think this ability would be even more compelling in `open_dataset()`, since it opens up all the power of lazy read access. Most servers support curl range requests, so it seems this should be possible. (We can already do something similar from duckdb+R, but only after manually opting in to the http extension, and only for parquet.)
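For comparison, a minimal sketch of what already works on the Python side via fsspec, using the public URL from the examples above:

{code:python}
import fsspec
import pyarrow.csv as csv

url = "https://data.ecoforecast.org/targets/aquatics/aquatics-targets.csv.gz"

# fsspec's HTTP filesystem provides the file-like object; pyarrow.csv.read_csv
# accepts any file-like input, so no S3-specific filesystem is needed.
with fsspec.open(url, "rb", compression="gzip") as f:
    table = csv.read_csv(f)

print(table.num_rows)
{code}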
[jira] [Created] (ARROW-16618) [C++][Python] strptime fails to parse with %p on Windows
Rok Mihevc created ARROW-16618:
-------------------------------

             Summary: [C++][Python] strptime fails to parse with %p on Windows
                 Key: ARROW-16618
                 URL: https://issues.apache.org/jira/browse/ARROW-16618
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
            Reporter: Rok Mihevc

As reported in https://github.com/apache/arrow/issues/13111, parsing a timestamp with %p fails on Windows. This is probably due to issues with the vendored strptime and Windows locales. We should explore which flags can be enabled and how. The strptime test suite should also be expanded: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L1842-L1890
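A minimal reproduction through the Python bindings; per the linked issue, this is expected to succeed on Linux/macOS and fail on Windows builds that rely on the vendored strptime:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

times = pa.array(["01:30 PM", "11:05 AM"])

# %I is the 12-hour clock field, %p the AM/PM designator that triggers
# the failure on Windows.
parsed = pc.strptime(times, format="%I:%M %p", unit="s")
print(parsed)
{code}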
[jira] [Created] (ARROW-16617) [C++] WinErrorMessage() should not use Windows ANSI APIs
Antoine Pitrou created ARROW-16617:
-----------------------------------

             Summary: [C++] WinErrorMessage() should not use Windows ANSI APIs
                 Key: ARROW-16617
                 URL: https://issues.apache.org/jira/browse/ARROW-16617
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Antoine Pitrou

The {{WinErrorMessage}} utility function calls {{FormatMessageA}} in order to get the Windows error message. Unfortunately, this returns the message encoded using the current "codepage", which can give unreadable results if it contains non-ASCII characters. Instead, we should probably use {{FormatMessageW}} and then convert to UTF-8. At the least, {{PyArrow}} expects the error message in a {{Status}} to be UTF-8-encoded.
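Not the C++ fix itself, but a small Python sketch of the failure mode being described: bytes produced in a legacy codepage (cp932 here, as an assumed example of a Japanese locale) become mojibake once a consumer assumes UTF-8:

{code:python}
# A typical localized Windows error message ("The specified file was not found.")
msg = "指定されたファイルが見つかりません。"

# What FormatMessageA would hand back under a Japanese ANSI codepage.
ansi_bytes = msg.encode("cp932")

# A consumer that assumes UTF-8 (as PyArrow does for Status messages)
# sees unreadable garbage...
print(ansi_bytes.decode("utf-8", errors="replace"))

# ...while the bytes only round-trip if you know the original codepage.
print(ansi_bytes.decode("cp932"))
{code}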
[jira] [Created] (ARROW-16616) [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method
Alessandro Molina created ARROW-16616:
--------------------------------------

             Summary: [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method
                 Key: ARROW-16616
                 URL: https://issues.apache.org/jira/browse/ARROW-16616
             Project: Apache Arrow
          Issue Type: Sub-task
          Components: Python
            Reporter: Alessandro Molina
             Fix For: 9.0.0

To keep the {{Dataset}} API compatible with the {{Table}} one in terms of analytics capabilities, we should add a {{Dataset.filter}} method. The initial POC was based on {{_table_filter}}, but that required materialising all of the {{Dataset}} content after filtering, as it returned an {{InMemoryDataset}}. Given that {{Scanner}} can filter a dataset without actually materialising the data until a final step happens, it would be good for {{Dataset.filter}} to return some form of lazy dataset, where the filter is only stored aside and the Scanner is created when data is actually retrieved.
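A sketch of the lazy pattern described above, as it can already be written with {{Scanner}} today; the directory path is hypothetical, and the proposed {{Dataset.filter}} would essentially wrap this:

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")  # hypothetical dataset directory

# Attaching the filter to a Scanner stores the expression without reading
# anything; no data is materialised at this point.
scanner = dataset.scanner(filter=ds.field("x") > 5)

# Only here does the scan actually run, reading and filtering the data.
table = scanner.to_table()
{code}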
[jira] [Created] (ARROW-16615) An intermittent crash may appear when reading a table from a JSON file
Jack Tondon created ARROW-16615:
--------------------------------

             Summary: An intermittent crash may appear when reading a table from a JSON file
                 Key: ARROW-16615
                 URL: https://issues.apache.org/jira/browse/ARROW-16615
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 8.0.0
            Reporter: Jack Tondon
         Attachments: test.json, test_arrow_json.cpp

An intermittent crash may occur when reading a table from a JSON file. arrow and parquet were installed via apt-get.

g++ test_arrow_json.cpp -o test_arrow_json -larrow -lparquet && ./test_arrow_json

/build/apache-arrow-8.0.0/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: NotImplemented: JSON conversion to struct>, light_bboxes: list>, countdown: timestamp[s]> is not supported
/usr/lib/x86_64-linux-gnu/libarrow.so.800(+0x39e131)[0x7f07b843e131]
/usr/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f07b878e83d]
/usr/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x17d)[0x7f07b8626e8d]
./test_arrow_json(+0x1b9b)[0x5613e4b53b9b]
./test_arrow_json(+0x12f2)[0x5613e4b532f2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f07b772fc87]
./test_arrow_json(+0xfba)[0x5613e4b52fba]
Aborted (core dumped)