[jira] [Created] (ARROW-16623) [GLib] Add support for QuantileOptions

2022-05-19 Thread Hirokazu SUZUKI (Jira)
Hirokazu SUZUKI created ARROW-16623:
---

 Summary: [GLib] Add support for QuantileOptions
 Key: ARROW-16623
 URL: https://issues.apache.org/jira/browse/ARROW-16623
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Hirokazu SUZUKI


No options are available for quantile in Ruby. Supporting them requires a GLib implementation of QuantileOptions.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16622) [R] Detect compression by magic number where possible

2022-05-19 Thread Sam Albers (Jira)
Sam Albers created ARROW-16622:
--

 Summary: [R] Detect compression by magic number where possible
 Key: ARROW-16622
 URL: https://issues.apache.org/jira/browse/ARROW-16622
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 8.0.0
Reporter: Sam Albers


readr does this like so:
[https://github.com/tidyverse/readr/commit/3e1195762a204fd053ba0d7d88ed3e80a1810510]

I think we could largely take the same approach, at least as a first pass at 
detection, perhaps falling back on file extensions. This would be a little 
safer than relying on extensions alone. 

This would apply to read functions, though if it is not too expensive we could 
also check outputs to confirm a successful write.
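
As a rough illustration of the approach described above, here is a minimal Python sketch that sniffs the leading bytes of a file and falls back on the extension only when the bytes are inconclusive. The names sniff_codec, detect_compression, MAGIC_NUMBERS and EXTENSIONS are hypothetical and are not part of arrow or readr; the byte signatures are the published ones for each format.

```python
import gzip
import os
import tempfile
from typing import Optional

# Leading "magic" bytes of some common compression formats.
# Illustrative table only, not an arrow or readr data structure.
MAGIC_NUMBERS = {
    "gzip": b"\x1f\x8b",
    "bz2": b"BZh",
    "xz": b"\xfd7zXZ\x00",
    "zstd": b"\x28\xb5\x2f\xfd",
    "lz4": b"\x04\x22\x4d\x18",
}

# Extension fallback, used only when no signature matches.
EXTENSIONS = {".gz": "gzip", ".bz2": "bz2", ".xz": "xz", ".zst": "zstd", ".lz4": "lz4"}


def sniff_codec(head: bytes) -> Optional[str]:
    """Return the codec whose signature prefixes `head`, or None."""
    for codec, magic in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return codec
    return None


def detect_compression(path: str) -> Optional[str]:
    """Prefer the file's leading bytes; fall back on its extension."""
    with open(path, "rb") as f:
        codec = sniff_codec(f.read(8))
    if codec is not None:
        return codec
    for ext, fallback in EXTENSIONS.items():
        if path.endswith(ext):
            return fallback
    return None


# Usage: a gzip file with a misleading ".csv" name is still detected.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "wb") as f:
        f.write(gzip.compress(b"a,b\n1,2\n"))
    detected = detect_compression(path)
print(detected)
```

Reading only the first few bytes keeps the check cheap, which is also why the same test could plausibly be run on a freshly written output.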





[jira] [Created] (ARROW-16621) [C++][Python] Read ACID Hive tables

2022-05-19 Thread Ian Alexander Joiner (Jira)
Ian Alexander Joiner created ARROW-16621:


 Summary: [C++][Python] Read ACID Hive tables
 Key: ARROW-16621
 URL: https://issues.apache.org/jira/browse/ARROW-16621
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ian Alexander Joiner


Since Hive 3 produces ACID tables by default, it would be great if Arrow could 
read them directly from HDFS.





[jira] [Created] (ARROW-16620) open_dataset fails to open single compressed csv

2022-05-19 Thread Carl Boettiger (Jira)
Carl Boettiger created ARROW-16620:
--

 Summary: open_dataset fails to open single compressed csv
 Key: ARROW-16620
 URL: https://issues.apache.org/jira/browse/ARROW-16620
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Carl Boettiger


 

The following fails:
{code:java}
bucket <- s3_bucket("targets/aquatics", 
endpoint_override="data.ecoforecast.org")
x <- open_dataset(bucket$path("aquatics-targets.csv.gz"), format="csv") {code}
This is surprising since pointing to an individual parquet file path is fine:


{code:java}
bucket <- s3_bucket("scores/parquet/aquatics/2022", 
endpoint_override="data.ecoforecast.org")
x <- open_dataset(bucket$path("aquatics-2022-05-18-climatology.parquet")) {code}
Maybe related to discussion in 
https://issues.apache.org/jira/browse/ARROW-15060 or maybe not?  In this 
context I'm thinking only about read.  The above examples use public buckets so 
should be reproducible with no credentials.

 





[jira] [Created] (ARROW-16619) read_csv_arrow / open_dataset over https connection?

2022-05-19 Thread Carl Boettiger (Jira)
Carl Boettiger created ARROW-16619:
--

 Summary: read_csv_arrow / open_dataset over https connection?
 Key: ARROW-16619
 URL: https://issues.apache.org/jira/browse/ARROW-16619
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Carl Boettiger


Currently, remote access to data (particularly lazy read, an immensely powerful 
arrow ability) only works for data in an S3-compliant object store (though I 
know Azure support is in the works).  It would be really fantastic if we could 
have remote access over HTTPS (I think this already works on the python side 
thanks to fsspec).  

For example, this fails in arrow but works in readr:


arrow::read_csv_arrow("https://data.ecoforecast.org/targets/aquatics/aquatics-targets.csv.gz")

readr::read_csv("https://data.ecoforecast.org/targets/aquatics/aquatics-targets.csv.gz")

I think this ability would be even more compelling in `open_dataset()`, since 
it opens up all the power of lazy read access.  Most servers support HTTP 
range requests, so it seems this should be possible.  (We can already do 
something similar from duckdb+R, but only after manually opting in to the http 
extension, and only for parquet.)
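
For context, a range request is an ordinary GET carrying a Range header. The sketch below uses only the Python standard library and builds such a request without sending it. The helper name build_range_request and the example.org URL are placeholders, and nothing here reflects how arrow would implement an HTTPS filesystem.

```python
from urllib.request import Request


def build_range_request(url: str, start: int, end: int) -> Request:
    """Build (but do not send) a GET for bytes start..end, inclusive."""
    if start < 0 or end < start:
        raise ValueError("invalid byte range")
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Placeholder URL; ask for the first 8 bytes of the remote file.
req = build_range_request("https://example.org/data.parquet", 0, 7)
print(req.get_method(), req.get_header("Range"))

# Sending it would be: urllib.request.urlopen(req).
# A server that honours the range answers 206 Partial Content;
# one that ignores it answers 200 with the whole body.
```

A lazy reader would issue many such small requests (for example for a file footer and then for selected column chunks) instead of downloading the whole file.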





[jira] [Created] (ARROW-16618) [C++][Python] strptime fails to parse with %p on Windows

2022-05-19 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-16618:
--

 Summary: [C++][Python] strptime fails to parse with %p on Windows
 Key: ARROW-16618
 URL: https://issues.apache.org/jira/browse/ARROW-16618
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Rok Mihevc


As reported in https://github.com/apache/arrow/issues/13111, parsing a timestamp 
with %p fails on Windows. This is probably due to issues with the vendored 
strptime on Windows locales.
We should explore which flags can be enabled and how. The strptime test suite 
should also be expanded: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L1842-L1890.
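
As a reference for the behaviour %p is expected to have, the sketch below uses Python's standard-library datetime.strptime, not Arrow's strptime kernel, and assumes the default C/English locale where the markers are "AM" and "PM". The sample strings are made up for illustration.

```python
from datetime import datetime

# %I is the 12-hour clock hour; %p supplies the AM/PM marker.
fmt = "%Y-%m-%d %I:%M %p"

afternoon = datetime.strptime("2022-05-19 03:15 PM", fmt)
midnight = datetime.strptime("2022-05-19 12:00 AM", fmt)

print(afternoon.hour, midnight.hour)
```

Cases like these (PM shifting the hour by twelve, 12 AM mapping to hour zero) are the kind of input an expanded test suite could cover.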





[jira] [Created] (ARROW-16617) [C++] WinErrorMessage() should not use Windows ANSI APIs

2022-05-19 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16617:
--

 Summary: [C++] WinErrorMessage() should not use Windows ANSI APIs
 Key: ARROW-16617
 URL: https://issues.apache.org/jira/browse/ARROW-16617
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


The {{WinErrorMessage}} utility function calls {{FormatMessageA}} in order to 
get the Windows error message. This unfortunately returns the message encoded 
using the current "codepage", which can give unreadable results if there are 
non-ASCII characters in it.

Instead, we should probably use {{FormatMessageW}} and then convert to UTF-8. 
At least {{PyArrow}} expects the error message in a {{Status}} to be 
UTF-8 encoded.
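
To make the failure mode concrete, here is a small Python sketch of the encoding mismatch. It does not call any Windows API: the sample text, the choice of code page 1252 and the helper name to_utf8_from_wide are assumptions for illustration only.

```python
message = "Accès refusé"  # sample text with non-ASCII characters

# Bytes as an ANSI ("A") API would return them under code page 1252.
ansi_bytes = message.encode("cp1252")

# Bytes as a wide ("W") API would return them: UTF-16 code units.
wide_bytes = message.encode("utf-16-le")


def to_utf8_from_wide(raw: bytes) -> bytes:
    """Convert UTF-16-LE bytes to UTF-8 bytes."""
    return raw.decode("utf-16-le").encode("utf-8")


# A consumer that assumes UTF-8 cannot decode the code-page bytes.
try:
    ansi_bytes.decode("utf-8")
    ansi_is_valid_utf8 = True
except UnicodeDecodeError:
    ansi_is_valid_utf8 = False

utf8_bytes = to_utf8_from_wide(wide_bytes)
print(ansi_is_valid_utf8, utf8_bytes.decode("utf-8"))
```

The wide path is independent of the current code page, which is why converting from it gives a predictable UTF-8 result.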





[jira] [Created] (ARROW-16616) [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method

2022-05-19 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16616:
-

 Summary: [Python] Allow lazy evaluation of filters in Dataset and 
add Dataset.filter method
 Key: ARROW-16616
 URL: https://issues.apache.org/jira/browse/ARROW-16616
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alessandro Molina
 Fix For: 9.0.0


To keep the {{Dataset}} API compatible with the {{Table}} one in terms of 
analytics capabilities, we should add a {{Dataset.filter}} method. The initial 
POC was based on {{_table_filter}}, but that required materialising all the 
{{Dataset}} content after filtering, as it returned an {{InMemoryDataset}}. 

Given that {{Scanner}} can filter a dataset without actually materialising the 
data until a final step happens, it would be good to have {{Dataset.filter}} 
return some form of lazy dataset, where the filter is only stored and the 
{{Scanner}} is created when data is actually retrieved.
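
The idea can be sketched with a toy class in plain Python. LazyFilteredDataset, to_rows and read_rows are hypothetical names and do not correspond to the pyarrow API; the point is only that filter() records the predicate and the source is read when rows are requested.

```python
from typing import Callable, Iterable, List, Optional

Row = dict
Predicate = Callable[[Row], bool]


class LazyFilteredDataset:
    """Toy stand-in for a dataset that defers filtering until read time."""

    def __init__(self, source: Callable[[], Iterable[Row]],
                 predicates: Optional[List[Predicate]] = None):
        self._source = source
        self._predicates = list(predicates or [])

    def filter(self, predicate: Predicate) -> "LazyFilteredDataset":
        # No data is touched here; the predicate is only stored.
        return LazyFilteredDataset(self._source, self._predicates + [predicate])

    def to_rows(self) -> List[Row]:
        # The source is scanned only now, applying every stored predicate.
        return [row for row in self._source()
                if all(p(row) for p in self._predicates)]


scan_log = []  # records each time the underlying data is read


def read_rows() -> List[Row]:
    scan_log.append("scan")
    return [{"x": 1}, {"x": 5}, {"x": 9}]


ds = LazyFilteredDataset(read_rows).filter(lambda r: r["x"] > 2).filter(lambda r: r["x"] < 9)
scans_before = len(scan_log)  # filters were chained without reading data
rows = ds.to_rows()
print(scans_before, len(scan_log), rows)
```

Chained filters compose into a single pass over the data, which is the property the {{Scanner}}-based design is meant to preserve.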





[jira] [Created] (ARROW-16615) An intermittent crash may appear when reading a table from a JSON file.

2022-05-19 Thread Jack Tondon (Jira)
Jack Tondon created ARROW-16615:
---

 Summary: An intermittent crash may appear when reading a table from a 
JSON file.
 Key: ARROW-16615
 URL: https://issues.apache.org/jira/browse/ARROW-16615
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 8.0.0
Reporter: Jack Tondon
 Attachments: test.json, test_arrow_json.cpp

An intermittent crash may appear when reading a table from a JSON file.

Arrow and Parquet were installed with apt-get. The attached program was compiled and run with:

g++ test_arrow_json.cpp -o test_arrow_json -larrow -lparquet && ./test_arrow_json

/build/apache-arrow-8.0.0/cpp/src/arrow/result.cc:28: ValueOrDie called on an 
error: NotImplemented: JSON conversion to struct>, 
light_bboxes: list>, countdown: timestamp[s]> is not supported
/usr/lib/x86_64-linux-gnu/libarrow.so.800(+0x39e131)[0x7f07b843e131]
/usr/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f07b878e83d]
/usr/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x17d)[0x7f07b8626e8d]
./test_arrow_json(+0x1b9b)[0x5613e4b53b9b]
./test_arrow_json(+0x12f2)[0x5613e4b532f2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f07b772fc87]
./test_arrow_json(+0xfba)[0x5613e4b52fba]
Aborted (core dumped)


