[jira] [Created] (ARROW-11468) [R] Allow user to pass schema to read_json_arrow()
Ian Cook created ARROW-11468:
---------------------------------

Summary: [R] Allow user to pass schema to read_json_arrow()
Key: ARROW-11468
URL: https://issues.apache.org/jira/browse/ARROW-11468
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 3.0.0
Reporter: Ian Cook
Assignee: Ian Cook

The {{read_json_arrow()}} function lacks a {{schema}} argument, and it is not possible to specify a schema through {{JsonParseOptions}}. PyArrow allows the user to pass a schema to {{read_json()}} through {{ParseOptions}} to bypass automatic type inference. Implement this in the R package.
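For reference, a minimal PyArrow sketch of the behavior being requested for R: an explicit schema passed through {{ParseOptions}} bypasses type inference in {{read_json()}}. The file and field names below are illustrative.

{code:python}
import pyarrow as pa
from pyarrow import json

# An explicit schema disables automatic type inference for these fields.
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
opts = json.ParseOptions(explicit_schema=schema)

# "data.jsonl" is a hypothetical newline-delimited JSON file containing
# only the two fields above.
table = json.read_json("data.jsonl", parse_options=opts)
assert table.schema.equals(schema)
{code}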
[jira] [Created] (ARROW-11467) [R] Fix reference to json_table_reader() in R docs
Ian Cook created ARROW-11467:
---------------------------------

Summary: [R] Fix reference to json_table_reader() in R docs
Key: ARROW-11467
URL: https://issues.apache.org/jira/browse/ARROW-11467
Project: Apache Arrow
Issue Type: Task
Components: R
Affects Versions: 3.0.0
Reporter: Ian Cook
Assignee: Ian Cook

The docs entry for the R function {{read_json_arrow()}} refers to the nonexistent function {{json_table_reader()}}. This should be changed to {{JsonTableReader$create()}}.
[jira] [Created] (ARROW-11466) [Flight][Go] Add BasicAuth and BearerToken handlers for Go
Matt Topol created ARROW-11466:
---------------------------------

Summary: [Flight][Go] Add BasicAuth and BearerToken handlers for Go
Key: ARROW-11466
URL: https://issues.apache.org/jira/browse/ARROW-11466
Project: Apache Arrow
Issue Type: Improvement
Reporter: Matt Topol
Assignee: Matt Topol

Like ARROW-10487 did for C++ Flight clients, there should be helpers that make it easier to use Basic Authentication (via base64 encoding) and bearer tokens in the Go Flight client and server.
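For context, a minimal Python sketch of the header convention such helpers typically wrap: the standard {{authorization}} gRPC metadata header with the {{Basic}} and {{Bearer}} schemes. Function names here are illustrative, not the proposed Go API.

{code:python}
import base64
from typing import Tuple

def basic_auth_header(username: str, password: str) -> Tuple[str, str]:
    # Basic scheme: base64-encode "username:password".
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return ("authorization", f"Basic {token}")

def bearer_auth_header(token: str) -> Tuple[str, str]:
    # Bearer scheme: the server typically issues this token after a
    # successful Basic handshake.
    return ("authorization", f"Bearer {token}")
{code}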
[jira] [Created] (ARROW-11465) Parquet file writer snapshot API and proper ColumnChunk.file_path utilization
Radu Teodorescu created ARROW-11465:
---------------------------------

Summary: Parquet file writer snapshot API and proper ColumnChunk.file_path utilization
Key: ARROW-11465
URL: https://issues.apache.org/jira/browse/ARROW-11465
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 3.0.0
Reporter: Radu Teodorescu
Assignee: Radu Teodorescu
Fix For: 4.0.0

This is a follow-up to the thread: [https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3ccdd00783-0ffc-4934-aa24-529fb2a44...@yahoo.com%3e]

The specific use case I am targeting is the ability to partially read a Parquet file while it is still being written to. This is relevant for any process that records events over a long period of time and writes them to Parquet (tracing data, logging events, or any other live time series).

The solution relies on the fact that the Parquet specification allows column chunk metadata to point explicitly to its location in a file, which can theoretically be different from the file containing the metadata (as covered in other threads, this behavior is not fully supported by major Parquet implementations).

My solution is centered around adding a method, {{void ParquetFileWriter::Snapshot(const std::string& data_path, std::shared_ptr<::arrow::io::OutputStream>& sink)}}, that writes the metadata for the RowGroups written so far to the {{sink}} stream and updates each ColumnChunk's {{file_path}} to point to {{data_path}}. This was intended as a minimalist change to {{ParquetFileWriter}}.

On the reading side I implemented full support for {{ColumnChunk.file_path}} by introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}} in the {{ParquetFileReader}} implementation stack. In the PR implementation one can default to the current behavior by using the {{SingleFile}} class, get full read support for multi-file Parquet in line with the specification by using the {{MultiReadableFile}} implementation (which captures the metadata file's base directory and resolves each {{ColumnChunk.file_path}} against it), or provide a separate implementation for non-POSIX file system storage.

For an example, see the {{write_parquet_file_with_snapshot}} function in reader-writer.cc, which illustrates the snapshotting write; the {{read_whole_file}} function has been modified to read one of the snapshots (I will roll back that change and provide a separate example before the merge).
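For readers unfamiliar with the metadata field involved: each column chunk in the Parquet footer carries an optional {{file_path}}. A hedged PyArrow sketch for inspecting it follows; the file name is illustrative, and most writers today leave the field empty, meaning "same file as the metadata".

{code:python}
import pyarrow.parquet as pq

# Inspect where each column chunk claims its data lives.
meta = pq.ParquetFile("events.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(rg, col, chunk.file_path or "<same file>")
{code}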
[jira] [Created] (ARROW-11464) [Python] pyarrow.parquet.read_pandas doesn't conform to its docs
Pac A. He created ARROW-11464:
---------------------------------

Summary: [Python] pyarrow.parquet.read_pandas doesn't conform to its docs
Key: ARROW-11464
URL: https://issues.apache.org/jira/browse/ARROW-11464
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0
Environment: latest
Reporter: Pac A. He

The {{pyarrow.parquet.read_pandas}} [implementation|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L1740-L1754] doesn't conform to its [docs|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_pandas.html#pyarrow.parquet.read_pandas] in at least these ways:

# The docs state that a {{filesystem}} option can be provided, as it should be. The implementation, however, doesn't have this option, so I currently cannot use it to read from S3.
# The docs state that the default value of {{use_legacy_dataset}} is False, whereas the implementation uses True.

It looks to have been implemented and reviewed pretty carelessly.
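Until {{read_pandas()}} accepts the documented argument, a possible workaround is {{pyarrow.parquet.read_table()}}, which does accept {{filesystem}}; the bucket, region, and path below are hypothetical:

{code:python}
import pyarrow.parquet as pq
from pyarrow import fs

# read_table() accepts a filesystem argument, unlike the current read_pandas().
s3 = fs.S3FileSystem(region="us-east-1")
table = pq.read_table("my-bucket/data.parquet", filesystem=s3)
df = table.to_pandas()
{code}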
[jira] [Created] (ARROW-11463) Allow configuration of IpcWriteOptions 64Bit from PyArrow
Leonard Lausen created ARROW-11463:
---------------------------------

Summary: Allow configuration of IpcWriteOptions 64Bit from PyArrow
Key: ARROW-11463
URL: https://issues.apache.org/jira/browse/ARROW-11463
Project: Apache Arrow
Issue Type: Task
Components: Python
Reporter: Leonard Lausen

For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will be around 1000x slower compared to `pyarrow.Table.take` on the same table with combined chunks (1 chunk). Unfortunately, if such a table contains a large list data type, it's easy for the flattened table to contain more than 2**31 rows, and serialization (eg for the Plasma store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`.

I couldn't find a way to enable 64-bit support for the serialization as called from Python (`IpcWriteOptions` in Python does not expose the `CIpcWriteOptions` 64-bit setting; further, the Python serialization APIs do not allow specification of `IpcWriteOptions`).

I was able to serialize successfully after changing the default and rebuilding:

```
modified cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;
   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
```
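For context on the chunking side of the report, a minimal sketch of the slow path and the usual mitigation; sizes are illustrative, and the capacity error arises when the consolidated data is then serialized:

{code:python}
import pyarrow as pa

# A table assembled from many small batches has many chunks; take() on
# such a table is far slower than on a contiguous one.
batches = [pa.RecordBatch.from_pydict({"x": [i]}) for i in range(20_000)]
table = pa.Table.from_batches(batches)

# Consolidating into one chunk restores take() performance, but for large
# list columns the flattened child array can then exceed 2**31 - 1
# elements, which is where the ArrowCapacityError above comes from.
combined = table.combine_chunks()
subset = combined.take(pa.array([0, 10_000, 19_999]))
{code}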
[jira] [Created] (ARROW-11462) [Developer] Remove needless quote from the default DOCKER_VOLUME_PREFIX
Kouhei Sutou created ARROW-11462:
---------------------------------

Summary: [Developer] Remove needless quote from the default DOCKER_VOLUME_PREFIX
Key: ARROW-11462
URL: https://issues.apache.org/jira/browse/ARROW-11462
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
[jira] [Created] (ARROW-11461) [Flight][Go] GetSchema does not work with Java Flight Server
Matt Topol created ARROW-11461:
---------------------------------

Summary: [Flight][Go] GetSchema does not work with Java Flight Server
Key: ARROW-11461
URL: https://issues.apache.org/jira/browse/ARROW-11461
Project: Apache Arrow
Issue Type: Bug
Components: FlightRPC, Go
Reporter: Matt Topol
Assignee: Matt Topol

Although Flight.proto describes the schema field as the "schema of the dataset as described in Schema.fbs::Schema", implementations seem to use a fully serialized RecordBatch, just with 0 rows, for the schema bytes fields in GetFlightInfo and GetSchema. The Go implementation should follow suit.
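A hedged illustration of the wire format being described, using PyArrow rather than Go: an IPC stream that contains a schema message and no record batches is effectively a serialized zero-row dataset, which appears to match what the other Flight implementations put in those schema bytes fields.

{code:python}
import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# Open and close an IPC stream without writing any batches: the payload
# contains just the schema message.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:
    pass  # zero rows written
payload = sink.getvalue()

# Reading it back recovers the schema.
reader = pa.ipc.open_stream(payload)
assert reader.schema.equals(schema)
{code}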
[jira] [Created] (ARROW-11460) [R] Use system compression libraries if present on Linux
Neal Richardson created ARROW-11460:
---------------------------------

Summary: [R] Use system compression libraries if present on Linux
Key: ARROW-11460
URL: https://issues.apache.org/jira/browse/ARROW-11460
Project: Apache Arrow
Issue Type: New Feature
Components: R
Reporter: Neal Richardson

We vendor/bundle all compression libraries and have them disabled in the default build. This is reliable, but it would be nice to use system libraries if they're present. It's not as simple as setting {{ARROW_DEPENDENCY_SOURCE=AUTO}} because we have to know if we're using them in order to set the right {{-lwhatever}} flags in the R package build. Maybe these can be determined from the C++ build/cmake output rather than detected outside the build (but this may require ARROW-6312).
[jira] [Created] (ARROW-11459) [Rust] Allow ListArray of primitives to be built from iterator
Jorge Leitão created ARROW-11459:
---------------------------------

Summary: [Rust] Allow ListArray of primitives to be built from iterator
Key: ARROW-11459
URL: https://issues.apache.org/jira/browse/ARROW-11459
Project: Apache Arrow
Issue Type: New Feature
Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão
[jira] [Created] (ARROW-11458) PyArrow 1.x and 2.x do not work with numpy 1.20
Zhuo Peng created ARROW-11458:
---------------------------------

Summary: PyArrow 1.x and 2.x do not work with numpy 1.20
Key: ARROW-11458
URL: https://issues.apache.org/jira/browse/ARROW-11458
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0, 1.0.1, 1.0.0
Reporter: Zhuo Peng

Numpy 1.20 was released on 1/30 and it is not compatible with libraries that were built against numpy<1.16.6, which is the case for pyarrow 1.x and 2.x. However, pyarrow does not specify an upper bound for the numpy version [1].

```
Python 3.7.9 (default, Oct 30 2020, 13:50:59)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import numpy as np
>>> np.__version__
'1.20.0'
>>> pa.__version__
'2.0.0'
>>> pa.array(np.arange(10))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
  File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
  File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
```

[1] https://github.com/apache/arrow/blob/478286658055bb91737394c2065b92a7e92fb0c1/python/setup.py#L572
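A hedged sketch of the kind of upper bound the report is asking for; the exact pin is illustrative, not a vetted constraint:

{code:python}
# setup.py fragment (hypothetical): keep numpy below 1.20 so wheels built
# against the older numpy C ABI are not installed alongside numpy 1.20+.
install_requires = [
    "numpy >= 1.16.6, < 1.20.0",
]
{code}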
[jira] [Created] (ARROW-11457) [Rust] Make string comparison kernels generic over Utf8 and LargeUtf8
Andrew Lamb created ARROW-11457:
---------------------------------

Summary: [Rust] Make string comparison kernels generic over Utf8 and LargeUtf8
Key: ARROW-11457
URL: https://issues.apache.org/jira/browse/ARROW-11457
Project: Apache Arrow
Issue Type: Improvement
Reporter: Andrew Lamb
Assignee: Ritchie
[jira] [Created] (ARROW-11456) OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements
Pac A. He created ARROW-11456:
---------------------------------

Summary: OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements
Key: ARROW-11456
URL: https://issues.apache.org/jira/browse/ARROW-11456
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0, 2.0.0
Environment: pyarrow 3.0.0 / 2.0.0
pandas 1.2.1
Reporter: Pac A. He

When reading a large Parquet file, I get this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")

  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
    return impl.read(
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
    return self.api.parquet.read_table(
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read
    return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
{noformat}

Isn't pyarrow supposed to support large Parquet files? It let me write this file, but now it doesn't let me read it back. I don't understand why arrow uses [32-bit computing|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] in a 64-bit world.
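A possible mitigation while the limit stands, assuming the file was written with multiple row groups: read it one row group at a time so that no single BinaryBuilder has to hold the entire column. The file name is hypothetical.

{code:python}
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet")

# Each row group is materialized separately, keeping every individual
# binary array under the 2**31 - 1 element limit.
frames = [pf.read_row_group(i).to_pandas() for i in range(pf.num_row_groups)]
df = pd.concat(frames, ignore_index=True)
{code}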
[jira] [Created] (ARROW-11455) [R] Improve handling of -2^31 in 32-bit integer fields
Ian Cook created ARROW-11455:
---------------------------------

Summary: [R] Improve handling of -2^31 in 32-bit integer fields
Key: ARROW-11455
URL: https://issues.apache.org/jira/browse/ARROW-11455
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 3.0.0
Reporter: Ian Cook
Assignee: Ian Cook

R’s {{integer}} range is 1 smaller than the normal 32-bit integer range of C++, Java, etc. In R, it’s {{-2^31 + 1}} to {{2^31 - 1}}. Elsewhere, it’s {{-2^31}} to {{2^31 - 1}}. So R's native {{integer}} type cannot represent {{-2^31}} ({{-2147483648}}). If you run {{-2147483648L}} in R, it converts it to {{numeric}} and issues a warning:

{code:java}
Warning message:
non-integer value 2147483648L qualified with L; using numeric value
{code}

In the {{arrow}} R package, when a 32-bit integer Arrow field containing the value {{-2147483648}} is converted to an R {{integer}} vector, the value is silently converted to {{NA_integer_}}. Consider whether we should handle this case differently and whether it is feasible to do so without performance regressions. Other possible behaviors might be:
* Converting the value to {{NA_integer_}} with a warning
* Converting the field to {{bit64::integer64}} with a warning
* Converting the field to {{base::numeric}} with a warning
* Allowing the user to specify an argument or option to control the behavior
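For contrast, the value is perfectly representable on the Arrow side; a minimal PyArrow sketch (Python here because the R behavior is already shown above):

{code:python}
import pyarrow as pa

# INT32_MIN is a valid Arrow int32 value; R's native integer cannot hold
# it, which is why the R package currently maps it to NA_integer_.
arr = pa.array([-2**31, 2**31 - 1], type=pa.int32())
print(arr.to_pylist())  # [-2147483648, 2147483647]
{code}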
[jira] [Created] (ARROW-11454) [Website] [Rust] 3.0.0 Blog Post
Andy Grove created ARROW-11454:
---------------------------------

Summary: [Website] [Rust] 3.0.0 Blog Post
Key: ARROW-11454
URL: https://issues.apache.org/jira/browse/ARROW-11454
Project: Apache Arrow
Issue Type: Improvement
Components: Rust, Website
Reporter: Andy Grove
Assignee: Andy Grove
Fix For: 3.0.0
[jira] [Created] (ARROW-11453) [Python] [Dataset] Unable to use write_dataset() to Azure Blob with adlfs 0.6.0
Lance Dacey created ARROW-11453:
---------------------------------

Summary: [Python] [Dataset] Unable to use write_dataset() to Azure Blob with adlfs 0.6.0
Key: ARROW-11453
URL: https://issues.apache.org/jira/browse/ARROW-11453
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0
Environment: This environment results in an error:
adlfs v0.6.0
fsspec 0.8.5
azure.storage.blob 12.6.0
adal 1.2.6
pandas 1.2.1
pyarrow 3.0.0
Reporter: Lance Dacey

https://github.com/dask/adlfs/issues/171

I am unable to save data to Azure Blob using ds.write_dataset() with pyarrow 3.0 and adlfs 0.6.0. Reverting to adlfs 0.5.9 fixes the issue, but I am not sure what the cause is - posting this here in case there were filesystem changes in pyarrow recently which are incompatible with changes made in adlfs.

{code:java}
  File "pyarrow/_dataset.pyx", line 2343, in pyarrow._dataset._filesystemdataset_write
  File "pyarrow/_fs.pyx", line 1032, in pyarrow._fs._cb_create_dir
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py", line 259, in create_dir
    self.fs.mkdir(path, create_parents=recursive)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/opt/conda/lib/python3.8/site-packages/adlfs/spec.py", line 1033, in _mkdir
    raise FileExistsError(
FileExistsError: Cannot overwrite existing Azure container -- dev already exists.
{code}
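A hedged sketch of the failing call path; the account credentials, container, and data are placeholders, not taken from the report:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

# Hypothetical credentials; the report writes into an existing "dev" container.
fs = AzureBlobFileSystem(account_name="myaccount", account_key="...")

table = pa.table({"x": [1, 2, 3]})
# pyarrow wraps the fsspec filesystem and calls create_dir(), which with
# adlfs 0.6.0 raises FileExistsError because the container already exists.
ds.write_dataset(table, "dev/output", format="parquet", filesystem=fs)
{code}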