[jira] [Created] (ARROW-11468) [R] Allow user to pass schema to read_json_arrow()

2021-02-01 Thread Ian Cook (Jira)
Ian Cook created ARROW-11468:


 Summary: [R] Allow user to pass schema to read_json_arrow()
 Key: ARROW-11468
 URL: https://issues.apache.org/jira/browse/ARROW-11468
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 3.0.0
Reporter: Ian Cook
Assignee: Ian Cook


The {{read_json_arrow()}} function lacks a {{schema}} argument, and it is not 
possible to specify a schema through {{JsonParseOptions}}. PyArrow allows the 
user to pass a schema to {{read_json()}} through {{ParseOptions}} to bypass 
automatic type inference. Implement this in the R package.
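
For reference, the PyArrow equivalent looks roughly like this (a minimal sketch; the file name and schema below are made up):

{code:python}
import pyarrow as pa
from pyarrow import json

# Hypothetical schema; passing it via ParseOptions bypasses type inference.
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
opts = json.ParseOptions(explicit_schema=schema)
table = json.read_json("data.jsonl", parse_options=opts)
{code}

Something analogous (e.g. a {{schema}} argument on {{read_json_arrow()}}) would be the R-side equivalent.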





[jira] [Created] (ARROW-11467) [R] Fix reference to json_table_reader() in R docs

2021-02-01 Thread Ian Cook (Jira)
Ian Cook created ARROW-11467:


 Summary: [R] Fix reference to json_table_reader() in R docs
 Key: ARROW-11467
 URL: https://issues.apache.org/jira/browse/ARROW-11467
 Project: Apache Arrow
  Issue Type: Task
  Components: R
Affects Versions: 3.0.0
Reporter: Ian Cook
Assignee: Ian Cook


The docs entry for the R function {{read_json_arrow()}} refers to the 
nonexistent function {{json_table_reader()}}. This should be changed to 
{{JsonTableReader$create()}}.





[jira] [Created] (ARROW-11466) [Flight][Go] Add BasicAuth and BearerToken handlers for Go

2021-02-01 Thread Matt Topol (Jira)
Matt Topol created ARROW-11466:
--

 Summary: [Flight][Go] Add BasicAuth and BearerToken handlers for Go
 Key: ARROW-11466
 URL: https://issues.apache.org/jira/browse/ARROW-11466
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Matt Topol
Assignee: Matt Topol


Like ARROW-10487 did for the C++ Flight clients, there should be helpers that 
make it easier to use basic authentication (via base64 encoding) and bearer 
tokens in the Go Flight client and server.
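
For illustration (Python rather than Go, and the helper name is made up), the basic-auth part is essentially just building a base64-encoded {{Authorization}} header value:

{code:python}
import base64

def basic_auth_header(username: str, password: str) -> str:
    # HTTP basic auth: base64-encode "username:password", prefix with "Basic ".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

print(basic_auth_header("user", "secret"))  # Basic dXNlcjpzZWNyZXQ=
{code}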





[jira] [Created] (ARROW-11465) Parquet file writer snapshot API and proper ColumnChunk.file_path utilization

2021-02-01 Thread Radu Teodorescu (Jira)
Radu Teodorescu created ARROW-11465:
---

 Summary: Parquet file writer snapshot API and proper 
ColumnChunk.file_path utilization
 Key: ARROW-11465
 URL: https://issues.apache.org/jira/browse/ARROW-11465
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 3.0.0
Reporter: Radu Teodorescu
Assignee: Radu Teodorescu
 Fix For: 4.0.0


This is a follow up to the thread:
[https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3ccdd00783-0ffc-4934-aa24-529fb2a44...@yahoo.com%3e]

The specific use case I am targeting is the ability to partially read a 
parquet file while it is still being written to. This is relevant for any 
process that records events over a long period of time and writes them to 
parquet (tracing data, logging events, or any other live time series).
The solution relies on the fact that the parquet specification allows column 
chunk metadata to point explicitly to the location of its data in a file that 
can be different from the file containing the metadata (as covered in other 
threads, this behavior is not fully supported by major parquet implementations).
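
(For context, a small pyarrow sketch of where that per-chunk path lives; the file name is a placeholder. For ordinary single-file parquet the field is empty, whereas a metadata-only snapshot would point it at the external data file.)

{code:python}
import pyarrow.parquet as pq

# Inspect the file_path recorded in each ColumnChunk's metadata.
md = pq.ParquetFile("example.parquet").metadata
for rg in range(md.num_row_groups):
    print(rg, md.row_group(rg).column(0).file_path)
{code}
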
My solution is centered around adding a method, {{void 
ParquetFileWriter::Snapshot(const std::string& data_path, 
std::shared_ptr<::arrow::io::OutputStream>& sink)}}, that writes the metadata 
for the RowGroups given so far to the {{sink}} stream and updates all the 
ColumnChunk metadata {{file_path}} fields to point to {{data_path}}. This was 
intended as a minimalist change to {{ParquetFileWriter}}.

On the reading side I implemented full support for ColumnChunk.file_path by 
introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}} in 
the {{ParquetFileReader}} implementation stack. In the PR, one can keep the 
current behavior by using the {{SingleFile}} class, get full read support for 
multi-file parquet in line with the specification by using the 
{{MultiReadableFile}} implementation (which captures the metadata file's base 
directory and uses it as the base directory for ColumnChunk.file_path), or 
provide a separate implementation for non-POSIX file system storage.

For an example, see the {{write_parquet_file_with_snapshot}} function in 
reader-writer.cc, which illustrates the snapshotting write; the 
{{read_whole_file}} function has been modified to read one of the snapshots (I 
will roll back that change and provide a separate example before the merge).





[jira] [Created] (ARROW-11464) [Python] pyarrow.parquet.read_pandas doesn't conform to its docs

2021-02-01 Thread Pac A. He (Jira)
Pac A. He created ARROW-11464:
-

 Summary: [Python] pyarrow.parquet.read_pandas doesn't conform to 
its docs
 Key: ARROW-11464
 URL: https://issues.apache.org/jira/browse/ARROW-11464
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 3.0.0
 Environment: latest
Reporter: Pac A. He


The {{pyarrow.parquet.read_pandas}} 
[implementation|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L1740-L1754]
 doesn't conform to its 
[docs|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_pandas.html#pyarrow.parquet.read_pandas]
 in at least these ways:
 # The docs state that a {{filesystem}} option can be provided, as it should 
be. The implementation, however, doesn't have this option, so I currently 
cannot use it to read from S3, etc.
 # The docs state that the default value for {{use_legacy_dataset}} is False, 
whereas the implementation uses True.

It looks to have been implemented and reviewed pretty carelessly.
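
For concreteness, this is the kind of call the documented signature suggests should work (bucket path and region are placeholders); with pyarrow 3.0.0 it fails because {{read_pandas()}} does not accept a {{filesystem}} argument:

{code:python}
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")
# Documented, but rejected by the current implementation.
table = pq.read_pandas("my-bucket/path/file.parquet", filesystem=s3)
{code}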

 





[jira] [Created] (ARROW-11463) Allow configuration of IpcWriteOptions 64Bit from PyArrow

2021-02-01 Thread Leonard Lausen (Jira)
Leonard Lausen created ARROW-11463:
--

 Summary: Allow configuration of IpcWriteOptions 64Bit from PyArrow
 Key: ARROW-11463
 URL: https://issues.apache.org/jira/browse/ARROW-11463
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Leonard Lausen


For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will 
be around 1000x slower than `pyarrow.Table.take` on the same table with 
combined chunks (1 chunk). Unfortunately, if such a table contains a large list 
data type, it's easy for the flattened table to contain more than 2**31 rows, 
and serialization (e.g. for the Plasma store) will then fail with 
`pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in 
length`.

I couldn't find a way to enable 64bit support for the serialization as called 
from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 64 
bit setting; further the Python serialization APIs do not allow specification 
of IpcWriteOptions)
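
For illustration, this is roughly the kind of Python-level switch being asked for (the {{allow_64bit}} keyword below is hypothetical; current pyarrow does not expose it):

{code:python}
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"x": list(range(10))})  # stand-in for the real chunked table

# Hypothetical: expose the C++ IpcWriteOptions::allow_64bit flag in Python.
options = ipc.IpcWriteOptions(allow_64bit=True)
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)
{code}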

I was able to serialize successfully after changing the default and rebuilding:

```
modified   cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit 
int.
   ///
   /// Some implementations may not be able to parse streams created with this 
option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;
 
   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
```





[jira] [Created] (ARROW-11462) [Developer] Remove needless quote from the default DOCKER_VOLUME_PREFIX

2021-02-01 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11462:


 Summary: [Developer] Remove needless quote from the default 
DOCKER_VOLUME_PREFIX
 Key: ARROW-11462
 URL: https://issues.apache.org/jira/browse/ARROW-11462
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-11461) [Flight][Go] GetSchema does not work with Java Flight Server

2021-02-01 Thread Matt Topol (Jira)
Matt Topol created ARROW-11461:
--

 Summary: [Flight][Go] GetSchema does not work with Java Flight 
Server
 Key: ARROW-11461
 URL: https://issues.apache.org/jira/browse/ARROW-11461
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Go
Reporter: Matt Topol
Assignee: Matt Topol


Despite the fact that Flight.proto says the following:

> "schema of the dataset as described in Schema.fbs::Schema."

the implementations seem to use a fully serialized RecordBatch (just with 0 
rows) for the schema byte fields in GetFlightInfo and GetSchema, so the Go 
implementation should follow suit.
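
For comparison, here is a small pyarrow sketch (not the Go code in question) of the framing those implementations appear to use: the schema bytes are an encapsulated IPC message, the same framing a record batch stream starts with, rather than a bare Schema.fbs flatbuffer.

{code:python}
import pyarrow as pa
import pyarrow.ipc as ipc

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# Schema.serialize() produces the encapsulated IPC schema message,
# which round-trips through read_schema().
buf = schema.serialize()
print(ipc.read_schema(buf))
{code}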





[jira] [Created] (ARROW-11460) [R] Use system compression libraries if present on Linux

2021-02-01 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11460:
---

 Summary: [R] Use system compression libraries if present on Linux
 Key: ARROW-11460
 URL: https://issues.apache.org/jira/browse/ARROW-11460
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson


We vendor/bundle all compression libraries and have them disabled in the 
default build. This is reliable, but it would be nice to use system libraries 
if they're present. 

It's not as simple as setting {{ARROW_DEPENDENCY_SOURCE=AUTO}} because we have 
to know if we're using them in order to set the right `-lwhatever` flags in the 
R package build. Maybe these can be determined from the C++ build/cmake output 
rather than detected outside the build (but this may require ARROW-6312).





[jira] [Created] (ARROW-11459) [Rust] Allow ListArray of primitives to be built from iterator

2021-02-01 Thread Jira
Jorge Leitão created ARROW-11459:


 Summary: [Rust] Allow ListArray of primitives to be built from 
iterator
 Key: ARROW-11459
 URL: https://issues.apache.org/jira/browse/ARROW-11459
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão








[jira] [Created] (ARROW-11458) PyArrow 1.x and 2.x do not work with numpy 1.20

2021-02-01 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-11458:
-

 Summary: PyArrow 1.x and 2.x do not work with numpy 1.20
 Key: ARROW-11458
 URL: https://issues.apache.org/jira/browse/ARROW-11458
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0, 1.0.1, 1.0.0
Reporter: Zhuo Peng


NumPy 1.20 was released on 1/30, and it is not compatible with libraries that 
were built against numpy < 1.16.6, which is the case for pyarrow 1.x and 2.x. 
However, pyarrow does not specify an upper bound for the numpy version [1].

```
Python 3.7.9 (default, Oct 30 2020, 13:50:59)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import numpy as np
>>> np.__version__
'1.20.0'
>>> pa.__version__
'2.0.0'
>>> pa.array(np.arange(10))
Traceback (most recent call last):
 File "", line 1, in 
 File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
 File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
 File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
 File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
```

 

[1] 
https://github.com/apache/arrow/blob/478286658055bb91737394c2065b92a7e92fb0c1/python/setup.py#L572

 

 





[jira] [Created] (ARROW-11457) [Rust] Make string comparison kernels generic over Utf8 and LargeUtf8

2021-02-01 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-11457:
---

 Summary: [Rust] Make string comparison kernels generic over Utf8 and LargeUtf8
 Key: ARROW-11457
 URL: https://issues.apache.org/jira/browse/ARROW-11457
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb
Assignee: Ritchie








[jira] [Created] (ARROW-11456) OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements

2021-02-01 Thread Pac A. He (Jira)
Pac A. He created ARROW-11456:
-

 Summary: OSError: Capacity error: BinaryBuilder cannot reserve 
space for more than 2147483646 child elements
 Key: ARROW-11456
 URL: https://issues.apache.org/jira/browse/ARROW-11456
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 3.0.0, 2.0.0
 Environment: pyarrow 3.0.0 / 2.0.0
pandas 1.2.1
Reporter: Pac A. He


When reading a large parquet file, I get this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquet files? It let me write this file, 
but now it doesn't let me read it back. I don't understand why arrow uses 
[32-bit 
computing|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] in 
a 64-bit world.
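
A possible workaround (an untested sketch; it assumes the file has multiple row groups, and whether it helps depends on how the file was written) is to read row group by row group so that no single array has to hold more than 2^31 - 1 binary values, keeping the result chunked:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("input_file.parquet")  # placeholder path
tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
table = pa.concat_tables(tables)
df = table.to_pandas()
{code}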

 





[jira] [Created] (ARROW-11455) [R] Improve handling of -2^31 in 32-bit integer fields

2021-02-01 Thread Ian Cook (Jira)
Ian Cook created ARROW-11455:


 Summary: [R] Improve handling of -2^31 in 32-bit integer fields
 Key: ARROW-11455
 URL: https://issues.apache.org/jira/browse/ARROW-11455
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 3.0.0
Reporter: Ian Cook
Assignee: Ian Cook


R’s {{integer}} range is 1 smaller than the normal 32-bit integer range of C++, 
Java, etc. In R, it’s {{-2^31 + 1}} to {{2^31 - 1}}. Elsewhere, it’s {{-2^31}} 
to {{2^31 - 1}}. So R's native {{integer}} type cannot represent {{-2^31}} 
({{-2147483648}}).

If you run {{-2147483648L}} in R, it converts it to {{numeric}} and issues a 
warning:
{code:java}
Warning message:
non-integer value 2147483648L qualified with L; using numeric value 
{code}
In the {{arrow}} R package, when a 32-bit integer Arrow field containing the 
value {{-2147483648}} is converted to an R {{integer}} vector, the value is 
silently converted to {{NA_integer_}}. Consider whether we should handle this 
case differently and whether it is feasible to do so without performance 
regressions. Other possible behaviors might be:
 * Converting the value to {{NA_integer_}} with a warning
 * Converting the field to {{bit64::integer64}} with a warning
 * Converting the field to {{base::numeric}} with a warning
 * Allowing the user to specify an argument or option to control the behavior





[jira] [Created] (ARROW-11454) [Website] [Rust] 3.0.0 Blog Post

2021-02-01 Thread Andy Grove (Jira)
Andy Grove created ARROW-11454:
--

 Summary: [Website] [Rust] 3.0.0 Blog Post
 Key: ARROW-11454
 URL: https://issues.apache.org/jira/browse/ARROW-11454
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Website
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0








[jira] [Created] (ARROW-11453) [Python] [Dataset] Unable to use write_dataset() to Azure Blob with adlfs 0.6.0

2021-02-01 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-11453:
---

 Summary: [Python] [Dataset] Unable to use write_dataset() to Azure 
Blob with adlfs 0.6.0
 Key: ARROW-11453
 URL: https://issues.apache.org/jira/browse/ARROW-11453
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 3.0.0
 Environment: This environment results in an error:

adlfs v0.6.0
fsspec 0.8.5
azure.storage.blob 12.6.0
adal 1.2.6
pandas 1.2.1
pyarrow 3.0.0
Reporter: Lance Dacey


https://github.com/dask/adlfs/issues/171

I am unable to save data to Azure Blob using ds.write_dataset() with pyarrow 
3.0 and adlfs 0.6.0. Reverting to adlfs 0.5.9 fixes the issue, but I am not 
sure what the cause is; I am posting this here in case there were recent 
filesystem changes in pyarrow that are incompatible with changes made in adlfs.
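
For reference, a minimal sketch of the kind of call that triggers this (account details and paths are placeholders; the target container "dev" already exists):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", account_key="...")
table = pa.table({"a": [1, 2, 3]})

# Fails with adlfs 0.6.0 (FileExistsError from _mkdir); works with 0.5.9.
ds.write_dataset(table, "dev/output", format="parquet", filesystem=fs)
{code}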



{code:java}
  File "pyarrow/_dataset.pyx", line 2343, in 
pyarrow._dataset._filesystemdataset_write
  File "pyarrow/_fs.pyx", line 1032, in pyarrow._fs._cb_create_dir
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py", line 259, in 
create_dir
self.fs.mkdir(path, create_parents=recursive)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in 
wrapper
return maybe_sync(func, self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in 
maybe_sync
return sync(loop, func, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
result[0] = await future
  File "/opt/conda/lib/python3.8/site-packages/adlfs/spec.py", line 1033, in 
_mkdir
raise FileExistsError(
FileExistsError: Cannot overwrite existing Azure container -- dev already 
exists.  
{code}



