[jira] [Created] (ARROW-10936) [Python] support dateutil timezones

2020-12-15 Thread Eric Du (Jira)
Eric Du created ARROW-10936:
---

 Summary: [Python] support dateutil timezones 
 Key: ARROW-10936
 URL: https://issues.apache.org/jira/browse/ARROW-10936
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Affects Versions: 2.0.0
Reporter: Eric Du


Here are two main reasons to support dateutil timezones:
 # As of Python 3.6, the [tzinfo documentation|https://docs.python.org/3/library/datetime.html#tzinfo-objects] recommends {{dateutil.tz}} rather than {{pytz}} as an IANA time zone provider.
 # Pandas supports dateutil timezones.

Converting a pandas DataFrame that uses a dateutil timezone currently fails with an error.

Below is a code sample:
{code:python}
import dateutil.tz
import pandas as pd

tz = dateutil.tz.gettz('Asia/Shanghai')
df = pd.DataFrame({'a': list(range(1, 4)),
                   'b': pd.date_range('20130101', periods=3, tz=tz)})
df.to_feather('df.feather')
{code}
Errors:
{code:java}
ArrowInvalid: ('Object returned by tzinfo.utcoffset(None) is not an instance of 
datetime.timedelta', "Conversion failed for column b with type datetime64[ns, 
tzfile('/usr/share/zoneinfo/Asia/Shanghai')]"){code}
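As a point of comparison, the same DataFrame built with a pytz timezone writes fine. A minimal workaround sketch, assuming the caller can substitute pytz until dateutil timezones are supported:
{code:python}
import pandas as pd
import pytz

# Workaround sketch (assumption: pytz is an acceptable substitute for
# dateutil here; pytz-based timezones already convert cleanly).
tz = pytz.timezone('Asia/Shanghai')
df = pd.DataFrame({'a': list(range(1, 4)),
                   'b': pd.date_range('20130101', periods=3, tz=tz)})
df.to_feather('df.feather')  # succeeds with a pytz tzinfo
{code}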
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10935) [Python] pa.array() doesn't support pa.lib.TimestampScalar objects

2020-12-15 Thread slatebit (Jira)
slatebit created ARROW-10935:


 Summary: [Python] pa.array() doesn't support 
pa.lib.TimestampScalar objects
 Key: ARROW-10935
 URL: https://issues.apache.org/jira/browse/ARROW-10935
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
 Environment: Windows 10, Python 3.7.4, PyArrow 2.0.0
Reporter: slatebit


I encountered this edge-case bug in PyArrow v2.0.0. For some reason, pa.array() 
does not know how to handle pa.lib.TimestampScalar objects. This bug completely 
blocks my specific use case, although I recognize that this edge case seems 
kind of wonky. Nonetheless, I don't see any reason why PyArrow would not 
understand one of its own object types.

 

Stacktrace:
{code:java}
ArrowInvalid: Could not convert 2020-11-04 22:50:16.276892 with type 
pyarrow.lib.TimestampScalar: did not recognize Python value type when inferring 
an Arrow data type
{code}
 

Reproducible Code:
{code:python}
import pandas as pd
import pyarrow as pa

pa.array([pa.scalar(pd.to_datetime('2020-11-04 22:50:16.276892000'))])
{code}
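Until pa.array() accepts scalars directly, a workaround sketch (assuming a round-trip through plain Python objects is acceptable) is to unwrap the scalar with {{as_py()}}:
{code:python}
import pandas as pd
import pyarrow as pa

# Workaround sketch: convert the TimestampScalar back into a Python
# value before building the array (assumes the as_py() round-trip is
# acceptable for the use case).
s = pa.scalar(pd.to_datetime('2020-11-04 22:50:16.276892000'))
arr = pa.array([s.as_py()])
print(arr)
{code}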



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10934) [Python] Tests fail with fsspec-0.8.5

2020-12-15 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10934:


 Summary: [Python] Tests fail with fsspec-0.8.5
 Key: ARROW-10934
 URL: https://issues.apache.org/jira/browse/ARROW-10934
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Kouhei Sutou
 Fix For: 3.0.0


https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/36851219/job/lwywl76d82coawpd?fullLog=true#L2284

{noformat}
== FAILURES ===
_ 
test_get_file_info_with_selector[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))]
 _
fs = 
pathfn = . at 0x0140F4BFBB58>
def test_get_file_info_with_selector(fs, pathfn):
base_dir = pathfn('selector-dir/')
file_a = pathfn('selector-dir/test_file_a')
file_b = pathfn('selector-dir/test_file_b')
dir_a = pathfn('selector-dir/test_dir_a')
file_c = pathfn('selector-dir/test_dir_a/test_file_c')
dir_b = pathfn('selector-dir/test_dir_b')

try:
fs.create_dir(base_dir)
with fs.open_output_stream(file_a):
pass
with fs.open_output_stream(file_b):
pass
fs.create_dir(dir_a)
with fs.open_output_stream(file_c):
pass
fs.create_dir(dir_b)

# recursive selector
selector = FileSelector(base_dir, allow_not_found=False,
recursive=True)
assert selector.base_dir == base_dir

infos = fs.get_file_info(selector)
if fs.type_name == "py::fsspec+s3":
# s3fs only lists directories if they are not empty
assert len(infos) == 4
else:
assert len(infos) == 5

for info in infos:
if (info.path.endswith(file_a) or info.path.endswith(file_b) or
info.path.endswith(file_c)):
assert info.type == FileType.File
elif (info.path.rstrip("/").endswith(dir_a) or
  info.path.rstrip("/").endswith(dir_b)):
assert info.type == FileType.Directory
else:
raise ValueError('unexpected path {}'.format(info.path))
check_mtime_or_absent(info)

# non-recursive selector -> not selecting the nested file_c
selector = FileSelector(base_dir, recursive=False)

infos = fs.get_file_info(selector)
if fs.type_name == "py::fsspec+s3":
# s3fs only lists directories if they are not empty
assert len(infos) == 3
else:
assert len(infos) == 4

finally:
>   fs.delete_dir(base_dir)
pyarrow\tests\test_fs.py:716: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow\_fs.pyx:472: in pyarrow._fs.FileSystem.delete_dir
check_status(self.fs.DeleteDir(directory))
pyarrow\_fs.pyx:1035: in pyarrow._fs._cb_delete_dir
handler.delete_dir(frombytes(path))
pyarrow\fs.py:262: in delete_dir
self.fs.rm(path, recursive=True)
C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:176:
 in rm
self.rm_file(p)
C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:840: in rm_file
self._rm(path)
C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:163:
 in _rm
self.rmdir(path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = 
path = 'selector-dir'
def rmdir(self, path):
path = path.rstrip("/")
if path in self.pseudo_dirs:
if not self.ls(path):
self.pseudo_dirs.remove(path)
else:
>   raise OSError(ENOTEMPTY, "Directory not empty", path)
E   OSError: [Errno 41] Directory not empty: 'selector-dir'
C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:110:
 OSError
__ test_delete_dir[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))] __
fs = 
pathfn = . at 0x0140F50BC738>
def test_delete_dir(fs, pathfn):
skip_fsspec_s3fs(fs)

d = pathfn('directory/')
nd = pathfn('directory/nested/')

fs.create_dir(nd)
>   fs.delete_dir(d)
pyarrow\tests\test_fs.py:743: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow\_fs.pyx:472: in pyarrow._fs.FileSystem.delete_dir
check_status(self.fs.DeleteDir(directory))
pyarrow\_fs.pyx:1035: in pyarrow._fs._cb_delete_dir
handler.delete_dir(frombytes(path))
pyarrow\fs.py:262: in delete_dir
self.fs.rm(path, recursive=True)
C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:176:
 in rm
self.rm_file(p)
C:\Miniconda

[jira] [Created] (ARROW-10933) [Rust] Update docs in regard to stable rust

2020-12-15 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10933:
---

 Summary: [Rust] Update docs in regard to stable rust
 Key: ARROW-10933
 URL: https://issues.apache.org/jira/browse/ARROW-10933
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andrew Lamb


Update the docs to reflect the changes from 
https://github.com/apache/arrow/pull/8698



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10932) [C++] BinaryMemoTable::CopyOffsets accesses out-of-bounds address when data is empty

2020-12-15 Thread Jimmy Lu (Jira)
Jimmy Lu created ARROW-10932:


 Summary: [C++] BinaryMemoTable::CopyOffsets accesses out-of-bounds 
address when data is empty
 Key: ARROW-10932
 URL: https://issues.apache.org/jira/browse/ARROW-10932
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 2.0.0, 1.0.1, 1.0.0
Reporter: Jimmy Lu


In 
[BinaryMemoTable::CopyOffsets|https://github.com/apache/arrow/blob/apache-arrow-2.0.0/cpp/src/arrow/util/hashing.h#L693],
 if there have been no previous calls to insert data, {{offsets[start]}} will 
access an out-of-bounds address and cause undefined behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10931) [Rust] [Parquet] Improve performance of the parquet compressors

2020-12-15 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10931:
---

 Summary: [Rust] [Parquet] Improve performance of the parquet 
compressors
 Key: ARROW-10931
 URL: https://issues.apache.org/jira/browse/ARROW-10931
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb


As part of moving to stable Rust (ARROW-10636), we lost some amount of 
performance in the parquet compressors. The move to stable Rust was deemed 
worthwhile, but [~gbowyer] thinks there are additional changes that could be 
made to improve the compressors.

More detail can be found here:
https://github.com/apache/arrow/pull/8698#issuecomment-740958408



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10930) In pyarrow, LargeListArray doesn't have a value_field

2020-12-15 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-10930:


 Summary: In pyarrow, LargeListArray doesn't have a value_field
 Key: ARROW-10930
 URL: https://issues.apache.org/jira/browse/ARROW-10930
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
Reporter: Jim Pivarski


This one is easy: it looks like the LargeListType is just missing this field. 
Here it is for a 32-bit list (the reason I want this is to get at the 
"nullable" field, although the "metadata" would be nice, too):
{code:python}
>>> import pyarrow as pa
>>> small_array = pa.ListArray.from_arrays(pa.array([0, 3, 3, 5]),
...                                        pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
>>> small_array.type.value_field
pyarrow.Field<item: double>
>>> small_array.type.value_field.nullable
True
{code}
Now with a large list:
{code:python}
>>> large_array = pa.LargeListArray.from_arrays(pa.array([0, 3, 3, 5]),
...                                             pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
>>> large_array.type.value_field
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.LargeListType' object has no attribute 'value_field'
{code}
Verifying version:
{code:python}
>>> pa.__version__
'2.0.0'
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10929) [Rust] Migrate CI tests to stable rust

2020-12-15 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10929:
---

 Summary: [Rust] Migrate CI tests to stable rust
 Key: ARROW-10929
 URL: https://issues.apache.org/jira/browse/ARROW-10929
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andrew Lamb


With the merging of https://github.com/apache/arrow/pull/8698, the parquet 
writer now supports stable Rust, and we should be able to run most of our CI 
checks on stable Rust rather than nightly to ensure no more unstable features 
are added.

[~jorgecarleitao] has started on this -- in particular this patch: 
https://github.com/jorgecarleitao/arrow/commit/ca66d6d945e265dd2c83464bd80ff1dd7d231f7c



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10928) [Python] Unknown error: data type leaf_count mismatch

2020-12-15 Thread Lucas da Silva Abreu (Jira)
Lucas da Silva Abreu created ARROW-10928:


 Summary: [Python] Unknown error: data type leaf_count mismatch
 Key: ARROW-10928
 URL: https://issues.apache.org/jira/browse/ARROW-10928
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
 Environment: ubuntu 18.04
Reporter: Lucas da Silva Abreu


I was trying to write some dataframes to parquet using {{snappy}} compression 
with the command {{df.to_parquet('my-parquet', compression='snappy')}}.

But I got the following error:

{{Unknown error: data type leaf_count != builder_leaf_count 9 8}}

By manually sampling columns, I found that a column containing a list of dicts 
was causing the issue.

A toy example that reproduces the error is shown below:
{code:python}
import pandas as pd

df2 = pd.DataFrame(
    [[
        [{'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21',
                         'my_field_22': 1,
                         'my_field_23': 1,
                         'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31',
                         'my_field_32': 1,
                         'my_field_33': 1,
                         'my_field_34': 1}},
         {'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21',
                         'my_field_22': 1,
                         'my_field_23': 1,
                         'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31',
                         'my_field_32': 1,
                         'my_field_33': 1,
                         'my_field_34': 1}}]
    ]], columns=['my_column'])
df2['toy_column_1'] = 1
df2['toy_column_2'] = 'ab'
df2.to_parquet('my-parquet', compression='snappy')  # raises the error above
{code}
Current configuration of my pandas is:
INSTALLED VERSIONS
--
commit   : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python   : 3.6.9.final.0
python-bits  : 64
OS   : Linux
OS-release   : 4.15.0-126-generic
Version  : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
machine  : x86_64
processor: x86_64
byteorder: little
LC_ALL   : None
LANG : en_US.UTF-8
LOCALE   : pt_BR.UTF-8
pandas   : 1.1.4
numpy: 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip  : 20.3
setuptools   : 41.2.0
Cython   : None
pytest   : 5.1.1
hypothesis   : None
sphinx   : None
blosc: None
feather  : None
xlsxwriter   : None
lxml.etree   : None
html5lib : None
pymysql  : 0.10.1
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2   : 2.11.2
IPython  : 7.16.1
pandas_datareader: None
bs4  : None
bottleneck   : None
fsspec   : None
fastparquet  : 0.4.1
gcsfs: None
matplotlib   : 3.3.2
numexpr  : None
odfpy: None
openpyxl : None
pandas_gbq   : 0.10.0
pyarrow  : 2.0.0
pytables : None
pyxlsb   : None
s3fs : None
scipy: 1.5.2
sqlalchemy   : 1.3.18
tables   : None
tabulate : 0.8.7
xarray   : None
xlrd : None
xlwt : None
numba: 0.52.0
 
I have found this issue in pandas 
([https://github.com/pandas-dev/pandas/issues/34643]) that at first seemed to 
have the same root cause, but I noticed I was already using the pandas version 
referenced there, and the example from the original issue worked fine for me.

Could someone please help me?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10927) Add Decimal to ArrayBuilderReader for physical type fixed size binary

2020-12-15 Thread Jira
Florian Müller created ARROW-10927:
--

 Summary: Add Decimal to ArrayBuilderReader for physical type fixed 
size binary
 Key: ARROW-10927
 URL: https://issues.apache.org/jira/browse/ARROW-10927
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Florian Müller






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10926) Add parquet reader / writer for decimal types

2020-12-15 Thread Jira
Florian Müller created ARROW-10926:
--

 Summary: Add parquet reader / writer for decimal types
 Key: ARROW-10926
 URL: https://issues.apache.org/jira/browse/ARROW-10926
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Florian Müller


Decimal values stored physically as, e.g., fixed size binary should be 
represented by a DecimalArray when the logical type indicates decimal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10925) [Rust] Validate temporal data that has restrictions

2020-12-15 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10925:
--

 Summary: [Rust] Validate temporal data that has restrictions
 Key: ARROW-10925
 URL: https://issues.apache.org/jira/browse/ARROW-10925
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


Some temporal data types have restrictions (e.g. date64 values should be a 
multiple of 86,400,000, the number of milliseconds in a day). We should 
validate these restrictions when creating the arrays.
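A minimal sketch of the date64 restriction (illustrative Python, not the actual Rust validation API):
{code:python}
# Illustrative check only; the real validation would live in the Rust
# array constructors.
MILLIS_PER_DAY = 86_400_000

def is_valid_date64(value_ms: int) -> bool:
    # A date64 value encodes a date as milliseconds since the UNIX
    # epoch and must land exactly on a day boundary.
    return value_ms % MILLIS_PER_DAY == 0

assert is_valid_date64(0)                # 1970-01-01
assert is_valid_date64(86_400_000)       # 1970-01-02
assert not is_valid_date64(1)            # mid-day offset: invalid
{code}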



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10924) [C++] Validate temporal data in ValidateArrayFull

2020-12-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10924:
--

 Summary: [C++] Validate temporal data in ValidateArrayFull
 Key: ARROW-10924
 URL: https://issues.apache.org/jira/browse/ARROW-10924
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou


Some temporal data types have restrictions on range or precision of values. We 
should check for those restrictions in ValidateArrayFull.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10923) Failure to read parquet from s3 after copy of s3-object to new s3-key

2020-12-15 Thread Darren Weber (Jira)
Darren Weber created ARROW-10923:


 Summary: Failure to read parquet from s3 after copy of s3-object 
to new s3-key
 Key: ARROW-10923
 URL: https://issues.apache.org/jira/browse/ARROW-10923
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Darren Weber


After a parquet file was copied to an s3-bucket and s3-key, pyarrow does not 
read it from s3. The desired behavior is that a parquet s3-object should be 
self-contained: it should not depend on or track any substantial metadata about 
the storage engine or file-system location it was saved to in a way that 
prevents relocating the object. To replicate the problem, save any parquet 
file on a Linux file system (ext4), use the aws-cli to copy that file to any 
s3-object, and then try to use geopandas.read_parquet to load that s3-object; 
the read fails with the traceback below.
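A sketch of the repro steps (the bucket and key names here are placeholders, not from the original report):

```
# Hypothetical repro sketch; bucket/key names are placeholders.
# 1. Save any parquet file locally, e.g. df.to_parquet("local.parquet").
# 2. Copy it with the aws-cli:
#      aws s3 cp local.parquet s3://my-bucket/data.parquet
# 3. Reading it back through geopandas/pyarrow then fails:
import geopandas
gdf = geopandas.read_parquet("s3://my-bucket/data.parquet")
```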

```
File 
"/opt/conda/envs/project/lib/python3.7/site-packages/geopandas/io/arrow.py", 
line 404, in _read_parquet
 table = parquet.read_table(path, columns=columns, **kwargs)
 File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/parquet.py", 
line 1573, in read_table
 ignore_prefixes=ignore_prefixes,
 File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/parquet.py", 
line 1434, in __init__
 ignore_prefixes=ignore_prefixes)
 File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/dataset.py", 
line 667, in dataset
 return _filesystem_dataset(source, **kwargs)
 File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/dataset.py", 
line 424, in _filesystem_dataset
 fs, paths_or_selector = _ensure_single_source(source, filesystem)
 File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/dataset.py", 
line 391, in _ensure_single_source
 file_info = filesystem.get_file_info([path])[0]
 File "pyarrow/_fs.pyx", line 429, in pyarrow._fs.FileSystem.get_file_info
 File "pyarrow/error.pxi", line 122, in 
pyarrow.lib.pyarrow_internal_check_status
 File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
```




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10922) [C++] Test utility PrintArrayDiff prints a different style when arrays differ in length

2020-12-15 Thread Weston Pace (Jira)
Weston Pace created ARROW-10922:
---

 Summary: [C++] Test utility PrintArrayDiff prints a different 
style when arrays differ in length
 Key: ARROW-10922
 URL: https://issues.apache.org/jira/browse/ARROW-10922
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


For example, when comparing (expected) `[1, 2, 3]` with (actual) `[1, 2]`, I 
would expect something like:

Unequal at absolute position 2
Expected:
- 2
Actual:
-

Instead, the message is "Expected length 3 but was actually 2".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10921) `TypeError: 'coroutine' object is not iterable` when reading parquet partitions via s3fs >= 0.5 with pyarrow

2020-12-15 Thread Ivan Necas (Jira)
Ivan Necas created ARROW-10921:
--

 Summary: `TypeError: 'coroutine' object is not iterable` when 
reading parquet partitions via s3fs >= 0.5 with pyarrow
 Key: ARROW-10921
 URL: https://issues.apache.org/jira/browse/ARROW-10921
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Ivan Necas


Trying to read partitioned parquet files using updated s3fs {{>= 0.5}} (used 
via {{dask}}), I got this error:

 
{code:python}
410 files = set()
 411 
--> 412 for key in list(self.fs._ls(path, refresh=refresh)):
 413 path = key['Key']
 414 if key['StorageClass'] == 'DIRECTORY':

TypeError: 'coroutine' object is not iterable

{code}
coming from 
[https://github.com/apache/arrow/blob/9baa123ea38ee9cc1d3a90cfc9347239cd28064c/python/pyarrow/filesystem.py#L415]
 

 

Seems related to switching s3fs to asyncio in 
[https://github.com/dask/s3fs/pull/336].
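For reference, a minimal stand-in that reproduces the same failure mode (hypothetical code, not pyarrow's or s3fs's actual implementation):
{code:python}
# s3fs >= 0.5 turned _ls into a coroutine, while the legacy pyarrow
# wrapper still calls it synchronously. A hypothetical stand-in:
class FakeS3FS:
    async def _ls(self, path, refresh=False):
        return [{'Key': path, 'StorageClass': 'STANDARD'}]

fs = FakeS3FS()
# Raises: TypeError: 'coroutine' object is not iterable
for key in list(fs._ls('bucket/path')):
    pass
{code}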



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10920) [Rust] Segmentation fault in Arrow Parquet writer with huge arrays

2020-12-15 Thread Andy Grove (Jira)
Andy Grove created ARROW-10920:
--

 Summary: [Rust] Segmentation fault in Arrow Parquet writer with 
huge arrays
 Key: ARROW-10920
 URL: https://issues.apache.org/jira/browse/ARROW-10920
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove


I stumbled across this by chance. I am not too surprised that this fails but I 
would expect it to fail gracefully and not with a segmentation fault.

 
{code:java}
use std::fs::File;
use std::sync::Arc;

use arrow::array::StringBuilder;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

use parquet::arrow::ArrowWriter;

fn main() -> Result<()> {
let schema = Schema::new(vec![
Field::new("c0", DataType::Utf8, false),
Field::new("c1", DataType::Utf8, true),
]);
let batch_size = 250;
let repeat_count = 140;
let file = File::create("/tmp/test.parquet")?;
let mut writer = ArrowWriter::try_new(file, Arc::new(schema.clone()), 
None).unwrap();
let mut c0_builder = StringBuilder::new(batch_size);
let mut c1_builder = StringBuilder::new(batch_size);

println!("Start of loop");
for i in 0..batch_size {
let c0_value = format!("{:032}", i);
let c1_value = c0_value.repeat(repeat_count);
c0_builder.append_value(&c0_value)?;
c1_builder.append_value(&c1_value)?;
}

println!("Finish building c0");
let c0 = Arc::new(c0_builder.finish());

println!("Finish building c1");
let c1 = Arc::new(c1_builder.finish());

println!("Creating RecordBatch");
let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![c0, c1])?;

// write the batch to parquet
println!("Writing RecordBatch");
writer.write(&batch).unwrap();

println!("Closing writer");
writer.close().unwrap();

Ok(())
}
{code}
output:
{code:java}
Start of loop
Finish building c0
Finish building c1
Creating RecordBatch
Writing RecordBatch
Segmentation fault (core dumped)
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10919) Wrong values with Table slicing and conversion to/from pandas ExtensionArray

2020-12-15 Thread Adrien Hoarau (Jira)
Adrien Hoarau created ARROW-10919:
-

 Summary: Wrong values with Table slicing and conversion to/from 
pandas ExtensionArray
 Key: ARROW-10919
 URL: https://issues.apache.org/jira/browse/ARROW-10919
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
 Environment: INSTALLED VERSIONS
--
commit   : b5958ee1999e9aead1938c0bba2b674378807b3d
python   : 3.8.6.final.0
python-bits  : 64
OS   : Linux
OS-release   : 5.4.0-58-generic
Version  : #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020
machine  : x86_64
processor: x86_64
byteorder: little
LC_ALL   : None
LANG : en_US.UTF-8
LOCALE   : en_US.UTF-8
pandas   : 1.1.5
numpy: 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip  : 20.2.1
setuptools   : 49.2.1
Cython   : None
pytest   : 5.4.3
hypothesis   : None
sphinx   : None
blosc: None
feather  : None
xlsxwriter   : None
lxml.etree   : None
html5lib : None
pymysql  : None
psycopg2 : None
jinja2   : None
IPython  : None
pandas_datareader: None
bs4  : None
bottleneck   : None
fsspec   : 0.8.4
fastparquet  : None
gcsfs: None
matplotlib   : None
numexpr  : None
odfpy: None
openpyxl : None
pandas_gbq   : None
pyarrow  : 2.0.0
pytables : None
pyxlsb   : None
s3fs : 0.4.2
scipy: None
sqlalchemy   : None
tables   : None
tabulate : None
xarray   : None
xlrd : None
xlwt : None
numba: None

Reporter: Adrien Hoarau
 Attachments: Screenshot from 2020-12-15 13-28-38.png

 
{code:python}
import pandas as pd
from pyarrow import Table

df = pd.DataFrame({'int_na': [0, None, 2, 3, None, 5, 6, None, 8]}, 
dtype=pd.Int64Dtype())
print(df)
{code}
   int_na
0       0
1    <NA>
2       2
3       3
4    <NA>
5       5
6       6
7    <NA>
8       8
{code:python}
Table.from_pandas(df).slice(2, None).to_pandas()
{code}
   int_na
0       2
1    <NA>
2       1
3       5
4    <NA>
5       1
6       8

The expected result of the slice is [2, 3, <NA>, 5, 6, <NA>, 8]; instead both 
the values and the null positions come back wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10918) [C++][Doc] Document supported Parquet features

2020-12-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10918:
--

 Summary: [C++][Doc] Document supported Parquet features
 Key: ARROW-10918
 URL: https://issues.apache.org/jira/browse/ARROW-10918
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Documentation
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 3.0.0


We should document the Parquet features supported by our C++ implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10917) [Rust][Doc] Update feature matrix

2020-12-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10917:
--

 Summary: [Rust][Doc] Update feature matrix
 Key: ARROW-10917
 URL: https://issues.apache.org/jira/browse/ARROW-10917
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation, Rust
Reporter: Antoine Pitrou


The [status 
matrix|https://github.com/apache/arrow/blob/master/docs/source/status.rst] 
should be updated with the latest Rust additions (for example the C data 
interface support).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10916) gapply fails executing with rbind error

2020-12-15 Thread MvR (Jira)
MvR created ARROW-10916:
---

 Summary: gapply fails executing with rbind error
 Key: ARROW-10916
 URL: https://issues.apache.org/jira/browse/ARROW-10916
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 2.0.0
 Environment: Databricks runtime 7.3 LTS ML
Reporter: MvR
 Attachments: Rerror.log

Executing the following code on Databricks runtime 7.3 LTS ML errors out with 
an rbind error, whereas it executes successfully without Arrow enabled in the 
Spark session. Full error message attached.

 

```
library(dplyr)
library(SparkR)

SparkR::sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

mtcars %>%
  SparkR::as.DataFrame() %>%
  SparkR::gapply(x = .,
                 cols = c("cyl", "vs"),
                 func = function(key, data) {
                   dt <- data[, c("mpg", "qsec")]
                   res <- apply(dt, 2, mean)
                   df <- data.frame(firstGroupKey = key[1],
                                    secondGroupKey = key[2],
                                    mean_mpg = res[1],
                                    mean_cyl = res[2])
                   return(df)
                 },
                 schema = structType(structField("cyl", "double"),
                                     structField("vs", "double"),
                                     structField("mpg_mean", "double"),
                                     structField("qsec_mean", "double"))) %>%
  display()
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10915) Make ARROW_TEST_DATA and PARQUET_TEST_DATA absolute dirs

2020-12-15 Thread meng qingyou (Jira)
meng qingyou created ARROW-10915:


 Summary: Make ARROW_TEST_DATA and PARQUET_TEST_DATA absolute dirs
 Key: ARROW-10915
 URL: https://issues.apache.org/jira/browse/ARROW-10915
 Project: Apache Arrow
  Issue Type: Test
  Components: Rust
Reporter: meng qingyou


In rust/README.md, both *ARROW_TEST_DATA* and *PARQUET_TEST_DATA* are set as 
relative paths. The problem is that we may have to reset them back and forth 
when moving between the top directory and subdirectories, which is annoying. 
The obvious solution is to set the environment variables to absolute 
directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10914) [Rust]: SIMD implementation of arithmetic kernels reads out of bounds

2020-12-15 Thread Jira
Jörn Horstmann created ARROW-10914:
--

 Summary: [Rust]: SIMD implementation of arithmetic kernels reads 
out of bounds
 Key: ARROW-10914
 URL: https://issues.apache.org/jira/browse/ARROW-10914
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jörn Horstmann
Assignee: Jörn Horstmann


The simd arithmetic kernels use the following pattern repeatedly:

{code}
for i in (0..left.len()).step_by(lanes) { ... }
{code}

If len is not a multiple of the number of lanes, this reads out of bounds in 
the last iteration. Currently, all buffers have an additional padding of 64 
bytes (equal to the simd width), which masks the problem in most tests. As 
soon as we use a slice of an array, however, it should be reproducible even 
with this padding.
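A stand-in illustration of the stride arithmetic (plain Python, not the Rust kernel):
{code:python}
# With 10 elements and 4 lanes, the chunk starts are [0, 4, 8]; a
# fixed-width load of `lanes` values at start 8 touches indices 8..11,
# past the end of the buffer.
lanes = 4
length = 10
starts = list(range(0, length, lanes))
assert starts == [0, 4, 8]
assert starts[-1] + lanes > length  # the last load overruns the buffer
{code}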

Even without a crash, the issue is detectable with valgrind:

{code}
==31106== Invalid read of size 32
==31106==at 0x1ECEE1: 
arrow::compute::kernels::arithmetic::add::hfded8b2c06cf22de (in 
/home/joernhorstmann/Source/github/apache/arrow/rust/target/release/deps/arrow-205580f93d58d5a9)
==31106==by 0x2650EF: 
arrow::compute::kernels::arithmetic::tests::test_arithmetic_kernel_should_not_rely_on_padding::hacb7c7921dc38e6a
 (in 
/home/joernhorstmann/Source/github/apache/arrow/rust/target/release/deps/arrow-205580f93d58d5a9)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)