[jira] [Created] (ARROW-10936) [Python] support dateutil timezones
Eric Du created ARROW-10936:
------------------------------
Summary: [Python] support dateutil timezones
Key: ARROW-10936
URL: https://issues.apache.org/jira/browse/ARROW-10936
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Affects Versions: 2.0.0
Reporter: Eric Du

There are two main reasons:
# As of Python 3.6, the [tzinfo documentation|https://docs.python.org/3/library/datetime.html#tzinfo-objects] recommends {{dateutil.tz}} rather than {{pytz}} as an IANA time zone provider.
# Pandas supports dateutil timezones, but converting a pandas DataFrame that uses one currently fails with an error. Code sample below:

{code:python}
import dateutil
import pandas as pd

tz = dateutil.tz.gettz('Asia/Shanghai')
df = pd.DataFrame({'a': list(range(1, 4)),
                   'b': pd.date_range('20130101', periods=3, tz=tz)})
df.to_feather('df.feather')
{code}

Error:

{code:java}
ArrowInvalid: ('Object returned by tzinfo.utcoffset(None) is not an instance of datetime.timedelta', "Conversion failed for column b with type datetime64[ns, tzfile('/usr/share/zoneinfo/Asia/Shanghai')]")
{code}
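A possible workaround sketch until dateutil tzinfo objects are supported (assuming the IANA zone name is known): pass the zone name as a string, so pandas resolves it through pytz, which pyarrow already handles.

{code:python}
import pandas as pd

# Using the zone name as a string makes pandas attach a pytz timezone,
# which pyarrow can convert, so the feather write succeeds.
df = pd.DataFrame({'a': list(range(1, 4)),
                   'b': pd.date_range('20130101', periods=3, tz='Asia/Shanghai')})
df.to_feather('df.feather')
{code}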
[jira] [Created] (ARROW-10935) [Python] pa.array() doesn't support pa.lib.TimestampScalar objects
slatebit created ARROW-10935:
------------------------------
Summary: [Python] pa.array() doesn't support pa.lib.TimestampScalar objects
Key: ARROW-10935
URL: https://issues.apache.org/jira/browse/ARROW-10935
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Environment: Windows 10, Python 3.7.4, PyArrow 2.0.0
Reporter: slatebit

I encountered this edge case bug in PyArrow v2.0.0. For some reason, pa.array() does not know how to handle pa.lib.TimestampScalar objects. This bug completely blocks my specific use case, although I recognize that the case seems somewhat unusual. Nonetheless, I don't see any reason why PyArrow would not understand one of its own object types.

Stacktrace:

{code:java}
ArrowInvalid: Could not convert 2020-11-04 22:50:16.276892 with type pyarrow.lib.TimestampScalar: did not recognize Python value type when inferring an Arrow data type
{code}

Reproducible code:

{code:python}
import pandas as pd
import pyarrow as pa

pa.array([pa.scalar(pd.to_datetime('2020-11-04 22:50:16.276892000'))])
{code}
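A possible workaround sketch (not an endorsed API, just what works today): convert each scalar back to a Python object with {{as_py()}} before calling {{pa.array()}}.

{code:python}
import pandas as pd
import pyarrow as pa

scalars = [pa.scalar(pd.to_datetime('2020-11-04 22:50:16.276892000'))]
# as_py() turns each TimestampScalar back into a datetime.datetime,
# which pa.array() can infer a timestamp type from.
arr = pa.array([s.as_py() for s in scalars])
{code}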
[jira] [Created] (ARROW-10934) [Python] Tests fail with fsspec-0.8.5
Kouhei Sutou created ARROW-10934:
------------------------------
Summary: [Python] Tests fail with fsspec-0.8.5
Key: ARROW-10934
URL: https://issues.apache.org/jira/browse/ARROW-10934
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Kouhei Sutou
Fix For: 3.0.0

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/36851219/job/lwywl76d82coawpd?fullLog=true#L2284

{noformat}
== FAILURES ===
_ test_get_file_info_with_selector[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))] _

fs = pathfn = . at 0x0140F4BFBB58>

    def test_get_file_info_with_selector(fs, pathfn):
        base_dir = pathfn('selector-dir/')
        file_a = pathfn('selector-dir/test_file_a')
        file_b = pathfn('selector-dir/test_file_b')
        dir_a = pathfn('selector-dir/test_dir_a')
        file_c = pathfn('selector-dir/test_dir_a/test_file_c')
        dir_b = pathfn('selector-dir/test_dir_b')
        try:
            fs.create_dir(base_dir)
            with fs.open_output_stream(file_a):
                pass
            with fs.open_output_stream(file_b):
                pass
            fs.create_dir(dir_a)
            with fs.open_output_stream(file_c):
                pass
            fs.create_dir(dir_b)

            # recursive selector
            selector = FileSelector(base_dir, allow_not_found=False, recursive=True)
            assert selector.base_dir == base_dir

            infos = fs.get_file_info(selector)
            if fs.type_name == "py::fsspec+s3":
                # s3fs only lists directories if they are not empty
                assert len(infos) == 4
            else:
                assert len(infos) == 5

            for info in infos:
                if (info.path.endswith(file_a) or info.path.endswith(file_b) or
                        info.path.endswith(file_c)):
                    assert info.type == FileType.File
                elif (info.path.rstrip("/").endswith(dir_a) or
                      info.path.rstrip("/").endswith(dir_b)):
                    assert info.type == FileType.Directory
                else:
                    raise ValueError('unexpected path {}'.format(info.path))
                check_mtime_or_absent(info)

            # non-recursive selector -> not selecting the nested file_c
            selector = FileSelector(base_dir, recursive=False)

            infos = fs.get_file_info(selector)
            if fs.type_name == "py::fsspec+s3":
                # s3fs only lists directories if they are not empty
                assert len(infos) == 3
            else:
                assert len(infos) == 4
        finally:
>           fs.delete_dir(base_dir)

pyarrow\tests\test_fs.py:716:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow\_fs.pyx:472: in pyarrow._fs.FileSystem.delete_dir
    check_status(self.fs.DeleteDir(directory))
pyarrow\_fs.pyx:1035: in pyarrow._fs._cb_delete_dir
    handler.delete_dir(frombytes(path))
pyarrow\fs.py:262: in delete_dir
    self.fs.rm(path, recursive=True)
C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:176: in rm
    self.rm_file(p)
C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:840: in rm_file
    self._rm(path)
C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:163: in _rm
    self.rmdir(path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = path = 'selector-dir'

    def rmdir(self, path):
        path = path.rstrip("/")
        if path in self.pseudo_dirs:
            if not self.ls(path):
                self.pseudo_dirs.remove(path)
            else:
>               raise OSError(ENOTEMPTY, "Directory not empty", path)
E               OSError: [Errno 41] Directory not empty: 'selector-dir'

C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:110: OSError
__ test_delete_dir[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))] __

fs = pathfn = . at 0x0140F50BC738>

    def test_delete_dir(fs, pathfn):
        skip_fsspec_s3fs(fs)
        d = pathfn('directory/')
        nd = pathfn('directory/nested/')
        fs.create_dir(nd)
>       fs.delete_dir(d)

pyarrow\tests\test_fs.py:743:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow\_fs.pyx:472: in pyarrow._fs.FileSystem.delete_dir
    check_status(self.fs.DeleteDir(directory))
pyarrow\_fs.pyx:1035: in pyarrow._fs._cb_delete_dir
    handler.delete_dir(frombytes(path))
pyarrow\fs.py:262: in delete_dir
    self.fs.rm(path, recursive=True)
C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:176: in rm
    self.rm_file(p)
C:\Miniconda
{noformat}
[jira] [Created] (ARROW-10933) [Rust] Update docs in regard to stable rust
Andrew Lamb created ARROW-10933:
------------------------------
Summary: [Rust] Update docs in regard to stable rust
Key: ARROW-10933
URL: https://issues.apache.org/jira/browse/ARROW-10933
Project: Apache Arrow
Issue Type: Sub-task
Reporter: Andrew Lamb

Update the docs to include changes after https://github.com/apache/arrow/pull/8698
[jira] [Created] (ARROW-10932) [C++] BinaryMemoTable::CopyOffsets access out-of-bound address when data is empty
Jimmy Lu created ARROW-10932:
------------------------------
Summary: [C++] BinaryMemoTable::CopyOffsets access out-of-bound address when data is empty
Key: ARROW-10932
URL: https://issues.apache.org/jira/browse/ARROW-10932
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 2.0.0, 1.0.1, 1.0.0
Reporter: Jimmy Lu

In [BinaryMemoTable::CopyOffsets|https://github.com/apache/arrow/blob/apache-arrow-2.0.0/cpp/src/arrow/util/hashing.h#L693], if there were no previous calls inserting data, {{offsets[start]}} accesses an out-of-bounds address and causes undefined behavior.
[jira] [Created] (ARROW-10931) [Rust] [Parquet] Improve performance of the parquet compressors
Andrew Lamb created ARROW-10931:
------------------------------
Summary: [Rust] [Parquet] Improve performance of the parquet compressors
Key: ARROW-10931
URL: https://issues.apache.org/jira/browse/ARROW-10931
Project: Apache Arrow
Issue Type: Improvement
Reporter: Andrew Lamb

As part of moving to stable Rust (ARROW-10636), we lost some amount of performance in the parquet compressors. The move to stable rust was deemed worthwhile, but [~gbowyer] thinks there are additional changes that could be made to improve the compressors. More detail can be found here: https://github.com/apache/arrow/pull/8698#issuecomment-740958408
[jira] [Created] (ARROW-10930) In pyarrow, LargeListArray doesn't have a value_field
Jim Pivarski created ARROW-10930:
------------------------------
Summary: In pyarrow, LargeListArray doesn't have a value_field
Key: ARROW-10930
URL: https://issues.apache.org/jira/browse/ARROW-10930
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Reporter: Jim Pivarski

This one is easy: it looks like LargeListType is just missing this field. Here it is for a 32-bit list (the reason I want this is to get at the "nullable" field, although the "metadata" would be nice, too):

{code:python}
>>> import pyarrow as pa
>>> small_array = pa.ListArray.from_arrays(pa.array([0, 3, 3, 5]),
...                                        pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
>>> small_array.type.value_field
pyarrow.Field<item: double>
>>> small_array.type.value_field.nullable
True
{code}

Now with a large list:

{code:python}
>>> large_array = pa.LargeListArray.from_arrays(pa.array([0, 3, 3, 5]),
...                                             pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
>>> large_array.type.value_field
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.LargeListType' object has no attribute 'value_field'
{code}

Verifying version:

{code:python}
>>> pa.__version__
'2.0.0'
{code}
[jira] [Created] (ARROW-10929) [Rust] Migrate CI tests to stable rust
Andrew Lamb created ARROW-10929:
------------------------------
Summary: [Rust] Migrate CI tests to stable rust
Key: ARROW-10929
URL: https://issues.apache.org/jira/browse/ARROW-10929
Project: Apache Arrow
Issue Type: Sub-task
Reporter: Andrew Lamb

With the merging of https://github.com/apache/arrow/pull/8698 the parquet writer now supports stable rust, and we should be able to run most of our CI checks with stable rust rather than nightly to ensure no more unstable features are added.

[~jorgecarleitao] has started on this -- in particular this patch: https://github.com/jorgecarleitao/arrow/commit/ca66d6d945e265dd2c83464bd80ff1dd7d231f7c
[jira] [Created] (ARROW-10928) [Python] Unknown error: data type leaf_count mismatch
Lucas da Silva Abreu created ARROW-10928:
------------------------------
Summary: [Python] Unknown error: data type leaf_count mismatch
Key: ARROW-10928
URL: https://issues.apache.org/jira/browse/ARROW-10928
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Environment: ubuntu 18.04
Reporter: Lucas da Silva Abreu

I was trying to write some dataframes to parquet using {{snappy}} compression with the command {{df.to_parquet('my-parquet', compression='snappy')}}, but I got the following error:

{code:java}
Unknown error: data type leaf_count != builder_leaf_count 9 8
{code}

By manually sampling columns, I found out that a column that is a list of dicts was causing the issue. The toy example below reproduces the error:

{code:python}
df2 = pd.DataFrame(
    [[
        [{'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
                         'my_field_23': 1, 'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31', 'my_field_32': 1,
                         'my_field_33': 1, 'my_field_34': 1}},
         {'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
                         'my_field_23': 1, 'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31', 'my_field_32': 1,
                         'my_field_33': 1, 'my_field_34': 1}}]
    ]],
    columns=['my_column'])
df2['toy_column_1'] = 1
df2['toy_column_2'] = 'ab'
{code}

Current configuration of my pandas:

{noformat}
INSTALLED VERSIONS
------------------
commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-126-generic
Version : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.1.4
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.3
setuptools : 41.2.0
Cython : None
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 0.10.1
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader : None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.10.0
pyarrow : 2.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.18
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : 0.52.0
{noformat}

I have found this pandas issue (https://github.com/pandas-dev/pandas/issues/34643) that at first seemed to have the same root cause, but I noticed I was already using the version from that issue, and the example in the original issue worked fine for me. Could someone please help me?
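For completeness, a sketch of the write call that should trigger the error on the toy frame above ({{toy.parquet}} is a placeholder path):

{code:python}
# Writing the list-of-dicts column through the pyarrow engine should
# raise the leaf_count mismatch error described above.
df2.to_parquet('toy.parquet', engine='pyarrow', compression='snappy')
{code}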
[jira] [Created] (ARROW-10927) Add Decimal to ArrayBuilderReader for physical type fixed size binary
Florian Müller created ARROW-10927:
------------------------------
Summary: Add Decimal to ArrayBuilderReader for physical type fixed size binary
Key: ARROW-10927
URL: https://issues.apache.org/jira/browse/ARROW-10927
Project: Apache Arrow
Issue Type: Sub-task
Components: Rust
Reporter: Florian Müller
[jira] [Created] (ARROW-10926) Add parquet reader / writer for decimal types
Florian Müller created ARROW-10926:
------------------------------
Summary: Add parquet reader / writer for decimal types
Key: ARROW-10926
URL: https://issues.apache.org/jira/browse/ARROW-10926
Project: Apache Arrow
Issue Type: New Feature
Components: Rust
Reporter: Florian Müller

Decimal values, stored physically as e.g. Fixed Size Binary, should be represented by DecimalArray when the logical type indicates decimal.
[jira] [Created] (ARROW-10925) [Rust] Validate temporal data that has restrictions
Neville Dipale created ARROW-10925:
------------------------------
Summary: [Rust] Validate temporal data that has restrictions
Key: ARROW-10925
URL: https://issues.apache.org/jira/browse/ARROW-10925
Project: Apache Arrow
Issue Type: Improvement
Reporter: Neville Dipale

Some temporal data types have restrictions (e.g. a date64 value should be a multiple of 86400000, the number of milliseconds in a day). We should validate them when creating the arrays.
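An illustrative sketch of the date64 restriction (in Python for brevity; the helper name is hypothetical, not part of any Arrow API):

{code:python}
MS_PER_DAY = 86_400_000

def is_valid_date64(ms_since_epoch: int) -> bool:
    # A date64 value stores milliseconds since the UNIX epoch and must
    # land exactly on a day boundary.
    return ms_since_epoch % MS_PER_DAY == 0

assert is_valid_date64(86_400_000)        # exactly 1970-01-02
assert not is_valid_date64(86_400_001)    # 1 ms past midnight: not a valid date64
{code}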
[jira] [Created] (ARROW-10924) [C++] Validate temporal data in ValidateArrayFull
Antoine Pitrou created ARROW-10924:
------------------------------
Summary: [C++] Validate temporal data in ValidateArrayFull
Key: ARROW-10924
URL: https://issues.apache.org/jira/browse/ARROW-10924
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Antoine Pitrou

Some temporal data types have restrictions on the range or precision of values. We should check for those restrictions in ValidateArrayFull.
[jira] [Created] (ARROW-10923) Failure to read parquet from s3 after copy of s3-object to new s3-key
Darren Weber created ARROW-10923:
------------------------------
Summary: Failure to read parquet from s3 after copy of s3-object to new s3-key
Key: ARROW-10923
URL: https://issues.apache.org/jira/browse/ARROW-10923
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Darren Weber

After a parquet file was copied to an s3-bucket and s3-key, pyarrow fails to read it from s3. The desired behavior is that a parquet s3-object should be self-contained: it should not depend on or track any substantial metadata about the storage engine or file system location it was saved to in a way that prevents relocating the object.

To replicate the problem, save any parquet file on a linux file system (ext4), use the aws-cli to copy that file to any s3-object, and then try to use geopandas.read_parquet to load that s3-object.

{code:java}
File "/opt/conda/envs/project/lib/python3.7/site-packages/geopandas/io/arrow.py", line 404, in _read_parquet
  table = parquet.read_table(path, columns=columns, **kwargs)
File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/parquet.py", line 1573, in read_table
  ignore_prefixes=ignore_prefixes,
File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/parquet.py", line 1434, in __init__
  ignore_prefixes=ignore_prefixes)
File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/dataset.py", line 667, in dataset
  return _filesystem_dataset(source, **kwargs)
File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/dataset.py", line 424, in _filesystem_dataset
  fs, paths_or_selector = _ensure_single_source(source, filesystem)
File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/dataset.py", line 391, in _ensure_single_source
  file_info = filesystem.get_file_info([path])[0]
File "pyarrow/_fs.pyx", line 429, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
{code}
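A minimal sketch of the reproduction steps described above (bucket and key names are placeholders; assumes pandas, pyarrow, and the AWS CLI are installed and credentials are configured). {{pq.read_table}} is the call geopandas delegates to in the traceback.

{code:python}
import subprocess
import pandas as pd
import pyarrow.parquet as pq

# 1. Write a parquet file on the local (ext4) file system.
pd.DataFrame({'a': [1, 2, 3]}).to_parquet('local.parquet')

# 2. Copy it to an s3-object with the aws-cli.
subprocess.run(['aws', 's3', 'cp', 'local.parquet',
                's3://my-bucket/copied.parquet'], check=True)

# 3. Reading the copied object back fails as shown in the traceback above.
table = pq.read_table('s3://my-bucket/copied.parquet')
{code}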
[jira] [Created] (ARROW-10922) [C++] Test utility PrintArrayDiff prints a different style when arrays differ in length
Weston Pace created ARROW-10922:
------------------------------
Summary: [C++] Test utility PrintArrayDiff prints a different style when arrays differ in length
Key: ARROW-10922
URL: https://issues.apache.org/jira/browse/ARROW-10922
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace

For example, when comparing (expected) `[1, 2, 3]` with (actual) `[1, 2]` I would expect something like...

{noformat}
Unequal at absolute position 2
Expected:
- 2
Actual:
-
{noformat}

Instead, the message is "Expected length 3 but was actually 2".
[jira] [Created] (ARROW-10921) `TypeError: 'coroutine' object is not iterable` when reading parquet partitions via s3fs >= 0.5 with pyarrow
Ivan Necas created ARROW-10921:
------------------------------
Summary: `TypeError: 'coroutine' object is not iterable` when reading parquet partitions via s3fs >= 0.5 with pyarrow
Key: ARROW-10921
URL: https://issues.apache.org/jira/browse/ARROW-10921
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Ivan Necas

Trying to read partitioned parquet files using an updated s3fs >= 0.5 (used via {{dask}}), I got this error:

{code:python}
    410         files = set()
    411
--> 412         for key in list(self.fs._ls(path, refresh=refresh)):
    413             path = key['Key']
    414             if key['StorageClass'] == 'DIRECTORY':

TypeError: 'coroutine' object is not iterable
{code}

coming from https://github.com/apache/arrow/blob/9baa123ea38ee9cc1d3a90cfc9347239cd28064c/python/pyarrow/filesystem.py#L415

Seems related to switching s3fs to asyncio in https://github.com/dask/s3fs/pull/336.
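The failing code lives in the legacy {{pyarrow.filesystem}} layer. A possible workaround sketch (assuming pyarrow >= 2.0.0; the region and path are placeholders) is to bypass s3fs entirely and use the native {{pyarrow.fs.S3FileSystem}}:

{code:python}
import pyarrow.parquet as pq
from pyarrow import fs

# The native S3 filesystem does not go through s3fs, so it is
# unaffected by the asyncio changes in s3fs >= 0.5.
s3 = fs.S3FileSystem(region='us-east-1')
table = pq.read_table('my-bucket/path/to/dataset', filesystem=s3)
{code}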
[jira] [Created] (ARROW-10920) [Rust] Segmentation fault in Arrow Parquet writer with huge arrays
Andy Grove created ARROW-10920:
------------------------------
Summary: [Rust] Segmentation fault in Arrow Parquet writer with huge arrays
Key: ARROW-10920
URL: https://issues.apache.org/jira/browse/ARROW-10920
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Reporter: Andy Grove

I stumbled across this by chance. I am not too surprised that this fails, but I would expect it to fail gracefully and not with a segmentation fault.

{code:java}
use std::fs::File;
use std::sync::Arc;
use arrow::array::StringBuilder;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<()> {
    let schema = Schema::new(vec![
        Field::new("c0", DataType::Utf8, false),
        Field::new("c1", DataType::Utf8, true),
    ]);

    let batch_size = 250;
    let repeat_count = 140;

    let file = File::create("/tmp/test.parquet")?;
    let mut writer = ArrowWriter::try_new(file, Arc::new(schema.clone()), None).unwrap();

    let mut c0_builder = StringBuilder::new(batch_size);
    let mut c1_builder = StringBuilder::new(batch_size);

    println!("Start of loop");
    for i in 0..batch_size {
        let c0_value = format!("{:032}", i);
        let c1_value = c0_value.repeat(repeat_count);
        c0_builder.append_value(&c0_value)?;
        c1_builder.append_value(&c1_value)?;
    }

    println!("Finish building c0");
    let c0 = Arc::new(c0_builder.finish());
    println!("Finish building c1");
    let c1 = Arc::new(c1_builder.finish());

    println!("Creating RecordBatch");
    let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![c0, c1])?;

    // write the batch to parquet
    println!("Writing RecordBatch");
    writer.write(&batch).unwrap();

    println!("Closing writer");
    writer.close().unwrap();
    Ok(())
}
{code}

output:

{code:java}
Start of loop
Finish building c0
Finish building c1
Creating RecordBatch
Writing RecordBatch
Segmentation fault (core dumped)
{code}
[jira] [Created] (ARROW-10919) Wrong values with Table slicing and conversion to/from pandas ExtensionArray
Adrien Hoarau created ARROW-10919:
------------------------------
Summary: Wrong values with Table slicing and conversion to/from pandas ExtensionArray
Key: ARROW-10919
URL: https://issues.apache.org/jira/browse/ARROW-10919
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Environment:
INSTALLED VERSIONS
------------------
commit : b5958ee1999e9aead1938c0bba2b674378807b3d
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-58-generic
Version : #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.5
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
fsspec : 0.8.4
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 2.0.0
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Reporter: Adrien Hoarau
Attachments: Screenshot from 2020-12-15 13-28-38.png

{code:python}
import pandas as pd
from pyarrow import Table

df = pd.DataFrame({'int_na': [0, None, 2, 3, None, 5, 6, None, 8]},
                  dtype=pd.Int64Dtype())
print(df)
{code}

{noformat}
   int_na
0       0
1    <NA>
2       2
3       3
4    <NA>
5       5
6       6
7    <NA>
8       8
{noformat}

{code:python}
Table.from_pandas(df).slice(2, None).to_pandas()
{code}

{noformat}
   int_na
0       2
1    <NA>
2       1
3       5
4    <NA>
5       1
6       8
{noformat}
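For comparison, a small sketch of the expected result, computed in pandas alone:

{code:python}
# Slicing the original frame directly shows what to_pandas() should
# have returned for the sliced table.
expected = df.iloc[2:].reset_index(drop=True)
print(expected)
#    int_na
# 0       2
# 1       3
# 2    <NA>
# 3       5
# 4       6
# 5    <NA>
# 6       8
{code}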
[jira] [Created] (ARROW-10918) [C++][Doc] Document supported Parquet features
Antoine Pitrou created ARROW-10918:
------------------------------
Summary: [C++][Doc] Document supported Parquet features
Key: ARROW-10918
URL: https://issues.apache.org/jira/browse/ARROW-10918
Project: Apache Arrow
Issue Type: Task
Components: C++, Documentation
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
Fix For: 3.0.0

We should document the Parquet features supported by our C++ implementation.
[jira] [Created] (ARROW-10917) [Rust][Doc] Update feature matrix
Antoine Pitrou created ARROW-10917:
------------------------------
Summary: [Rust][Doc] Update feature matrix
Key: ARROW-10917
URL: https://issues.apache.org/jira/browse/ARROW-10917
Project: Apache Arrow
Issue Type: Task
Components: Documentation, Rust
Reporter: Antoine Pitrou

The [status matrix|https://github.com/apache/arrow/blob/master/docs/source/status.rst] should be updated with the latest Rust additions (for example the C data interface support).
[jira] [Created] (ARROW-10916) gapply fails with rbind error
MvR created ARROW-10916:
------------------------------
Summary: gapply fails with rbind error
Key: ARROW-10916
URL: https://issues.apache.org/jira/browse/ARROW-10916
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 2.0.0
Environment: Databricks runtime 7.3 LTS ML
Reporter: MvR
Attachments: Rerror.log

Executing the following code on Databricks runtime 7.3 LTS ML errors out with an rbind error, whereas it executes successfully without Arrow enabled in the Spark session. The full error message is attached.

{code:r}
library(dplyr)
library(SparkR)

SparkR::sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

mtcars %>%
  SparkR::as.DataFrame() %>%
  SparkR::gapply(x = .,
                 cols = c("cyl", "vs"),
                 func = function(key, data) {
                   dt <- data[, c("mpg", "qsec")]
                   res <- apply(dt, 2, mean)
                   df <- data.frame(firstGroupKey = key[1],
                                    secondGroupKey = key[2],
                                    mean_mpg = res[1],
                                    mean_cyl = res[2])
                   return(df)
                 },
                 schema = structType(structField("cyl", "double"),
                                     structField("vs", "double"),
                                     structField("mpg_mean", "double"),
                                     structField("qsec_mean", "double"))) %>%
  display()
{code}
[jira] [Created] (ARROW-10915) Make ARROW_TEST_DATA and PARQUET_TEST_DATA absolute dirs
meng qingyou created ARROW-10915:
------------------------------
Summary: Make ARROW_TEST_DATA and PARQUET_TEST_DATA absolute dirs
Key: ARROW-10915
URL: https://issues.apache.org/jira/browse/ARROW-10915
Project: Apache Arrow
Issue Type: Test
Components: Rust
Reporter: meng qingyou

In rust/README.md, both *ARROW_TEST_DATA* and *PARQUET_TEST_DATA* are set as relative paths. The problem is that we may have to reset them back and forth when switching between the top directory and subdirectories, which is annoying. The obvious solution is to set these env vars as absolute dirs.
[jira] [Created] (ARROW-10914) [Rust] SIMD implementation of arithmetic kernels reads out of bounds
Jörn Horstmann created ARROW-10914:
------------------------------
Summary: [Rust] SIMD implementation of arithmetic kernels reads out of bounds
Key: ARROW-10914
URL: https://issues.apache.org/jira/browse/ARROW-10914
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Reporter: Jörn Horstmann
Assignee: Jörn Horstmann

The simd arithmetic kernels use the following pattern repeatedly:

{code}
for i in (0..left.len()).step_by(lanes) {
    ...
}
{code}

If len is not a multiple of the number of lanes, this would read out of bounds in the last iteration. Currently, all buffers have an additional padding of 64 bytes (equal to the simd width), which masks this problem in most tests. As soon as we use a slice of an array, it should however be reproducible even with this padding.

Even without a crash, the issue is detectable with valgrind:

{code}
==31106== Invalid read of size 32
==31106==    at 0x1ECEE1: arrow::compute::kernels::arithmetic::add::hfded8b2c06cf22de (in /home/joernhorstmann/Source/github/apache/arrow/rust/target/release/deps/arrow-205580f93d58d5a9)
==31106==    by 0x2650EF: arrow::compute::kernels::arithmetic::tests::test_arithmetic_kernel_should_not_rely_on_padding::hacb7c7921dc38e6a (in /home/joernhorstmann/Source/github/apache/arrow/rust/target/release/deps/arrow-205580f93d58d5a9)
{code}
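To make the out-of-bounds arithmetic concrete, a small illustrative sketch (in Python, with hypothetical lane and length values) of the chunk indices the stepped loop visits:

{code:python}
# With length 10 and 4 SIMD lanes, the last chunk starts at index 8
# and covers indices 8..11 -- two elements past the end of the buffer.
length, lanes = 10, 4
for i in range(0, length, lanes):
    print(i, list(range(i, i + lanes)))
# 0 [0, 1, 2, 3]
# 4 [4, 5, 6, 7]
# 8 [8, 9, 10, 11]   <- indices 10 and 11 are out of bounds
{code}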