[jira] [Created] (ARROW-10494) take silently overflows on list array (when casting to large_list is needed)
Artem KOZHEVNIKOV created ARROW-10494: - Summary: take silently overflows on list array (when casting to large_list is needed) Key: ARROW-10494 URL: https://issues.apache.org/jira/browse/ARROW-10494 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 2.0.0 Reporter: Artem KOZHEVNIKOV A reproducer is below:
{code:python}
import numpy as np
import pyarrow as pa

arr = pa.array([np.arange(x).astype(np.int8) for x in range(6)])
nb_repeat = 2**32 // arr.offsets.to_numpy()[-1]
indices = pa.array(np.repeat(np.arange(len(arr)), nb_repeat))
big_arr = arr.take(indices)
print(big_arr.offsets[-5:])
big_arr.validate()  # hopefully this can catch it

[-21, -16, -11, -6, -1]
---
ArrowInvalid Traceback (most recent call last)
in
6 big_arr = arr.take(indices)
7 print(big_arr.offsets[-5:])
> 8 big_arr.validate()
/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.Array.validate()
/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Negative offsets in list array
{code}
and it works fine with large_list (as expected):
{code:python}
import numpy as np
import pyarrow as pa

arr = pa.array([np.arange(x).astype(np.int8) for x in range(6)], type=pa.large_list(pa.int8()))
nb_repeat = 2**32 // arr.offsets.to_numpy()[-1]
indices = pa.array(np.repeat(np.arange(len(arr)), nb_repeat))
big_arr = arr.take(indices)
print(big_arr.offsets[-5:])
big_arr.validate()

[4294967275, 4294967280, 4294967285, 4294967290, 4294967295]
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10172) concat_arrays requires upcast for large array
Artem KOZHEVNIKOV created ARROW-10172: - Summary: concat_arrays requires upcast for large array Key: ARROW-10172 URL: https://issues.apache.org/jira/browse/ARROW-10172 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 1.0.1 Reporter: Artem KOZHEVNIKOV I'm sorry if this was already reported, but there's an overflow issue in the concatenation of large arrays:
{code:python}
In [1]: import pyarrow as pa
In [2]: str_array = pa.array(['a' * 128] * 10**8)
In [3]: large_array = pa.concat_arrays([str_array] * 50)
Segmentation fault (core dumped)
{code}
I suppose this should be handled by an upcast to large_string. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-7731) [C++][Parquet] Support LargeListArray
[ https://issues.apache.org/jira/browse/ARROW-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027457#comment-17027457 ] Artem KOZHEVNIKOV edited comment on ARROW-7731 at 2/4/20 9:11 AM: -- I found another edge case that's maybe linked to this (pyarrow=0.15.1) {code:python} import pyarrow as pa import pyarrow.parquet as pq l1 = pa.array([list(range(100))] * 10**7, type=pa.list_(pa.int16())) tt = pa.Table.from_pydict(\{'big': pa.chunked_array([l1]*10)}) # if concat, offset will overflow int32 pq.write_table(tt, '/tmp/test.parquet') # that took a while but worked tt_reload = pq.read_table('/tmp/test.parquet') # it consumes a huge amount of memory before failing ArrowInvalid Traceback (most recent call last) in > 1 tt_reload = pq.read_table('/tmp/test.parquet') /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size) 1279 buffer_size=buffer_size) 1280 return pf.read(columns=columns, use_threads=use_threads, -> 1281 use_pandas_metadata=use_pandas_metadata) 1282 1283 /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata) 1135 table = piece.read(columns=columns, use_threads=use_threads, 1136 partitions=self.partitions, -> 1137 use_pandas_metadata=use_pandas_metadata) 1138 tables.append(table) 1139 /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, partitions, file, use_pandas_metadata) 603 table = reader.read_row_group(self.row_group, **options) 604 else: --> 605 table = reader.read(**options) 606 607 if len(self.partition_keys) > 0: /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata) 251 columns, use_pandas_metadata=use_pandas_metadata) 252 return self.reader.read_all(column_indices=column_indices, 
--> 253 use_threads=use_threads) 254 255 def scan_contents(self, columns=None, batch_size=65536): /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all() /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status() ArrowInvalid: Column 0: Offset invariant failure: 21474837 inconsistent offset for non-null slot: -2147483596<2147483600 {code} The thrown error is not explicit. I wonder whether the created parquet file is correct (I have not yet tried to reload it with Spark) or whether it is just the pyarrow reader that does not support it.
[jira] [Commented] (ARROW-7731) [Parquet] Support LargeListArray
[ https://issues.apache.org/jira/browse/ARROW-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027457#comment-17027457 ] Artem KOZHEVNIKOV commented on ARROW-7731: -- I found another edge case that maybe link to this (pyarrow=0.15.1) {code:python} import pyarrow as pa import pyarrow.parquet as pq l1 = pa.array([list(range(100))] * 10**7, type=pa.list_(pa.int16())) tt = pa.Table.from_pydict(\{'big': pa.chunked_array([l1]*10)}) # if concat, offset will overflow int32 pq.write_table(tt, '/tmp/test.parquet') # that took a while but worked tt_reload = pq.read_table('/tmp/test.parquet') # it consumes a huge amount of memory before failing ArrowInvalid Traceback (most recent call last) in > 1 tt_reload = pq.read_table('/tmp/test.parquet') /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size) 1279 buffer_size=buffer_size) 1280 return pf.read(columns=columns, use_threads=use_threads, -> 1281 use_pandas_metadata=use_pandas_metadata) 1282 1283 /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata) 1135 table = piece.read(columns=columns, use_threads=use_threads, 1136 partitions=self.partitions, -> 1137 use_pandas_metadata=use_pandas_metadata) 1138 tables.append(table) 1139 /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, partitions, file, use_pandas_metadata) 603 table = reader.read_row_group(self.row_group, **options) 604 else: --> 605 table = reader.read(**options) 606 607 if len(self.partition_keys) > 0: /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata) 251 columns, use_pandas_metadata=use_pandas_metadata) 252 return self.reader.read_all(column_indices=column_indices, --> 253 
use_threads=use_threads) 254 255 def scan_contents(self, columns=None, batch_size=65536): /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all() /opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status() ArrowInvalid: Column 0: Offset invariant failure: 21474837 inconsistent offset for non-null slot: -2147483596<2147483600 {code} > [Parquet] Support LargeListArray > > > Key: ARROW-7731 > URL: https://issues.apache.org/jira/browse/ARROW-7731 > Project: Apache Arrow > Issue Type: Improvement >Reporter: marc abboud >Priority: Major > > For now it's not possible to write a pyarrow.Table containing a > LargeListArray in parquet. The lines > {code:java} > from pyarrow import parquet > import pyarrow as pa > indices = [1, 2, 3] > indptr = [0, 1, 2, 3] > q = pa.lib.LargeListArray.from_arrays(indptr, indices) > table = pa.Table.from_arrays([q], names=['no']) > parquet.write_table(table, '/test'){code} > yields the error > {code:java} > ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema > conversion: large_list > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
[ https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961079#comment-16961079 ] Artem KOZHEVNIKOV commented on ARROW-7008: -- is > [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers > > > Key: ARROW-7008 > URL: https://issues.apache.org/jira/browse/ARROW-7008 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 >Reporter: Uwe Korn >Priority: Major > > Minimal reproducer: > {code} > import pyarrow as pa > pa.chunked_array([pa.array([], > type=pa.string()).dictionary_encode().dictionary]) > {code} > Traceback > {code} > (lldb) bt > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x20) > * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status > arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94 > frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status > arrow::VisitArrayInline(arrow::Array > const&, arrow::internal::ValidateVisitor*) + 915 > frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() > const + 829 > frame #3: 0x000112e3ea19 > libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89 > frame #4: 0x000112b8eb7d > lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, > _object*, _object*) + 3661 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
[ https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961079#comment-16961079 ] Artem KOZHEVNIKOV edited comment on ARROW-7008 at 10/28/19 2:10 PM: is it the same issue as https://issues.apache.org/jira/browse/ARROW-6857 ? was (Author: artemk): is > [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers > > > Key: ARROW-7008 > URL: https://issues.apache.org/jira/browse/ARROW-7008 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 >Reporter: Uwe Korn >Priority: Major > > Minimal reproducer: > {code} > import pyarrow as pa > pa.chunked_array([pa.array([], > type=pa.string()).dictionary_encode().dictionary]) > {code} > Traceback > {code} > (lldb) bt > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x20) > * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status > arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94 > frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status > arrow::VisitArrayInline(arrow::Array > const&, arrow::internal::ValidateVisitor*) + 915 > frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() > const + 829 > frame #3: 0x000112e3ea19 > libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89 > frame #4: 0x000112b8eb7d > lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, > _object*, _object*) + 3661 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use
[ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951704#comment-16951704 ] Artem KOZHEVNIKOV commented on ARROW-5454: -- [~wesm], could you have a look at this? Apart from the arrow::DataFrame project, in the meantime this feature can be very useful for working with pyarrow data structures via pandas.ExtensionArray. > [C++] Implement Take on ChunkedArray for DataFrame use > -- > > Key: ARROW-5454 > URL: https://issues.apache.org/jira/browse/ARROW-5454 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Follow up to ARROW-2667 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use
[ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951704#comment-16951704 ] Artem KOZHEVNIKOV edited comment on ARROW-5454 at 10/15/19 8:09 AM: [~wesm], could you have a look at this? Apart from the arrow::DataFrame project, in the meantime this feature can be very useful for working with pyarrow data structures via pandas.ExtensionArray. was (Author: artemk): [~wesm], could you have a look on this ? a part a arrow:DataFrame project, in meanwhile this feature can be very useful to work with pyarrow data structures via pandas.ExtenensionArray. > [C++] Implement Take on ChunkedArray for DataFrame use > -- > > Key: ARROW-5454 > URL: https://issues.apache.org/jira/browse/ARROW-5454 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Follow up to ARROW-2667 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result
[ https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem KOZHEVNIKOV updated ARROW-6882: - Description: I've experienced a strange error raise when trying to apply `pa.chunked_array` directly on the indices of dictionary_encoding (code is below). Making a memory view solves the problem. {code:python} import pyarrow as pa ca = pa.array(['a', 'a', 'b', 'b', 'c']) fca = ca.dictionary_encode() fca.indices [ 0, 0, 1, 1, 2 ] pa.chunked_array([fca.indices]) --- ArrowInvalid Traceback (most recent call last) in > 1 pa.chunked_array([fca.indices]) ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.chunked_array() ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status() ArrowInvalid: Unexpected dictionary values in array of type int32 # with another memory view it's OK pa.chunked_array([fca.indices.view(fca.indices.type)]) Out[45]: [ [ 0, 0, 1, 1, 2 ] ] {code} was: I've experienced a strange error raise when trying to apply `pa.chunked_array` directly on the indices of dictionary_encoding (code is below). Making a memory view solves the problem. 
{code:python} import pyarrow as pa ca = pa.array(['a', 'a', 'b', 'b', 'c']) fca = ca.dictionary_encode() fca.indices [ 0, 0, 1, 1, 2 ] pa.chunked_array([fca.indices]) --- ArrowInvalid Traceback (most recent call last) in > 1 pa.chunked_array([fca.indices]) ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.chunked_array() ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status() ArrowInvalid: Unexpected dictionary values in array of type int32 # with another memory view it's OK pa.chunked_array([pa.Array.from_buffers(type=pa.int32(), length=len(fca.indices), buffers=fca.indices.buffers())]) Out[45]: [ [ 0, 0, 1, 1, 2 ] ] {code} > cannot create a chunked_array from dictionary_encoding result > - > > Key: ARROW-6882 > URL: https://issues.apache.org/jira/browse/ARROW-6882 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 >Reporter: Artem KOZHEVNIKOV >Priority: Major > Fix For: 0.15.1 > > > I've experienced a strange error raise when trying to apply > `pa.chunked_array` directly on the indices of dictionary_encoding (code is > below). Making a memory view solves the problem. > {code:python} > import pyarrow as pa > ca = pa.array(['a', 'a', 'b', 'b', 'c']) > > fca = ca.dictionary_encode() > > fca.indices > > > [ > 0, > 0, > 1, > 1, > 2 > ] > pa.chunked_array([fca.indices]) > > --- > ArrowInvalid Traceback (most recent call last) > in > > 1 pa.chunked_array([fca.indices]) > ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi > in pyarrow.lib.chunked_array() > ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowInvalid: Unexpected dictionary values in array of type int32 > # with
[jira] [Created] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result
Artem KOZHEVNIKOV created ARROW-6882: Summary: cannot create a chunked_array from dictionary_encoding result Key: ARROW-6882 URL: https://issues.apache.org/jira/browse/ARROW-6882 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.0 Reporter: Artem KOZHEVNIKOV I've run into a strange error raised when applying `pa.chunked_array` directly to the indices of a dictionary_encode result (code below). Making a memory view works around the problem.
{code:python}
import pyarrow as pa

ca = pa.array(['a', 'a', 'b', 'b', 'c'])
fca = ca.dictionary_encode()
fca.indices
[0, 0, 1, 1, 2]

pa.chunked_array([fca.indices])
---
ArrowInvalid Traceback (most recent call last)
in
> 1 pa.chunked_array([fca.indices])
~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.chunked_array()
~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Unexpected dictionary values in array of type int32

# with another memory view it's OK
pa.chunked_array([pa.Array.from_buffers(type=pa.int32(), length=len(fca.indices), buffers=fca.indices.buffers())])
Out[45]: [[0, 0, 1, 1, 2]]
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6857) Segfault for dictionary_encode on empty chunked_array (edge case)
Artem KOZHEVNIKOV created ARROW-6857: Summary: Segfault for dictionary_encode on empty chunked_array (edge case) Key: ARROW-6857 URL: https://issues.apache.org/jira/browse/ARROW-6857 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.0 Reporter: Artem KOZHEVNIKOV a reproducer is here : {code:python} import pyarrow as pa aa = pa.chunked_array([pa.array(['a', 'b', 'c'])]) aa[:0].dictionary_encode() # Segmentation fault: 11 {code} For pyarrow=0.14, I could not reproduce. I use a conda version : "pyarrow 0.15.0 py37hdca360a_0 conda-forge" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5103) [Python] Segfault when using chunked_array.to_pandas on array different types (edge case)
[ https://issues.apache.org/jira/browse/ARROW-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913746#comment-16913746 ] Artem KOZHEVNIKOV commented on ARROW-5103: -- it was fixed in 0.14, wasn't it? > [Python] Segfault when using chunked_array.to_pandas on array different types > (edge case) > --- > > Key: ARROW-5103 > URL: https://issues.apache.org/jira/browse/ARROW-5103 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.12.1, 0.13.0 > Environment: pyarrow 0.12.1 py37hf9e6f3b_0 conda-forge > numpy 1.15.4 py37hacdab7b_0 > MacOs | gcc7 | what else ? >Reporter: Artem KOZHEVNIKOV >Priority: Major > Fix For: 0.15.0 > > > {code:java} > import numpy as np > import pyarrow as pa > ca = pa.chunked_array([pa.array(['rr'] * 10), pa.array(np.arange(10))]) > ca.type > ca.to_pandas() > libc++abi.dylib: terminating with uncaught exception of type > std::length_error: basic_string > Abort trap: 6 > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use
[ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912675#comment-16912675 ] Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/22/19 8:48 AM: --- If it were in pure Python, we could do something like the code below (relying on `pa.array.take`):
{code:python}
import numpy as np
import pyarrow as pa
from pandas.core.sorting import get_group_index_sorter

def take_on_chunked_array(charr, indices):
    indices = np.array(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()
    indices[indices < 0] += len(charr)
    if indices.min() < 0:
        raise IndexError()
    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()
    bins = np.searchsorted(cum_lengths, indices, side="right")
    limits_idx = np.concatenate([[0], np.bincount(bins, minlength=len(cum_lengths)).cumsum()])
    sort_idx = get_group_index_sorter(bins, len(cum_lengths))
    del bins
    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx, kind="merge")  # inverse sort indices
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length))
         for i, cum_length in enumerate(cum_lengths)])
    return res_array.take(pa.array(sort_idx))

charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]), pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` method and concat_arrays (or do we want to avoid an extra copy)? If we don't reuse `array.take`, we can of course avoid sorting the indices back and forth.
was (Author: artemk): if it were in pure python, we could do something like below (relying on `pa.array.take`) {code:python} import numpy as np import pyarrow as pa from pandas.core.sorting import get_group_index_sorter def take_on_chunked_array(charr, indices): indices = np.array(indices, dtype=np.int) if indices.max() > len(charr): raise IndexError() indices[indices < 0] += len(charr) if indices.min() < 0: raise IndexError() lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64) cum_lengths = lengths.cumsum() bins = np.searchsorted(cum_lengths, indices, side="right") limits_idx = np.concatenate([[0], np.bincount(bins).cumsum()]) sort_idx = get_group_index_sorter(bins, len(cum_lengths)) del bins indices = indices[sort_idx] sort_idx = np.argsort(sort_idx, kind="merge") # inverse sort indices cum_lengths -= lengths res_array = pa.concat_arrays([charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length)) for i, cum_length in enumerate(cum_lengths)]) return res_array.take(pa.array(sort_idx)) charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]), pa.array([5, 6, 7, 8])]) take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy() pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy() {code} Do we want something similar in C++ ? Should we reuse `cpp:Array:Take` method and concat_arrays (or we want to avoid an extra copy) ? > [C++] Implement Take on ChunkedArray for DataFrame use > -- > > Key: ARROW-5454 > URL: https://issues.apache.org/jira/browse/ARROW-5454 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Follow up to ARROW-2667 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use
[ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912675#comment-16912675 ] Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/22/19 6:42 AM: --- if it were in pure python, we could do something like below (relying on `pa.array.take`) {code:python} import numpy as np import pyarrow as pa from pandas.core.sorting import get_group_index_sorter def take_on_chunked_array(charr, indices): indices = np.array(indices, dtype=np.int) if indices.max() > len(charr): raise IndexError() indices[indices < 0] += len(charr) if indices.min() < 0: raise IndexError() lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64) cum_lengths = lengths.cumsum() bins = np.searchsorted(cum_lengths, indices, side="right") limits_idx = np.concatenate([[0], np.bincount(bins).cumsum()]) sort_idx = get_group_index_sorter(bins, len(cum_lengths)) del bins indices = indices[sort_idx] sort_idx = np.argsort(sort_idx, kind="merge") # inverse sort indices cum_lengths -= lengths res_array = pa.concat_arrays([charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length)) for i, cum_length in enumerate(cum_lengths)]) return res_array.take(pa.array(sort_idx)) charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]), pa.array([5, 6, 7, 8])]) take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy() pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy() {code} Do we want something similar in C++ ? Should we reuse `cpp:Array:Take` method and concat_arrays (or we want to avoid an extra copy) ? 
was (Author: artemk): if it were in pure python, we could do something like below (relying on `pa.array.take`) {code:python} import numpy as np import pyarrow as pa from pandas.core.sorting import get_group_index_sorter def take_on_chunked_array(charr, indices): indices = np.array(indices, dtype=np.int) if indices.max() > len(charr): raise IndexError() indices[indices < 0] += len(charr) if indices.min() < 0: raise IndexError() lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64) cum_lengths = lengths.cumsum() bins = np.searchsorted(cum_lengths, indices, side="right") limits_idx = np.concatenate([[0], np.bincount(bins).cumsum()]) sort_idx = get_group_index_sorter(bins, len(cum_lengths)) del bins indices = indices[sort_idx] sort_idx = np.argsort(sort_idx, kind="merge") # inverse sort indices cum_lengths -= lengths res_array = pa.concat_arrays([charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length)) for i, cum_length in enumerate(cum_lengths)]) return res_array.take(pa.array(sort_idx)) charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]), pa.array([5, 6, 7, 8])]) take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy() pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy() {code} Do we want something similar in C++ ? Should we reuse `cpp:Array:Take` method (or we want to avoid an extra copy) ? > [C++] Implement Take on ChunkedArray for DataFrame use > -- > > Key: ARROW-5454 > URL: https://issues.apache.org/jira/browse/ARROW-5454 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Follow up to ARROW-2667 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use
[ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912675#comment-16912675 ] Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/22/19 6:34 AM: --- if it were in pure python, we could do something like below (relying on `pa.array.take`) {code:python} import numpy as np import pyarrow as pa from pandas.core.sorting import get_group_index_sorter def take_on_chunked_array(charr, indices): indices = np.array(indices, dtype=np.int) if indices.max() > len(charr): raise IndexError() indices[indices < 0] += len(charr) if indices.min() < 0: raise IndexError() lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64) cum_lengths = lengths.cumsum() bins = np.searchsorted(cum_lengths, indices, side="right") limits_idx = np.concatenate([[0], np.bincount(bins).cumsum()]) sort_idx = get_group_index_sorter(bins, len(cum_lengths)) del bins indices = indices[sort_idx] sort_idx = np.argsort(sort_idx, kind="merge") # inverse sort indices cum_lengths -= lengths res_array = pa.concat_arrays([charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length)) for i, cum_length in enumerate(cum_lengths)]) return res_array.take(pa.array(sort_idx)) charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]), pa.array([5, 6, 7, 8])]) take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy() pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy() {code} Do we want something similar in C++ ? Should we reuse `cpp:Array:Take` method (or we want to avoid an extra copy) ? 
was (Author: artemk): if it were in pure python, we could do something like below (relying on `pa.array.take`) {code:python} import numpy as np import pyarrow as pa def take_on_chunked_array(charr, indices): indices = np.asarray(indices, dtype=np.int) if indices.max() > len(charr): raise IndexError() indices[indices < 0] += len(charr) if indices.min() < 0: raise IndexError() lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64) cum_lengths = lengths.cumsum() sort_idx = np.argsort(indices) indices = indices[sort_idx] sort_idx = np.argsort(sort_idx) # inverse sort indices # btw, we could check if indices are already sorted to avoid an extra copy in this case limit_idx = [(0, 0, 0)] for i, cum_length in enumerate(cum_lengths): limit_idx.append((i, limit_idx[-1][-1], np.searchsorted(indices, cum_length))) limit_idx = limit_idx[1:] cum_lengths -= lengths res_array = pa.concat_arrays([charr.chunks[i].take(pa.array(indices[j_start:j_end] - cum_lengths[i])) for i, j_start, j_end in limit_idx if j_start < j_end]) return res_array.take(pa.array(sort_idx)) charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]), pa.array([5, 6, 7, 8])]) take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy() pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy() {code} Do we want something similar in C++ ? Should we reuse `cpp:Array:Take` method (or we want to avoid an extra copy) ? We certainly can avoid global indices sorting as well. > [C++] Implement Take on ChunkedArray for DataFrame use > -- > > Key: ARROW-5454 > URL: https://issues.apache.org/jira/browse/ARROW-5454 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Follow up to ARROW-2667 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use
[ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912675#comment-16912675 ] Artem KOZHEVNIKOV commented on ARROW-5454:
--
If it were in pure Python, we could do something like the code below (relying on `pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa


def take_on_chunked_array(charr, indices):
    indices = np.asarray(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()
    indices[indices < 0] += len(charr)
    if indices.min() < 0:
        raise IndexError()
    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()
    sort_idx = np.argsort(indices)
    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx)  # inverse sort indices
    # btw, we could check if indices are already sorted to avoid an extra copy in this case
    limit_idx = [(0, 0, 0)]
    for i, cum_length in enumerate(cum_lengths):
        limit_idx.append((i, limit_idx[-1][-1], np.searchsorted(indices, cum_length)))
    limit_idx = limit_idx[1:]
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[j_start:j_end] - cum_lengths[i]))
         for i, j_start, j_end in limit_idx if j_start < j_end])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]), pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}
Do we want something similar in C++? Should we reuse the C++ `Array::Take` method (or do we want to avoid an extra copy)?

> [C++] Implement Take on ChunkedArray for DataFrame use
> ------------------------------------------------------
>
> Key: ARROW-5454
> URL: https://issues.apache.org/jira/browse/ARROW-5454
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
> Follow up to ARROW-2667

--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5713) [Python] fancy indexing on pa.array
[ https://issues.apache.org/jira/browse/ARROW-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909283#comment-16909283 ] Artem KOZHEVNIKOV commented on ARROW-5713:
--
It looks like the `Array.take` function is already available in version 0.14.2! `Table.take` would be nice as well.

> [Python] fancy indexing on pa.array
> -----------------------------------
>
> Key: ARROW-5713
> URL: https://issues.apache.org/jira/browse/ARROW-5713
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Artem KOZHEVNIKOV
> Priority: Major
>
> In numpy one can do:
> {code:java}
> In [2]: import numpy as np
> In [3]: a = np.array(['a', 'bb', 'ccc', ''], dtype="O")
> In [4]: indices = np.array([0, -1, 2, 2, 0, 3])
> In [5]: a[indices]
> Out[5]: array(['a', '', 'ccc', 'ccc', 'a', ''], dtype=object)
> {code}
> It would be nice to have a similar feature in pyarrow. Currently, pa.Array.__getitem__ supports only a slice or a single element as an argument.
> Of course, there are workarounds, like below:
> {code:java}
> In [6]: import pyarrow as pa
> In [7]: a = pa.array(['a', 'bb', 'ccc', ''])
> In [8]: pa.array(a.to_pandas()[indices])  # if len(indices) is high
> Out[8]: [ "a", "", "ccc", "ccc", "a", "" ]
> In [9]: pa.array([a[i].as_py() for i in indices])  # if len(indices) is low
> Out[9]: [ "a", "", "ccc", "ccc", "a", "" ]
> {code}
> Both are not memory efficient.

--
This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5713) fancy indexing on pa.array
Artem KOZHEVNIKOV created ARROW-5713:

Summary: fancy indexing on pa.array
Key: ARROW-5713
URL: https://issues.apache.org/jira/browse/ARROW-5713
Project: Apache Arrow
Issue Type: New Feature
Components: C++, Python
Reporter: Artem KOZHEVNIKOV

In numpy one can do:
{code:java}
In [2]: import numpy as np
In [3]: a = np.array(['a', 'bb', 'ccc', ''], dtype="O")
In [4]: indices = np.array([0, -1, 2, 2, 0, 3])
In [5]: a[indices]
Out[5]: array(['a', '', 'ccc', 'ccc', 'a', ''], dtype=object)
{code}
It would be nice to have a similar feature in pyarrow. Currently, pa.Array.__getitem__ supports only a slice or a single element as an argument. Of course, there are workarounds, like below:
{code:java}
In [6]: import pyarrow as pa
In [7]: a = pa.array(['a', 'bb', 'ccc', ''])
In [8]: pa.array(a.to_pandas()[indices])  # if len(indices) is high
Out[8]: [ "a", "", "ccc", "ccc", "a", "" ]
In [9]: pa.array([a[i].as_py() for i in indices])  # if len(indices) is low
Out[9]: [ "a", "", "ccc", "ccc", "a", "" ]
{code}
Both are not memory efficient.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present
[ https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871655#comment-16871655 ] Artem KOZHEVNIKOV commented on ARROW-5208:
--
Yes, I'm aware of the Cython limitations and of modern C++ features (which sometimes look very Pythonic :). So if I understand correctly, in Arrow every non-trivial computational part is C++-based and Cython is used only to wrap the C++ API. Do you plan at some point to switch to automatic bindings generation in Arrow (as PyTorch does with pybind11) and get rid of Cython completely? (The current C++ modules still look far from auto-generatable.)

> [Python] Inconsistent resulting type during casting in pa.array() when mask is present
> --------------------------------------------------------------------------------------
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Reporter: Artem KOZHEVNIKOV
> Assignee: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.14.0
> Time Spent: 20m
> Remaining Estimate: 0h
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present
[ https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870879#comment-16870879 ] Artem KOZHEVNIKOV commented on ARROW-5208:
--
Just a curiosity question: what were the reasons for having the `python_to_arrow.cc` module in pure C++ rather than, say, in Cython? (To be honest, I did not feel comfortable enough to contribute in C++...) Will passing the mask argument into `InferArrowType` also solve the case where `_is_array_like(obj) is True` (or is the inference based on pandas in that case)?

> [Python] Inconsistent resulting type during casting in pa.array() when mask is present
> --------------------------------------------------------------------------------------
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Reporter: Artem KOZHEVNIKOV
> Assignee: Wes McKinney
> Priority: Major
> Fix For: 0.14.0
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present
[ https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826896#comment-16826896 ] Artem KOZHEVNIKOV commented on ARROW-5208:
--
Yes, absolutely, it would be nice to get involved! Is there any doc that would be useful to start with? CI best practices?

> [Python] Inconsistent resulting type during casting in pa.array() when mask is present
> --------------------------------------------------------------------------------------
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Reporter: Artem KOZHEVNIKOV
> Priority: Major
> Fix For: 0.14.0
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5208) Inconsistent resulting type during casting in pa.array() when mask is present
[ https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem KOZHEVNIKOV updated ARROW-5208:
---
Description:
I would expect Int64Array type in all cases below:
{code:java}
>>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
[4, null, 4, null]
>>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
[4, null, 4, null]
>>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
[4, null, 4, null]
{code}

was:
I would expect Int64Array type in all cases below:
{code:java}
pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
[4, null, 4, null]
pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
[4, null, 4, null]
pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
[4, null, 4, null]
{code}

> Inconsistent resulting type during casting in pa.array() when mask is present
> -----------------------------------------------------------------------------
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Reporter: Artem KOZHEVNIKOV
> Priority: Major
>
> I would expect Int64Array type in all cases below:
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5208) Inconsistent resulting type during casting in pa.array() when mask is present
Artem KOZHEVNIKOV created ARROW-5208:

Summary: Inconsistent resulting type during casting in pa.array() when mask is present
Key: ARROW-5208
URL: https://issues.apache.org/jira/browse/ARROW-5208
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.13.0
Reporter: Artem KOZHEVNIKOV

I would expect Int64Array type in all cases below:
{code:java}
pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
[4, null, 4, null]
pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
[4, null, 4, null]
pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
[4, null, 4, null]
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5103) Segfault when using chunked_array.to_pandas on array different types (edge case)
[ https://issues.apache.org/jira/browse/ARROW-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem KOZHEVNIKOV updated ARROW-5103:
---
Priority: Minor (was: Major)

> Segfault when using chunked_array.to_pandas on array different types (edge case)
> --------------------------------------------------------------------------------
>
> Key: ARROW-5103
> URL: https://issues.apache.org/jira/browse/ARROW-5103
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.12.1
> Environment: pyarrow 0.12.1 py37hf9e6f3b_0 conda-forge
> numpy 1.15.4 py37hacdab7b_0
> MacOs | gcc7 | what else ?
> Reporter: Artem KOZHEVNIKOV
> Priority: Minor
>
> {code:java}
> import numpy as np
> import pyarrow as pa
> ca = pa.chunked_array([pa.array(['rr'] * 10), pa.array(np.arange(10))])
> ca.type
> ca.to_pandas()
> libc++abi.dylib: terminating with uncaught exception of type std::length_error: basic_string
> Abort trap: 6
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5103) Segfault when using chunked_array.to_pandas on array different types (edge case)
Artem KOZHEVNIKOV created ARROW-5103:

Summary: Segfault when using chunked_array.to_pandas on array different types (edge case)
Key: ARROW-5103
URL: https://issues.apache.org/jira/browse/ARROW-5103
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 0.12.1
Environment: pyarrow 0.12.1 py37hf9e6f3b_0 conda-forge
numpy 1.15.4 py37hacdab7b_0
MacOs | gcc7 | what else ?
Reporter: Artem KOZHEVNIKOV

{code:java}
import numpy as np
import pyarrow as pa
ca = pa.chunked_array([pa.array(['rr'] * 10), pa.array(np.arange(10))])
ca.type
ca.to_pandas()
libc++abi.dylib: terminating with uncaught exception of type std::length_error: basic_string
Abort trap: 6
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)