[jira] [Created] (ARROW-10494) take silently overflows on list array (when casting to large_list is needed)

2020-11-04 Thread Artem KOZHEVNIKOV (Jira)
Artem KOZHEVNIKOV created ARROW-10494:
-

 Summary: take silently overflows on list array (when casting to 
large_list is needed)
 Key: ARROW-10494
 URL: https://issues.apache.org/jira/browse/ARROW-10494
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
Reporter: Artem KOZHEVNIKOV


Reproducer below:
{code:python}
import numpy as np
import pyarrow as pa
arr = pa.array([np.arange(x).astype(np.int8) for x in range(6)])
nb_repeat = 2**32 // arr.offsets.to_numpy()[-1]
indices = pa.array(np.repeat(np.arange(len(arr)), nb_repeat))
big_arr = arr.take(indices)
print(big_arr.offsets[-5:])
big_arr.validate() # hopefully this can catch it 

[
  -21,
  -16,
  -11,
  -6,
  -1
]
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input> in <module>
      6 big_arr = arr.take(indices)
      7 print(big_arr.offsets[-5:])
----> 8 big_arr.validate()

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.Array.validate()

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Negative offsets in list array
{code}

and it works fine with large_list (as expected):

{code:python}

import numpy as np
import pyarrow as pa
arr = pa.array([np.arange(x).astype(np.int8) for x in range(6)],
               type=pa.large_list(pa.int8()))
nb_repeat = 2**32 // arr.offsets.to_numpy()[-1]
indices = pa.array(np.repeat(np.arange(len(arr)), nb_repeat))
big_arr = arr.take(indices)
print(big_arr.offsets[-5:])
big_arr.validate()
[
  4294967275,
  4294967280,
  4294967285,
  4294967290,
  4294967295
]
{code}
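
In the meantime, a hedged workaround sketch (to_large_list below is my helper, not a pyarrow API; it assumes the array has no nulls and no slice offset): rebuild the list array with int64 offsets before take, so the result offsets cannot wrap around int32.

{code:python}
import numpy as np
import pyarrow as pa

def to_large_list(arr):
    # assumes arr has no nulls and no slice offset
    return pa.LargeListArray.from_arrays(arr.offsets.cast(pa.int64()),
                                         arr.values)

arr = pa.array([np.arange(x).astype(np.int8) for x in range(6)])
indices = pa.array(np.repeat(np.arange(len(arr)), 3))
big_arr = to_large_list(arr).take(indices)
big_arr.validate()
{code}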





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10172) concat_arrays requires upcast for large arrays

2020-10-05 Thread Artem KOZHEVNIKOV (Jira)
Artem KOZHEVNIKOV created ARROW-10172:
-

 Summary: concat_arrays requires upcast for large arrays
 Key: ARROW-10172
 URL: https://issues.apache.org/jira/browse/ARROW-10172
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 1.0.1
Reporter: Artem KOZHEVNIKOV


I'm sorry if this was already reported, but there's an overflow issue in the 
concatenation of large arrays:

{code:python}
In [1]: import pyarrow as pa

In [2]: str_array = pa.array(['a' * 128] * 10**8)

In [3]: large_array = pa.concat_arrays([str_array] * 50)
Segmentation fault (core dumped)
{code}

I suppose this should be handled by an upcast to large_string.
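
Meanwhile, a hedged workaround sketch (it assumes the string -> large_string cast is available in your pyarrow version; otherwise build the array as large_string from the start): upcast to 64-bit offsets before concatenating, so the combined data can exceed the int32 limit safely.

{code:python}
import pyarrow as pa

str_array = pa.array(['a' * 128] * 10**6)     # smaller than the reproducer
large = str_array.cast(pa.large_string())     # int64 offsets
large_array = pa.concat_arrays([large] * 50)  # no int32 offset overflow
large_array.validate()
{code}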



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7731) [C++][Parquet] Support LargeListArray

2020-02-04 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027457#comment-17027457
 ] 

Artem KOZHEVNIKOV edited comment on ARROW-7731 at 2/4/20 9:11 AM:
--

I found another edge case that's maybe linked to this (pyarrow 0.15.1):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
l1 = pa.array([list(range(100))] * 10**7, type=pa.list_(pa.int16()))
tt = pa.Table.from_pydict({'big': pa.chunked_array([l1] * 10)})  # if concatenated, offsets would overflow int32
pq.write_table(tt, '/tmp/test.parquet')  # that took a while but worked
tt_reload = pq.read_table('/tmp/test.parquet')  # it consumes a huge amount of memory before failing

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input> in <module>
----> 1 tt_reload = pq.read_table('/tmp/test.parquet')

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size)
   1279                 buffer_size=buffer_size)
   1280     return pf.read(columns=columns, use_threads=use_threads,
-> 1281                    use_pandas_metadata=use_pandas_metadata)
   1282 
   1283 

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata)
   1135             table = piece.read(columns=columns, use_threads=use_threads,
   1136                                partitions=self.partitions,
-> 1137                                use_pandas_metadata=use_pandas_metadata)
   1138             tables.append(table)
   1139 

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, partitions, file, use_pandas_metadata)
    603             table = reader.read_row_group(self.row_group, **options)
    604         else:
--> 605             table = reader.read(**options)
    606 
    607         if len(self.partition_keys) > 0:

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata)
    251                 columns, use_pandas_metadata=use_pandas_metadata)
    252         return self.reader.read_all(column_indices=column_indices,
--> 253                                     use_threads=use_threads)
    254 
    255     def scan_contents(self, columns=None, batch_size=65536):

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Column 0: Offset invariant failure: 21474837 inconsistent offset for non-null slot: -2147483596<2147483600

{code}
The thrown error is not explicit. I wonder whether the created parquet file is 
correct (I have not yet tried to reload it with Spark) or whether it's just 
pyarrow's reading that does not support it.
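
To probe whether the written file itself is intact, a hedged sketch (my loop, not an official recipe; it assumes each row group alone stays under the int32 offset limit, and Table.validate may be absent on older pyarrow versions):

{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile('/tmp/test.parquet')
for i in range(pf.metadata.num_row_groups):
    rg = pf.read_row_group(i)  # one row group at a time
    rg.validate()              # raises if offsets are inconsistent
{code}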


was (Author: artemk):
I found another edge case that maybe links to this (pyarrow 0.15.1):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
l1 = pa.array([list(range(100))] * 10**7, type=pa.list_(pa.int16()))
tt = pa.Table.from_pydict({'big': pa.chunked_array([l1] * 10)})  # if concatenated, offsets would overflow int32
pq.write_table(tt, '/tmp/test.parquet')  # that took a while but worked
tt_reload = pq.read_table('/tmp/test.parquet')  # it consumes a huge amount of memory before failing

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input> in <module>
----> 1 tt_reload = pq.read_table('/tmp/test.parquet')

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size)
   1279                 buffer_size=buffer_size)
   1280     return pf.read(columns=columns, use_threads=use_threads,
-> 1281                    use_pandas_metadata=use_pandas_metadata)
   1282 
   1283 

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata)
   1135             table = piece.read(columns=columns, use_threads=use_threads,
   1136                                partitions=self.partitions,
-> 1137                                use_pandas_metadata=use_pandas_metadata)
   1138             tables.append(table)
   1139 

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, partitions, file, use_pandas_metadata)
    603             table = reader.read_row_group(self.row_group, **options)
    604         else:
--> 605             table = reader.read(**options)
    606 
    607         if len(self.partition_keys) > 0:

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata)
    251                 columns, use_pandas_metadata=use_pandas_metadata)
    252         return self.reader.read_all(column_indices=column_indices,
--> 253                                     use_threads=use_threads)
    254 
    255     def scan_contents(self, columns=None, batch_size=65536):

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Column 0: Offset invariant failure: 21474837 inconsistent offset for non-null slot: -2147483596<2147483600
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-7731) [Parquet] Support LargeListArray

2020-01-31 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027457#comment-17027457
 ] 

Artem KOZHEVNIKOV commented on ARROW-7731:
--

I found another edge case that may be linked to this (pyarrow 0.15.1):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
l1 = pa.array([list(range(100))] * 10**7, type=pa.list_(pa.int16()))
tt = pa.Table.from_pydict({'big': pa.chunked_array([l1] * 10)})  # if concatenated, offsets would overflow int32
pq.write_table(tt, '/tmp/test.parquet')  # that took a while but worked
tt_reload = pq.read_table('/tmp/test.parquet')  # it consumes a huge amount of memory before failing

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input> in <module>
----> 1 tt_reload = pq.read_table('/tmp/test.parquet')

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size)
   1279                 buffer_size=buffer_size)
   1280     return pf.read(columns=columns, use_threads=use_threads,
-> 1281                    use_pandas_metadata=use_pandas_metadata)
   1282 
   1283 

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata)
   1135             table = piece.read(columns=columns, use_threads=use_threads,
   1136                                partitions=self.partitions,
-> 1137                                use_pandas_metadata=use_pandas_metadata)
   1138             tables.append(table)
   1139 

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, partitions, file, use_pandas_metadata)
    603             table = reader.read_row_group(self.row_group, **options)
    604         else:
--> 605             table = reader.read(**options)
    606 
    607         if len(self.partition_keys) > 0:

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata)
    251                 columns, use_pandas_metadata=use_pandas_metadata)
    252         return self.reader.read_all(column_indices=column_indices,
--> 253                                     use_threads=use_threads)
    254 
    255     def scan_contents(self, columns=None, batch_size=65536):

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Column 0: Offset invariant failure: 21474837 inconsistent offset for non-null slot: -2147483596<2147483600

{code}

> [Parquet] Support LargeListArray
> 
>
> Key: ARROW-7731
> URL: https://issues.apache.org/jira/browse/ARROW-7731
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: marc abboud
>Priority: Major
>
> For now it's not possible to write a pyarrow.Table containing a 
> LargeListArray in parquet. The lines
> {code:java}
> from pyarrow import parquet
> import pyarrow as pa
> indices = [1, 2, 3]
> indptr = [0, 1, 2, 3]
> q = pa.lib.LargeListArray.from_arrays(indptr, indices) 
> table = pa.Table.from_arrays([q], names=['no']) 
> parquet.write_table(table, '/test'){code}
> yield the error 
> {code:java}
> ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema 
> conversion: large_list
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers

2019-10-28 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961079#comment-16961079
 ] 

Artem KOZHEVNIKOV commented on ARROW-7008:
--

is 

> [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
> 
>
> Key: ARROW-7008
> URL: https://issues.apache.org/jira/browse/ARROW-7008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Priority: Major
>
> Minimal reproducer:
> {code}
> import pyarrow as pa
> pa.chunked_array([pa.array([], 
> type=pa.string()).dictionary_encode().dictionary])
> {code}
> Traceback
> {code}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x20)
>   * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status 
> arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94
> frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status 
> arrow::VisitArrayInline(arrow::Array 
> const&, arrow::internal::ValidateVisitor*) + 915
> frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() 
> const + 829
> frame #3: 0x000112e3ea19 
> libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89
> frame #4: 0x000112b8eb7d 
> lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, 
> _object*, _object*) + 3661
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers

2019-10-28 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961079#comment-16961079
 ] 

Artem KOZHEVNIKOV edited comment on ARROW-7008 at 10/28/19 2:10 PM:


is it the same issue as https://issues.apache.org/jira/browse/ARROW-6857 ?


was (Author: artemk):
is 

> [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
> 
>
> Key: ARROW-7008
> URL: https://issues.apache.org/jira/browse/ARROW-7008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Priority: Major
>
> Minimal reproducer:
> {code}
> import pyarrow as pa
> pa.chunked_array([pa.array([], 
> type=pa.string()).dictionary_encode().dictionary])
> {code}
> Traceback
> {code}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x20)
>   * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status 
> arrow::internal::ValidateVisitor::ValidateOffsets const>(arrow::BinaryArray const&) + 94
> frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status 
> arrow::VisitArrayInline(arrow::Array 
> const&, arrow::internal::ValidateVisitor*) + 915
> frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() 
> const + 829
> frame #3: 0x000112e3ea19 
> libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89
> frame #4: 0x000112b8eb7d 
> lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, 
> _object*, _object*) + 3661
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

2019-10-15 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951704#comment-16951704
 ] 

Artem KOZHEVNIKOV commented on ARROW-5454:
--

[~wesm], could you have a look at this? Apart from the arrow::DataFrame 
project, in the meantime this feature can be very useful for working with 
pyarrow data structures via pandas.ExtensionArray.
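
For instance, a hedged sketch (ArrowBackedArray is a hypothetical minimal wrapper, only to illustrate the need): a pandas ExtensionArray backed by a pyarrow ChunkedArray currently has to concatenate its chunks before every take.

{code:python}
import numpy as np
import pyarrow as pa

class ArrowBackedArray:
    # hypothetical minimal wrapper, not a real pandas ExtensionArray
    def __init__(self, charr):
        self._charr = charr  # a pa.ChunkedArray

    def take(self, indices):
        # pandas calls take for indexing/alignment; without a native
        # ChunkedArray.take we must concatenate the chunks first
        flat = pa.concat_arrays(self._charr.chunks)  # extra copy today
        taken = flat.take(pa.array(np.asarray(indices)))
        return ArrowBackedArray(pa.chunked_array([taken]))

arr = ArrowBackedArray(pa.chunked_array([pa.array([1, 2]), pa.array([3, 4])]))
arr.take([3, 0])._charr
{code}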

> [C++] Implement Take on ChunkedArray for DataFrame use
> --
>
> Key: ARROW-5454
> URL: https://issues.apache.org/jira/browse/ARROW-5454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-2667



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

2019-10-15 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951704#comment-16951704
 ] 

Artem KOZHEVNIKOV edited comment on ARROW-5454 at 10/15/19 8:09 AM:


[~wesm], could you have a look at this? Apart from the arrow::DataFrame 
project, in the meantime this feature can be very useful for working with 
pyarrow data structures via pandas.ExtensionArray.


was (Author: artemk):
[~wesm], could you have a look at this? Apart from the arrow::DataFrame 
project, in the meantime this feature can be very useful for working with 
pyarrow data structures via pandas.ExtensionArray.

> [C++] Implement Take on ChunkedArray for DataFrame use
> --
>
> Key: ARROW-5454
> URL: https://issues.apache.org/jira/browse/ARROW-5454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-2667



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Artem KOZHEVNIKOV (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem KOZHEVNIKOV updated ARROW-6882:
-
Description: 
I've experienced a strange error raised when applying `pa.chunked_array` 
directly to the indices of a dictionary_encode result (code below). Making a 
memory view solves the problem.
{code:python}
import pyarrow as pa
ca = pa.array(['a', 'a', 'b', 'b', 'c'])
fca = ca.dictionary_encode()
fca.indices
[
  0,
  0,
  1,
  1,
  2
]

pa.chunked_array([fca.indices])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input> in <module>
----> 1 pa.chunked_array([fca.indices])

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.chunked_array()

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Unexpected dictionary values in array of type int32

# with another memory view it's OK
pa.chunked_array([fca.indices.view(fca.indices.type)])
Out[45]:
[
  [
    0,
    0,
    1,
    1,
    2
  ]
]
{code}

  was:
I've experienced a strange error raised when applying `pa.chunked_array` 
directly to the indices of a dictionary_encode result (code below). Making a 
memory view solves the problem.
{code:python}
import pyarrow as pa
ca = pa.array(['a', 'a', 'b', 'b', 'c'])
fca = ca.dictionary_encode()
fca.indices
[
  0,
  0,
  1,
  1,
  2
]

pa.chunked_array([fca.indices])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input> in <module>
----> 1 pa.chunked_array([fca.indices])

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.chunked_array()

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Unexpected dictionary values in array of type int32

# with another memory view it's OK
pa.chunked_array([pa.Array.from_buffers(type=pa.int32(),
                                        length=len(fca.indices),
                                        buffers=fca.indices.buffers())])
Out[45]:
[
  [
    0,
    0,
    1,
    1,
    2
  ]
]
{code}


> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.15.1
>
>
> I've experienced a strange error raise when trying to apply 
> `pa.chunked_array` directly on the indices of dictionary_encoding (code is 
> below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])
> fca = ca.dictionary_encode()
> fca.indices
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input> in <module>
> ----> 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with another memory view it's OK
> pa.chunked_array([fca.indices.view(fca.indices.type)])
> [
>   [
>     0,
>     0,
>     1,
>     1,
>     2
>   ]
> ]
> {code}


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Artem KOZHEVNIKOV (Jira)
Artem KOZHEVNIKOV created ARROW-6882:


 Summary: cannot create a chunked_array from dictionary_encoding 
result
 Key: ARROW-6882
 URL: https://issues.apache.org/jira/browse/ARROW-6882
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
Reporter: Artem KOZHEVNIKOV


I've experienced a strange error raised when applying `pa.chunked_array` 
directly to the indices of a dictionary_encode result (code below). Making a 
memory view solves the problem.
{code:python}
import pyarrow as pa
ca = pa.array(['a', 'a', 'b', 'b', 'c'])
fca = ca.dictionary_encode()
fca.indices
[
  0,
  0,
  1,
  1,
  2
]

pa.chunked_array([fca.indices])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input> in <module>
----> 1 pa.chunked_array([fca.indices])

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.chunked_array()

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Unexpected dictionary values in array of type int32

# with another memory view it's OK
pa.chunked_array([pa.Array.from_buffers(type=pa.int32(),
                                        length=len(fca.indices),
                                        buffers=fca.indices.buffers())])
Out[45]:
[
  [
    0,
    0,
    1,
    1,
    2
  ]
]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6857) Segfault for dictionary_encode on empty chunked_array (edge case)

2019-10-11 Thread Artem KOZHEVNIKOV (Jira)
Artem KOZHEVNIKOV created ARROW-6857:


 Summary: Segfault for dictionary_encode on empty chunked_array 
(edge case)
 Key: ARROW-6857
 URL: https://issues.apache.org/jira/browse/ARROW-6857
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
Reporter: Artem KOZHEVNIKOV


A reproducer is here:
{code:python}
import pyarrow as pa
aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
aa[:0].dictionary_encode()  
# Segmentation fault: 11
{code}
With pyarrow 0.14, I could not reproduce it.
I use a conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge".
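
A hedged workaround sketch until this is fixed (safe_dictionary_encode below is my helper, not a pyarrow API): special-case the zero-length input instead of calling dictionary_encode on an empty ChunkedArray.

{code:python}
import pyarrow as pa

def safe_dictionary_encode(charr):
    if len(charr) == 0:
        # encode an empty plain array instead of the empty ChunkedArray
        empty = pa.array([], type=charr.type).dictionary_encode()
        return pa.chunked_array([empty])
    return charr.dictionary_encode()

aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
safe_dictionary_encode(aa[:0])  # no segfault
{code}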



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5103) [Python] Segfault when using chunked_array.to_pandas on arrays of different types (edge case)

2019-08-22 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913746#comment-16913746
 ] 

Artem KOZHEVNIKOV commented on ARROW-5103:
--

it was fixed in 0.14, wasn't it?

> [Python] Segfault when using chunked_array.to_pandas on arrays of different 
> types (edge case)
> ---
>
> Key: ARROW-5103
> URL: https://issues.apache.org/jira/browse/ARROW-5103
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1, 0.13.0
> Environment: pyarrow 0.12.1 py37hf9e6f3b_0 conda-forge
> numpy   1.15.4   py37hacdab7b_0  
> MacOs | gcc7 | what else ?
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.15.0
>
>
> {code:java}
> import numpy as np
> import pyarrow as pa
> ca = pa.chunked_array([pa.array(['rr'] * 10), pa.array(np.arange(10))])
> ca.type
> ca.to_pandas()
> libc++abi.dylib: terminating with uncaught exception of type 
> std::length_error: basic_string
> Abort trap: 6
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

2019-08-22 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912675#comment-16912675
 ] 

Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/22/19 8:48 AM:
---

If it were in pure Python, we could do something like below (relying on 
`pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa
from pandas.core.sorting import get_group_index_sorter

def take_on_chunked_array(charr, indices):
    indices = np.array(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()

    indices[indices < 0] += len(charr)

    if indices.min() < 0:
        raise IndexError()

    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()

    # chunk id of each index, and boundaries of each chunk's run of indices
    bins = np.searchsorted(cum_lengths, indices, side="right")
    limits_idx = np.concatenate(
        [[0], np.bincount(bins, minlength=len(cum_lengths)).cumsum()])

    sort_idx = get_group_index_sorter(bins, len(cum_lengths))
    del bins

    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx, kind="mergesort")  # inverse sort indices

    # take per chunk with chunk-local indices, then restore the input order
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length))
         for i, cum_length in enumerate(cum_lengths)])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}

Do we want something similar in C++? Should we reuse the `cpp:Array:Take` 
method and concat_arrays (or do we want to avoid an extra copy)? If we don't 
reuse `array.take`, we can of course avoid sorting indices back and forth.



was (Author: artemk):
If it were in pure Python, we could do something like below (relying on 
`pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa
from pandas.core.sorting import get_group_index_sorter

def take_on_chunked_array(charr, indices):
    indices = np.array(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()

    indices[indices < 0] += len(charr)

    if indices.min() < 0:
        raise IndexError()

    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()

    # chunk id of each index, and boundaries of each chunk's run of indices
    bins = np.searchsorted(cum_lengths, indices, side="right")
    limits_idx = np.concatenate(
        [[0], np.bincount(bins, minlength=len(cum_lengths)).cumsum()])

    sort_idx = get_group_index_sorter(bins, len(cum_lengths))
    del bins

    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx, kind="mergesort")  # inverse sort indices

    # take per chunk with chunk-local indices, then restore the input order
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length))
         for i, cum_length in enumerate(cum_lengths)])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()

{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` 
method and concat_arrays (or do we want to avoid an extra copy)?

> [C++] Implement Take on ChunkedArray for DataFrame use
> --
>
> Key: ARROW-5454
> URL: https://issues.apache.org/jira/browse/ARROW-5454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-2667



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

2019-08-22 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912675#comment-16912675
 ] 

Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/22/19 6:42 AM:
---

If it were in pure Python, we could do something like below (relying on 
`pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa
from pandas.core.sorting import get_group_index_sorter

def take_on_chunked_array(charr, indices):
    indices = np.array(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()

    indices[indices < 0] += len(charr)

    if indices.min() < 0:
        raise IndexError()

    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()

    # chunk id of each index, and boundaries of each chunk's run of indices
    bins = np.searchsorted(cum_lengths, indices, side="right")
    limits_idx = np.concatenate(
        [[0], np.bincount(bins, minlength=len(cum_lengths)).cumsum()])

    sort_idx = get_group_index_sorter(bins, len(cum_lengths))
    del bins

    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx, kind="mergesort")  # inverse sort indices

    # take per chunk with chunk-local indices, then restore the input order
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length))
         for i, cum_length in enumerate(cum_lengths)])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()

{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` 
method and concat_arrays (or do we want to avoid an extra copy)?


was (Author: artemk):
If it were in pure Python, we could do something like below (relying on 
`pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa
from pandas.core.sorting import get_group_index_sorter

def take_on_chunked_array(charr, indices):
    indices = np.array(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()

    indices[indices < 0] += len(charr)

    if indices.min() < 0:
        raise IndexError()

    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()

    # chunk id of each index, and boundaries of each chunk's run of indices
    bins = np.searchsorted(cum_lengths, indices, side="right")
    limits_idx = np.concatenate(
        [[0], np.bincount(bins, minlength=len(cum_lengths)).cumsum()])

    sort_idx = get_group_index_sorter(bins, len(cum_lengths))
    del bins

    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx, kind="mergesort")  # inverse sort indices

    # take per chunk with chunk-local indices, then restore the input order
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length))
         for i, cum_length in enumerate(cum_lengths)])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()

{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` 
method and concat_arrays (or do we want to avoid an extra copy)?

> [C++] Implement Take on ChunkedArray for DataFrame use
> --
>
> Key: ARROW-5454
> URL: https://issues.apache.org/jira/browse/ARROW-5454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-2667



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

2019-08-22 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912675#comment-16912675
 ] 

Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/22/19 6:34 AM:
---

If it were in pure Python, we could do something like below (relying on 
`pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa
from pandas.core.sorting import get_group_index_sorter

def take_on_chunked_array(charr, indices):
    indices = np.array(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()

    indices[indices < 0] += len(charr)

    if indices.min() < 0:
        raise IndexError()

    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()

    # chunk id of each index, and boundaries of each chunk's run of indices
    bins = np.searchsorted(cum_lengths, indices, side="right")
    limits_idx = np.concatenate(
        [[0], np.bincount(bins, minlength=len(cum_lengths)).cumsum()])

    sort_idx = get_group_index_sorter(bins, len(cum_lengths))
    del bins

    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx, kind="mergesort")  # inverse sort indices

    # take per chunk with chunk-local indices, then restore the input order
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length))
         for i, cum_length in enumerate(cum_lengths)])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()

{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` 
method (or do we want to avoid an extra copy)?


was (Author: artemk):
If it were in pure Python, we could do something like below (relying on 
`pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa
def take_on_chunked_array(charr, indices):
    indices = np.asarray(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()
    indices[indices < 0] += len(charr)
    if indices.min() < 0:
        raise IndexError()
    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()
    sort_idx = np.argsort(indices)
    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx)  # inverse sort indices
    # btw, we could check if indices are already sorted to avoid an extra copy in this case

    # (chunk, start, end) ranges of sorted indices falling into each chunk
    limit_idx = [(0, 0, 0)]
    for i, cum_length in enumerate(cum_lengths):
        limit_idx.append((i, limit_idx[-1][-1], np.searchsorted(indices, cum_length)))
    limit_idx = limit_idx[1:]
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[j_start:j_end] - cum_lengths[i]))
         for i, j_start, j_end in limit_idx if j_start < j_end])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` 
method (or do we want to avoid an extra copy)? We certainly can avoid global 
indices sorting as well.

> [C++] Implement Take on ChunkedArray for DataFrame use
> --
>
> Key: ARROW-5454
> URL: https://issues.apache.org/jira/browse/ARROW-5454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-2667



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

2019-08-21 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912675#comment-16912675
 ] 

Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/21/19 9:03 PM:
---

If it were in pure Python, we could do something like below (relying on 
`pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa
def take_on_chunked_array(charr, indices):
    indices = np.asarray(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()
    indices[indices < 0] += len(charr)
    if indices.min() < 0:
        raise IndexError()
    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()
    sort_idx = np.argsort(indices)
    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx)  # inverse sort indices
    # btw, we could check if indices are already sorted to avoid an extra copy in this case

    # (chunk, start, end) ranges of sorted indices falling into each chunk
    limit_idx = [(0, 0, 0)]
    for i, cum_length in enumerate(cum_lengths):
        limit_idx.append((i, limit_idx[-1][-1], np.searchsorted(indices, cum_length)))
    limit_idx = limit_idx[1:]
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[j_start:j_end] - cum_lengths[i]))
         for i, j_start, j_end in limit_idx if j_start < j_end])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` 
method (or do we want to avoid an extra copy)? We certainly can avoid global 
indices sorting as well.


was (Author: artemk):
If it were in pure Python, we could do something like below (relying on 
`pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa
def take_on_chunked_array(charr, indices):
    indices = np.asarray(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()
    indices[indices < 0] += len(charr)
    if indices.min() < 0:
        raise IndexError()
    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()
    sort_idx = np.argsort(indices)
    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx)  # inverse sort indices
    # btw, we could check if indices are already sorted to avoid an extra copy in this case

    # (chunk, start, end) ranges of sorted indices falling into each chunk
    limit_idx = [(0, 0, 0)]
    for i, cum_length in enumerate(cum_lengths):
        limit_idx.append((i, limit_idx[-1][-1], np.searchsorted(indices, cum_length)))
    limit_idx = limit_idx[1:]
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[j_start:j_end] - cum_lengths[i]))
         for i, j_start, j_end in limit_idx if j_start < j_end])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` 
method (or do we want to avoid an extra copy)?

> [C++] Implement Take on ChunkedArray for DataFrame use
> --
>
> Key: ARROW-5454
> URL: https://issues.apache.org/jira/browse/ARROW-5454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-2667



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

2019-08-21 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912675#comment-16912675
 ] 

Artem KOZHEVNIKOV commented on ARROW-5454:
--

If it were in pure Python, we could do something like below (relying on 
`pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa
def take_on_chunked_array(charr, indices):
    indices = np.asarray(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()
    indices[indices < 0] += len(charr)
    if indices.min() < 0:
        raise IndexError()
    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()
    sort_idx = np.argsort(indices)
    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx)  # inverse sort indices
    # btw, we could check if indices are already sorted to avoid an extra copy in this case

    # (chunk, start, end) ranges of sorted indices falling into each chunk
    limit_idx = [(0, 0, 0)]
    for i, cum_length in enumerate(cum_lengths):
        limit_idx.append((i, limit_idx[-1][-1], np.searchsorted(indices, cum_length)))
    limit_idx = limit_idx[1:]
    cum_lengths -= lengths
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[j_start:j_end] - cum_lengths[i]))
         for i, j_start, j_end in limit_idx if j_start < j_end])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` 
method (or do we want to avoid an extra copy)?

> [C++] Implement Take on ChunkedArray for DataFrame use
> --
>
> Key: ARROW-5454
> URL: https://issues.apache.org/jira/browse/ARROW-5454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-2667



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5713) [Python] fancy indexing on pa.array

2019-08-16 Thread Artem KOZHEVNIKOV (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909283#comment-16909283
 ] 

Artem KOZHEVNIKOV commented on ARROW-5713:
--

It looks like the `Array.take` function is already available in version 
0.14.2! `Table.take` would be nice as well.
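
A hedged sketch of how that could look in the meantime (table_take is my helper; it assumes a pyarrow where table columns are ChunkedArrays, i.e. >= 0.15, and pays an extra copy to concatenate chunks):

{code:python}
import pyarrow as pa

def table_take(table, indices):
    indices = pa.array(indices)
    taken = [pa.concat_arrays(col.chunks).take(indices)  # extra copy on concat
             for col in table.columns]
    return pa.Table.from_arrays(taken, names=table.column_names)

tbl = pa.Table.from_arrays([pa.array(['a', 'bb', 'ccc', ''])], names=['s'])
table_take(tbl, [0, 3, 2]).column('s')
{code}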

> [Python] fancy indexing on pa.array
> ---
>
> Key: ARROW-5713
> URL: https://issues.apache.org/jira/browse/ARROW-5713
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
>
> In numpy one can do :
> {code:java}
> In [2]: import numpy as np
>
> In [3]: a = np.array(['a', 'bb', 'ccc', ''], dtype="O")
>
> In [4]: indices = np.array([0, -1, 2, 2, 0, 3])
>
> In [5]: a[indices]
> Out[5]: array(['a', '', 'ccc', 'ccc', 'a', ''], dtype=object)
> {code}
> It would be nice to have a similar feature in pyarrow.
> Currently, pa.Array.__getitem__ supports only a slice or a single element as 
> an argument.
> Of course, there are some workarounds using that, like below
> {code:java}
> In [6]: import pyarrow as pa
>
> In [7]: a = pa.array(['a', 'bb', 'ccc', ''])
>
> In [8]: pa.array(a.to_pandas()[indices])  # if len(indices) is high
> Out[8]:
> [
>   "a",
>   "",
>   "ccc",
>   "ccc",
>   "a",
>   ""
> ]
>
> In [9]: pa.array([a[i].as_py() for i in indices])  # if len(indices) is low
> Out[9]:
> [
>   "a",
>   "",
>   "ccc",
>   "ccc",
>   "a",
>   ""
> ]
> {code}
> Neither is memory efficient.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5713) fancy indexing on pa.array

2019-06-24 Thread Artem KOZHEVNIKOV (JIRA)
Artem KOZHEVNIKOV created ARROW-5713:


 Summary: fancy indexing on pa.array
 Key: ARROW-5713
 URL: https://issues.apache.org/jira/browse/ARROW-5713
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: Artem KOZHEVNIKOV


In numpy one can do :
{code:java}
In [2]: import numpy as np

In [3]: a = np.array(['a', 'bb', 'ccc', ''], dtype="O")

In [4]: indices = np.array([0, -1, 2, 2, 0, 3])

In [5]: a[indices]
Out[5]: array(['a', '', 'ccc', 'ccc', 'a', ''], dtype=object)
{code}
It would be nice to have a similar feature in pyarrow.

Currently, pa.Array.__getitem__ supports only a slice or a single element as 
an argument.

Of course, there are some workarounds using that, like below
{code:java}
In [6]: import pyarrow as pa

In [7]: a = pa.array(['a', 'bb', 'ccc', ''])

In [8]: pa.array(a.to_pandas()[indices])  # if len(indices) is high
Out[8]:
[
  "a",
  "",
  "ccc",
  "ccc",
  "a",
  ""
]

In [9]: pa.array([a[i].as_py() for i in indices])  # if len(indices) is low
Out[9]:
[
  "a",
  "",
  "ccc",
  "ccc",
  "a",
  ""
]
{code}
Neither is memory efficient.
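
(Per the 2019-08-16 comment above, this later landed natively as `Array.take`; a minimal usage sketch, with non-negative indices since take's negative-index handling is not shown there:)

{code:python}
import pyarrow as pa

a = pa.array(['a', 'bb', 'ccc', ''])
a.take(pa.array([0, 3, 2, 2, 0, 3]))  # no pandas round-trip needed
{code}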

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present

2019-06-24 Thread Artem KOZHEVNIKOV (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871655#comment-16871655
 ] 

Artem KOZHEVNIKOV commented on ARROW-5208:
--

Yes, I'm aware of Cython's limitations and of modern C++ features (which 
sometimes look very pythonic :). So if I get it right, in Arrow every 
non-trivial computational part is C++-based and Cython is used only to wrap 
the C++ API. Do you consider at some point switching to automatic binding 
generation in Arrow (like in pytorch with pybind11) and getting rid of Cython 
completely (the current C++ modules still look far from being auto-generated)?

> [Python] Inconsistent resulting type during casting in pa.array() when mask 
> is present
> --
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I would expect Int64Array type in all cases below :
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present

2019-06-24 Thread Artem KOZHEVNIKOV (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870879#comment-16870879
 ] 

Artem KOZHEVNIKOV commented on ARROW-5208:
--

Just a curiosity question: what were the reasons for having the 
`python_to_arrow.cc` module in pure C++ rather than, say, in Cython? (To be 
honest, I did not feel comfortable enough to contribute in C++...)

Will passing the mask argument into `InferArrowType` also solve the case 
where `_is_array_like(obj) is True` (or in that case is the inference based 
on pandas)?
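
In the meantime, a hedged workaround sketch (it assumes masked slots are skipped during conversion, which the reproducer suggests): pass an explicit type so inference never decides the result type.

{code:python}
import numpy as np
import pyarrow as pa

mask = np.array([False, True, False, True])
pa.array([4, None, 4, None], type=pa.int64(), mask=mask)  # always Int64Array
{code}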

> [Python] Inconsistent resulting type during casting in pa.array() when mask 
> is present
> --
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> I would expect Int64Array type in all cases below :
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-26 Thread Artem KOZHEVNIKOV (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826896#comment-16826896
 ] 

Artem KOZHEVNIKOV commented on ARROW-5208:
--

Yes, absolutely, it would be nice to get involved! Any docs that would be 
useful to start with? CI best practices?

> [Python] Inconsistent resulting type during casting in pa.array() when mask 
> is present
> --
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.14.0
>
>
> I would expect Int64Array type in all cases below :
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5208) Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-24 Thread Artem KOZHEVNIKOV (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem KOZHEVNIKOV updated ARROW-5208:
-
Description: 
I would expect Int64Array type in all cases below:
{code:java}
>>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
[4, null, 4, null]

>>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
[4, null, 4, null]

>>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
[4, null, 4, null]{code}

  was:
I would expect Int64Array type in all cases below:
{code:java}
pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
[
  4,
  null,
  4,
  null
]

pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
[
  4,
  null,
  4,
  null
]

pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
[   4,   null,   4,   null ]{code}


> Inconsistent resulting type during casting in pa.array() when mask is present
> -
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
>
> I would expect Int64Array type in all cases below :
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
> [4, null, 4, null]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
> [4, null, 4, null]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5208) Inconsistent resulting type during casting in pa.array() when mask is present

2019-04-24 Thread Artem KOZHEVNIKOV (JIRA)
Artem KOZHEVNIKOV created ARROW-5208:


 Summary: Inconsistent resulting type during casting in pa.array() 
when mask is present
 Key: ARROW-5208
 URL: https://issues.apache.org/jira/browse/ARROW-5208
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
Reporter: Artem KOZHEVNIKOV


I would expect Int64Array type in all cases below:
{code:java}
pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))
[
  4,
  null,
  4,
  null
]

pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))
[
  4,
  null,
  4,
  null
]

pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))
[   4,   null,   4,   null ]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5103) Segfault when using chunked_array.to_pandas on arrays of different types (edge case)

2019-04-03 Thread Artem KOZHEVNIKOV (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem KOZHEVNIKOV updated ARROW-5103:
-
Priority: Minor  (was: Major)

> Segfault when using chunked_array.to_pandas on arrays of different types 
> (edge case)
> --
>
> Key: ARROW-5103
> URL: https://issues.apache.org/jira/browse/ARROW-5103
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1
> Environment: pyarrow 0.12.1 py37hf9e6f3b_0 conda-forge
> numpy   1.15.4   py37hacdab7b_0  
> MacOs | gcc7 | what else ?
>Reporter: Artem KOZHEVNIKOV
>Priority: Minor
>
> {code:java}
> import numpy as np
> import pyarrow as pa
> ca = pa.chunked_array([pa.array(['rr'] * 10), pa.array(np.arange(10))])
> ca.type
> ca.to_pandas()
> libc++abi.dylib: terminating with uncaught exception of type 
> std::length_error: basic_string
> Abort trap: 6
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5103) Segfault when using chunked_array.to_pandas on arrays of different types (edge case)

2019-04-03 Thread Artem KOZHEVNIKOV (JIRA)
Artem KOZHEVNIKOV created ARROW-5103:


 Summary: Segfault when using chunked_array.to_pandas on arrays of 
different types (edge case)
 Key: ARROW-5103
 URL: https://issues.apache.org/jira/browse/ARROW-5103
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.12.1
 Environment: pyarrow 0.12.1 py37hf9e6f3b_0 conda-forge
numpy   1.15.4   py37hacdab7b_0  

MacOs | gcc7 | what else ?
Reporter: Artem KOZHEVNIKOV


{code:java}
import numpy as np
import pyarrow as pa

ca = pa.chunked_array([pa.array(['rr'] * 10), pa.array(np.arange(10))])
ca.type
ca.to_pandas()
libc++abi.dylib: terminating with uncaught exception of type std::length_error: 
basic_string
Abort trap: 6
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)