[jira] [Created] (ARROW-11458) PyArrow 1.x and 2.x do not work with numpy 1.20

2021-02-01 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-11458:
-

 Summary: PyArrow 1.x and 2.x do not work with numpy 1.20
 Key: ARROW-11458
 URL: https://issues.apache.org/jira/browse/ARROW-11458
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0, 1.0.1, 1.0.0
Reporter: Zhuo Peng


NumPy 1.20 was released on 1/30, and it is not compatible with libraries that 
were built against numpy<1.16.6, which is the case for pyarrow 1.x and 2.x. 
However, pyarrow does not specify an upper bound for the numpy version [1].

```
Python 3.7.9 (default, Oct 30 2020, 13:50:59)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import numpy as np
>>> np.__version__
'1.20.0'
>>> pa.__version__
'2.0.0'
>>> pa.array(np.arange(10))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
  File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
  File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
```

 

[1] 
https://github.com/apache/arrow/blob/478286658055bb91737394c2065b92a7e92fb0c1/python/setup.py#L572
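
Until an upper bound is declared, a caller could guard against the combination described above. The following is only a sketch in plain Python: `parse_version` and `is_affected` are hypothetical helpers, not part of pyarrow or numpy, and the affected range (pyarrow 1.x or 2.x together with numpy >= 1.20) is taken directly from this report.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.20.0' into a tuple of ints.

    Only the leading purely numeric components are kept; this is a
    simplification and does not handle pre-release tags properly.
    """
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_affected(pyarrow_version, numpy_version):
    """Hypothetical check for the version combination in this report."""
    pyarrow_major = parse_version(pyarrow_version)[0]
    return pyarrow_major in (1, 2) and parse_version(numpy_version) >= (1, 20)
```

In a real session the two arguments would come from pa.__version__ and np.__version__.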

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11232) Table::CombineChunks() returns incorrect results if Table has no column

2021-01-12 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-11232:
-

 Summary: Table::CombineChunks() returns incorrect results if Table 
has no column
 Key: ARROW-11232
 URL: https://issues.apache.org/jira/browse/ARROW-11232
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Zhuo Peng
Assignee: Zhuo Peng


>>> pa.table([[1]], ["a"])
pyarrow.Table
a: int64
>>> t = pa.table([[1]], ["a"])
>>> t.num_rows
1
>>> t1 = t.drop(["a"])
>>> t1.num_rows
1
>>> t2 = t1.combine_chunks()
>>> t2.num_rows
0
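
One plausible explanation, which is an assumption and not confirmed in this report, is that the row count gets re-derived from the columns after combining, which yields 0 when there are none. The toy model below (plain Python, hypothetical class `ToyTable`, not Arrow code) shows the difference between inferring the row count and carrying it over.

```python
class ToyTable:
    """Toy stand-in for a table: named columns, each a list of chunks."""

    def __init__(self, columns, num_rows):
        self.columns = columns    # dict: name -> list of chunks (lists)
        self.num_rows = num_rows  # stored explicitly, not derived

    def _combined_columns(self):
        # Merge every column's chunks into a single chunk.
        return {name: [sum(chunks, [])] for name, chunks in self.columns.items()}

    def combine_chunks_inferring(self):
        combined = self._combined_columns()
        # Row count re-derived from the columns: 0 when there are none.
        rows = max((len(c[0]) for c in combined.values()), default=0)
        return ToyTable(combined, rows)

    def combine_chunks_preserving(self):
        # Row count carried over from the original table.
        return ToyTable(self._combined_columns(), self.num_rows)
```

With a zero-column table of one row, the inferring variant reports 0 rows and the preserving variant reports 1.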





[jira] [Created] (ARROW-9098) RecordBatch::ToStructArray cannot handle record batches with 0 column

2020-06-10 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9098:


 Summary: RecordBatch::ToStructArray cannot handle record batches 
with 0 column
 Key: ARROW-9098
 URL: https://issues.apache.org/jira/browse/ARROW-9098
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Zhuo Peng


If RecordBatch::ToStructArray is called on a record batch with 0 columns, 
the following error will be raised:

Invalid: Can't infer struct array length with 0 child arrays





[jira] [Created] (ARROW-9071) [C++] MakeArrayOfNull makes invalid ListArray

2020-06-08 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9071:


 Summary: [C++] MakeArrayOfNull makes invalid ListArray
 Key: ARROW-9071
 URL: https://issues.apache.org/jira/browse/ARROW-9071
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Zhuo Peng


One way to reproduce this bug is:

 

>>> a = pa.array([[1, 2]])

>>> b = pa.array([None, None], type=pa.null())

>>> t1 = pa.Table.from_arrays([a], ["a"])
>>> t2 = pa.Table.from_arrays([b], ["b"])

 

>>> pa.concat_tables([t1, t2], promote=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
 File "pyarrow/table.pxi", line 2138, in pyarrow.lib.concat_tables
 File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
 File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 0: In chunk 1: Invalid: List child array 
invalid: Invalid: Buffer #1 too small in array of type int64 and length 2: 
expected at least 16 byte(s), got 12

(because concat_tables(promote=True) will call MakeArrayOfNull: 
https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/table.cc#L647)

 

The code here seems incorrect:

[https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/array/util.cc#L218]

The length of the child array of a ListArray is not necessarily equal to the 
length of the ListArray.
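
To illustrate the point, here is a small model of the list layout in plain Python. The function names are made up for this sketch and it is not the Arrow implementation; it only encodes the usual layout rule that a list array of length n has n + 1 offsets and that the child only needs to cover offsets[-1] values.

```python
def validate_list_array(offsets, child_length):
    """Check basic list-layout invariants and return the number of lists."""
    if not offsets:
        raise ValueError("offsets must have at least one entry")
    if any(b < a for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be non-decreasing")
    if child_length < offsets[-1]:
        raise ValueError("child array too short for offsets")
    return len(offsets) - 1


def make_null_list_array(length):
    """Layout of an all-null list array: every list is empty."""
    offsets = [0] * (length + 1)
    validity = [False] * length
    child_length = 0  # no child values are needed, regardless of length
    return offsets, validity, child_length
```

For length 2 this gives offsets [0, 0, 0] and an empty child, so the child length (0) differs from the list array length (2).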





[jira] [Created] (ARROW-9037) [C++/C-ABI] unable to import array with null count == -1 (which could be exported)

2020-06-04 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9037:


 Summary: [C++/C-ABI] unable to import array with null count == -1 
(which could be exported)
 Key: ARROW-9037
 URL: https://issues.apache.org/jira/browse/ARROW-9037
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Zhuo Peng


If an Array is created with null_count == -1 but without any nulls (and thus no 
null bitmap buffer), then ArrayData.null_count will remain -1 when exporting if 
null_count is never computed. The exported C struct also has null_count == -1 
[1]. However, when importing an array that has no null bitmap buffer, an error 
[2] will be raised if null_count != 0.

[1] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L560

[2] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L1404
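
One possible way to handle this on the import side is sketched below in plain Python. `resolve_null_count` is a hypothetical helper, not the code in bridge.cc; it assumes that -1 means "not computed yet" and that a missing bitmap means there are no nulls.

```python
def resolve_null_count(null_count, validity_bitmap):
    """Return a concrete null count for an imported array.

    validity_bitmap is a list of booleans (True = valid) or None when the
    array carries no null bitmap buffer.
    """
    if validity_bitmap is None:
        if null_count in (0, -1):
            # No bitmap means no nulls, so an unknown count resolves to 0.
            return 0
        raise ValueError("non-zero null_count requires a null bitmap")
    if null_count == -1:
        # Unknown count: compute it from the bitmap.
        return sum(1 for valid in validity_bitmap if not valid)
    return null_count
```

With this rule, an array exported with null_count == -1 and no bitmap would import as having 0 nulls instead of raising.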

 





[jira] [Commented] (ARROW-7229) [C++] Unify ConcatenateTables APIs

2020-04-26 Thread Zhuo Peng (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092964#comment-17092964
 ] 

Zhuo Peng commented on ARROW-7229:
--

AFAIK this is done. The API has been unified and an option struct has been 
introduced. Maybe the test cases in table_test.cc could be refactored to 
reflect the API change more closely.

> [C++] Unify ConcatenateTables APIs
> --
>
> Key: ARROW-7229
> URL: https://issues.apache.org/jira/browse/ARROW-7229
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
> Fix For: 1.0.0
>
>
> Today we have ConcatenateTables() and ConcatenateTablesWithPromotion() in 
> C++. It's anticipated that they will allow more customization/tweaking. To 
> avoid complicating the API surface, we should introduce a 
> ConcatenateTableOption object, unify the two functions, and allow further 
> customization to be expressed in that option object.
> Related discussion: 
> [https://lists.apache.org/thread.html/1fa85b078dae09639de04afcf948aad1bfabd48ea8a38e33969495c5@%3Cdev.arrow.apache.org%3E]
>  





[jira] [Created] (ARROW-8277) [Python] RecordBatch interface improvements

2020-03-30 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-8277:


 Summary: [Python] RecordBatch interface improvements
 Key: ARROW-8277
 URL: https://issues.apache.org/jira/browse/ARROW-8277
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Currently __eq__ and __repr__ of RecordBatch are not implemented.

compute::Take also supports RecordBatch inputs, but there is no Python wrapper 
for it.





[jira] [Assigned] (ARROW-7806) [Python] Implement to_pandas for lists of LargeBinary/String

2020-03-05 Thread Zhuo Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhuo Peng reassigned ARROW-7806:


Assignee: Zhuo Peng

> [Python] Implement to_pandas for lists of LargeBinary/String
> 
>
> Key: ARROW-7806
> URL: https://issues.apache.org/jira/browse/ARROW-7806
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For example:
>  
> >>> a = pa.array([['a']], type=pa.list_(pa.large_binary()))
> >>> a.to_pandas()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>  File "pyarrow/array.pxi", line 468, in 
> pyarrow.lib._PandasConvertible.to_pandas
>  File "pyarrow/array.pxi", line 902, in pyarrow.lib.Array._to_pandas
>  File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Not implemented type for lists: 
> large_binary





[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-05 Thread Zhuo Peng (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052531#comment-17052531
 ] 

Zhuo Peng commented on ARROW-1231:
--

I don't work on related stuff, but looking at our internal site, 
google-cloud-cpp seems to be the right choice.

Micah might know more.

 

[https://googleapis.dev/cpp/google-cloud-storage/latest/] seems to be the 
documentation for [https://googleapis.github.io/google-cloud-cpp/] ?

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Assigned] (ARROW-7802) [C++] Support for LargeBinary and LargeString in the hash kernel

2020-03-05 Thread Zhuo Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhuo Peng reassigned ARROW-7802:


Assignee: Zhuo Peng

> [C++] Support for LargeBinary and LargeString in the hash kernel
> 
>
> Key: ARROW-7802
> URL: https://issues.apache.org/jira/browse/ARROW-7802
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently they are not supported:
> https://github.com/apache/arrow/blob/a76e277213e166dbeb148260498995ba053566fb/cpp/src/arrow/compute/kernels/hash.cc#L456





[jira] [Created] (ARROW-7806) [Python] {Array,Table,RecordBatch}.to_pandas() do not support Large variants of ListArray, BinaryArray and StringArray

2020-02-09 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7806:


 Summary: [Python] {Array,Table,RecordBatch}.to_pandas() do not 
support Large variants of ListArray, BinaryArray and StringArray
 Key: ARROW-7806
 URL: https://issues.apache.org/jira/browse/ARROW-7806
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


For example:

 

>>> a = pa.array([['a']], type=pa.list_(pa.large_binary()))
>>> a.to_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
 File "pyarrow/array.pxi", line 468, in pyarrow.lib._PandasConvertible.to_pandas
 File "pyarrow/array.pxi", line 902, in pyarrow.lib.Array._to_pandas
 File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Not implemented type for lists: 
large_binary





[jira] [Created] (ARROW-7802) [C++] Support for LargeBinary and LargeString in the hash kernel

2020-02-07 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7802:


 Summary: [C++] Support for LargeBinary and LargeString in the hash 
kernel
 Key: ARROW-7802
 URL: https://issues.apache.org/jira/browse/ARROW-7802
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


Currently they are not supported:

https://github.com/apache/arrow/blob/a76e277213e166dbeb148260498995ba053566fb/cpp/src/arrow/compute/kernels/hash.cc#L456





[jira] [Commented] (ARROW-7510) [C++] Array::null_count() is not thread-compatible

2020-01-07 Thread Zhuo Peng (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009948#comment-17009948
 ] 

Zhuo Peng commented on ARROW-7510:
--

Yes. Please see the attached articles.

 

> [C++] Array::null_count() is not thread-compatible
> --
>
> Key: ARROW-7510
> URL: https://issues.apache.org/jira/browse/ARROW-7510
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Zhuo Peng
>Priority: Minor
>
> ArrayData has a mutable member, null_count, that can be updated in a const 
> function. However, null_count is not atomic, so it is subject to a data race.
>  
> I guess Arrays are not thread-safe (which is reasonable), but at least they 
> should be thread-compatible, so that concurrent access to const member 
> functions is fine.
> (The race looks "benign", but see [1][2])
> [https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/cpp/src/arrow/array.cc#L123]
>  
> [1][https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong]
> [2][https://bartoszmilewski.com/2014/10/25/dealing-with-benign-data-races-the-c-way/]





[jira] [Created] (ARROW-7510) [C++] Array::null_count() is not thread-compatible

2020-01-07 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7510:


 Summary: [C++] Array::null_count() is not thread-compatible
 Key: ARROW-7510
 URL: https://issues.apache.org/jira/browse/ARROW-7510
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Zhuo Peng


ArrayData has a mutable member, null_count, that can be updated in a const 
function. However, null_count is not atomic, so it is subject to a data race.

I guess Arrays are not thread-safe (which is reasonable), but at least they 
should be thread-compatible, so that concurrent access to const member functions 
is fine.

(The race looks "benign", but see [1][2])

[https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/cpp/src/arrow/array.cc#L123]

 

[1][https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong]

[2][https://bartoszmilewski.com/2014/10/25/dealing-with-benign-data-races-the-c-way/]





[jira] [Assigned] (ARROW-7096) [C++] Add options structs for concatenation-with-promotion and schema unification

2019-12-19 Thread Zhuo Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhuo Peng reassigned ARROW-7096:


Assignee: Zhuo Peng

> [C++] Add options structs for concatenation-with-promotion and schema 
> unification
> -
>
> Key: ARROW-7096
> URL: https://issues.apache.org/jira/browse/ARROW-7096
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Zhuo Peng
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-6625





[jira] [Created] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets

2019-12-09 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7362:


 Summary: [Python] ListArray.flatten() should take care of slicing 
offsets
 Key: ARROW-7362
 URL: https://issues.apache.org/jira/browse/ARROW-7362
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Currently ListArray.flatten() simply returns the child array. If a ListArray is 
a slice of another ListArray, the two share the same child array. However, I 
think flatten() should return an Array that is the concatenation of all the 
sub-lists in the ListArray, so the slicing offset should be taken into account.

 

For example:

a = pa.array([[1], [2], [3]])
assert a.flatten().equals(pa.array([1, 2, 3]))
# expected:
assert a.slice(1).flatten().equals(pa.array([2, 3]))
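
The expected behavior can be sketched in plain Python as follows. This is only a model of the layout, not the pyarrow implementation: `offsets` and `values` stand for the buffers a slice shares with its parent, and the hypothetical parameters `start` and `length` stand for the slice's own offset and length.

```python
def flatten(offsets, values, start=0, length=None):
    """Flatten the lists [start, start + length) of a list array."""
    if length is None:
        length = len(offsets) - 1 - start
    first = offsets[start]          # first child value used by the slice
    last = offsets[start + length]  # one past the last child value used
    return values[first:last]
```

For [[1], [2], [3]] the buffers are offsets [0, 1, 2, 3] and values [1, 2, 3]; flattening the slice that starts at 1 then uses only values[1:3].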





[jira] [Created] (ARROW-7229) [C++] Unify ConcatenateTables APIs

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7229:


 Summary: [C++] Unify ConcatenateTables APIs
 Key: ARROW-7229
 URL: https://issues.apache.org/jira/browse/ARROW-7229
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Today we have ConcatenateTables() and ConcatenateTablesWithPromotion() in C++. 
It's anticipated that they will allow more customization/tweaking. To avoid 
complicating the API surface, we should introduce a ConcatenateTableOption 
object, unify the two functions, and allow further customization to be 
expressed in that option object.

Related discussion: 
[https://lists.apache.org/thread.html/1fa85b078dae09639de04afcf948aad1bfabd48ea8a38e33969495c5@%3Cdev.arrow.apache.org%3E]

 





[jira] [Created] (ARROW-7228) [Python] Expose RecordBatch.FromStructArray in Python.

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7228:


 Summary: [Python] Expose RecordBatch.FromStructArray in Python.
 Key: ARROW-7228
 URL: https://issues.apache.org/jira/browse/ARROW-7228
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 1.0.0


This API was introduced in ARROW-6243. It will make converting a list of 
Python dicts to a RecordBatch easier:

struct_array = pa.array([{"column1": 1, "column2": 5}, {"column2": 6}])
record_batch = pa.RecordBatch.from_struct_array(struct_array)
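
Conceptually, the conversion turns rows into columns. The sketch below shows that idea in plain Python with a hypothetical helper `dicts_to_columns`; it assumes that a key missing from a row becomes a null (None) in that column and says nothing about how pyarrow implements it.

```python
def dicts_to_columns(rows):
    """Turn a list of dicts into {column name: list of values}.

    Column order follows first appearance; missing keys become None.
    """
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return {name: [row.get(name) for row in rows] for name in names}
```

For the two dicts in the example above, this gives column1 = [1, None] and column2 = [5, 6].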





[jira] [Created] (ARROW-7227) [Python] Provide wrappers for ConcatenateWithPromotion()

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7227:


 Summary: [Python] Provide wrappers for ConcatenateWithPromotion()
 Key: ARROW-7227
 URL: https://issues.apache.org/jira/browse/ARROW-7227
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 1.0.0


[https://github.com/apache/arrow/pull/5534] introduced 
ConcatenateWithPromotion() to C++. Provide a Python wrapper for it.

 





[jira] [Created] (ARROW-6878) [Python] pa.array() does not handle list of dicts with bytes keys correctly under python3

2019-10-14 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-6878:


 Summary: [Python] pa.array() does not handle list of dicts with 
bytes keys correctly under python3
 Key: ARROW-6878
 URL: https://issues.apache.org/jira/browse/ARROW-6878
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


It creates sub-arrays with nulls filled, instead of the provided values.

$ python

Python 3.6.8 (default, Jan 3 2019, 03:42:36) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> pa.__version__
'0.15.0'
>>> a = pa.array([{b"a": [1, 2, 3]}])
>>> a

-- is_valid: all not null
-- child 0 type: list
 [
 null
 ]
>>> a = pa.array([{"a": [1, 2, 3]}])
>>> a

-- is_valid: all not null
-- child 0 type: list
 [
 [
 1,
 2,
 3
 ]
 ]

 

It works under python2.





[jira] [Created] (ARROW-6848) [C++] Specify -std=c++11 instead of -std=gnu++11 when building

2019-10-10 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-6848:


 Summary: [C++] Specify -std=c++11 instead of -std=gnu++11 when 
building
 Key: ARROW-6848
 URL: https://issues.apache.org/jira/browse/ARROW-6848
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


Relevant discussion:

[https://lists.apache.org/thread.html/5807e65d865c1736b3a7a32653ca8bb405d719eb13b8a10b6fe0e904@%3Cdev.arrow.apache.org%3E]

In addition to

set(CMAKE_CXX_STANDARD 11)

we also need to

set(CMAKE_CXX_EXTENSIONS OFF)

in order to turn off compiler-specific extensions (with GCC, leaving them on 
results in -std=gnu++11).

 

This is supposed to be a no-op, because Arrow builds fine with other compilers 
(Clang-LLVM / MSVC). I am opening this bug to track any issues with flipping 
the switch.

 





[jira] [Created] (ARROW-6775) Proposal for several Array utility functions

2019-10-02 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-6775:


 Summary: Proposal for several Array utility functions
 Key: ARROW-6775
 URL: https://issues.apache.org/jira/browse/ARROW-6775
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Zhuo Peng


Hi,

We developed several utilities that compute / access certain properties of 
Arrays, and wonder whether it makes sense to get them upstream (into both the 
C++ API and pyarrow) and, assuming yes, where the best place to put them is.

Maybe I have overlooked existing APIs that already do the same; in that case, 
please point them out.

 

1/ ListLengthFromListArray(ListArray&)

Returns the lengths of the lists in a ListArray, as an Int32Array (or 
Int64Array for large lists). For example:

[[1, 2, 3], [], None] => [3, 0, 0] (or [3, 0, None], but we hope the returned 
array can be converted to numpy)

 

2/ GetBinaryArrayTotalByteSize(BinaryArray&)

Returns the total byte size of a BinaryArray (basically offsets[n] - 
offsets[0], where n is the array length).

Alternatively, a BinaryArray::Flatten() -> UInt8Array would work.

 

3/ GetArrayNullBitmapAsByteArray(Array&)

Returns the array's null bitmap as a UInt8Array (which can be efficiently 
converted to a bool numpy array)

 

4/ GetFlattenedArrayParentIndices(ListArray&)

Makes an int32 array of the same length as the flattened ListArray. 
returned_array[i] == j means the i-th element in the flattened ListArray came 
from the j-th list in the ListArray.


For example [[1,2,3], [], None, [4,5]] => [0, 0, 0, 3, 3]
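
To make the four proposals concrete, here is a rough sketch over plain Python lists. It is not the proposed C++ or pyarrow code: `offsets` models an offsets buffer (one entry more than the array length), `validity` is a list of booleans (True = valid), and the function names are made up for this illustration. Null lists are reported with length 0, which is the first of the two options mentioned in 1/.

```python
def list_lengths(offsets, validity):
    """1/ Length of each list; null lists are reported as 0."""
    return [
        (offsets[i + 1] - offsets[i]) if validity[i] else 0
        for i in range(len(validity))
    ]


def total_byte_size(offsets):
    """2/ Number of bytes spanned by a binary array's offsets."""
    return offsets[-1] - offsets[0]


def null_bitmap_as_bytes(validity):
    """3/ One byte per element: 1 for valid, 0 for null (an assumption)."""
    return [1 if valid else 0 for valid in validity]


def parent_indices(offsets):
    """4/ For each flattened value, the index of the list it came from."""
    indices = []
    for i in range(len(offsets) - 1):
        indices.extend([i] * (offsets[i + 1] - offsets[i]))
    return indices
```

For [[1,2,3], [], None, [4,5]] the offsets would be [0, 3, 3, 3, 5], assuming the null list occupies no child values.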

 





[jira] [Commented] (ARROW-6625) [Python] Allow concat_tables to null or default fill missing columns

2019-09-26 Thread Zhuo Peng (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938956#comment-16938956
 ] 

Zhuo Peng commented on ARROW-6625:
--

Daniel / Wes, are you working on implementing this? I'm also interested in this 
feature and if you are not working on it I can take it.

I have a small amendment to this FR though: does it make sense to allow 
concatenating a column of type Null (NullArray) with any other type of column? 
The result would be again a column of the other type, with null filled for the 
rows in the NullArray.
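
A toy version of the amendment, in plain Python, might look like the following. The representation is an assumption made for this sketch: each chunk is a (type name, values) pair, the string "null" stands for the Null type, and `concat_columns` is a hypothetical name.

```python
NULL_TYPE = "null"  # stand-in for the Null type in this sketch


def concat_columns(chunks):
    """Concatenate (type_name, values) chunks, promoting null-typed ones.

    A null-typed chunk takes on the type of the other chunks; its rows
    stay None in the result.
    """
    concrete = {type_name for type_name, _ in chunks if type_name != NULL_TYPE}
    if len(concrete) > 1:
        raise TypeError("cannot unify types: %s" % sorted(concrete))
    result_type = concrete.pop() if concrete else NULL_TYPE
    values = []
    for _, chunk_values in chunks:
        values.extend(chunk_values)
    return result_type, values
```

Concatenating an int64 chunk [1, 2] with a null chunk of two rows yields an int64 column [1, 2, None, None].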

 

> [Python] Allow concat_tables to null or default fill missing columns
> 
>
> Key: ARROW-6625
> URL: https://issues.apache.org/jira/browse/ARROW-6625
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Daniel Nugent
>Priority: Minor
> Fix For: 1.0.0
>
>
> The concat_tables function currently requires schemas to be identical across 
> all tables to be concat'ed together. However, tables occasionally are 
> conforming on type where present, but a column will be absent.
> In this case, allowing for null filling (or default filling) would be ideal.
> I imagine this feature would be an optional parameter on the concat_tables 
> function. Presumably the argument could be either a boolean in the case of 
> blanket null filling, or a mapping type for default filling. If a user wanted 
> to default fill some columns, but null fill others, they could use a None as 
> the value (defaultdict would make it simple to provide a blanket null fill if 
> only a few default value columns were desired).
> If a mapping wasn't present, the function should probably raise an error.
> The default behavior would be the current and thus the default value of the 
> parameter should be False or None.





[jira] [Commented] (ARROW-5894) [C++] libgandiva.so.14 is exporting libstdc++ symbols

2019-07-15 Thread Zhuo Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885414#comment-16885414
 ] 

Zhuo Peng commented on ARROW-5894:
--

[https://github.com/apache/arrow/pull/4883]

> [C++] libgandiva.so.14 is exporting libstdc++ symbols
> -
>
> Key: ARROW-5894
> URL: https://issues.apache.org/jira/browse/ARROW-5894
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Affects Versions: 0.14.0
>Reporter: Zhuo Peng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For example:
> $ nm libgandiva.so.14 | grep "once_proxy"
> 018c0a10 T __once_proxy
>  
> many other symbols are also exported which I guess shouldn't be (e.g. LLVM 
> symbols)
>  
> There seems to be no linker script for libgandiva.so (there was, but was 
> never used and got deleted? 
> [https://github.com/apache/arrow/blob/9265fe35b67db93f5af0b47e92e039c637ad5b3e/cpp/src/gandiva/symbols-helpers.map]).
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5894) libgandiva.so.14 is exporting libstdc++ symbols

2019-07-09 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5894:


 Summary: libgandiva.so.14 is exporting libstdc++ symbols
 Key: ARROW-5894
 URL: https://issues.apache.org/jira/browse/ARROW-5894
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Affects Versions: 0.14.0
Reporter: Zhuo Peng


For example:

$ nm libgandiva.so.14 | grep "once_proxy"
018c0a10 T __once_proxy

 

Many other symbols are also exported which I guess shouldn't be (e.g. LLVM 
symbols).

 

There seems to be no linker script for libgandiva.so (there was, but was never 
used and got deleted? 
[https://github.com/apache/arrow/blob/9265fe35b67db93f5af0b47e92e039c637ad5b3e/cpp/src/gandiva/symbols-helpers.map]).

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5749) [Python] Add Python binding for Table::CombineChunks()

2019-06-26 Thread Zhuo Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873711#comment-16873711
 ] 

Zhuo Peng commented on ARROW-5749:
--

[https://github.com/apache/arrow/pull/4712]

> [Python] Add Python binding for Table::CombineChunks()
> --
>
> Key: ARROW-5749
> URL: https://issues.apache.org/jira/browse/ARROW-5749
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
> Fix For: 0.14.0
>
>






[jira] [Commented] (ARROW-5635) Support "compacting" a table

2019-06-17 Thread Zhuo Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866045#comment-16866045
 ] 

Zhuo Peng commented on ARROW-5635:
--

[https://github.com/apache/arrow/pull/4598]

> Support "compacting" a table
> 
>
> Key: ARROW-5635
> URL: https://issues.apache.org/jira/browse/ARROW-5635
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Zhuo Peng
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A column in a table might consist of multiple chunks. I'm proposing a 
> Table.Compact() method that returns a table whose columns each have just one 
> chunk, which is the concatenation of the corresponding column's chunks.
>  
>  





[jira] [Created] (ARROW-5635) Support "compacting" a table

2019-06-17 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5635:


 Summary: Support "compacting" a table
 Key: ARROW-5635
 URL: https://issues.apache.org/jira/browse/ARROW-5635
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Zhuo Peng


A column in a table might consist of multiple chunks. I'm proposing a 
Table.Compact() method that returns a table whose columns each have just one 
chunk, which is the concatenation of the corresponding column's chunks.
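
As a rough illustration of the intended semantics (not the proposed C++ implementation), the sketch below models a table as a dict that maps column names to lists of chunks, with plain Python lists as chunks. The name `compact` mirrors the proposed method but is otherwise hypothetical.

```python
def compact(table):
    """Return a copy of table in which each column has exactly one chunk."""
    compacted = {}
    for name, chunks in table.items():
        merged = []
        for chunk in chunks:
            merged.extend(chunk)  # concatenate the chunks in order
        compacted[name] = [merged]
    return compacted
```

The input table is left unchanged; only the returned copy has single-chunk columns.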

 

 





[jira] [Commented] (ARROW-5554) Add a python wrapper for arrow::Concatenate

2019-06-11 Thread Zhuo Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861143#comment-16861143
 ] 

Zhuo Peng commented on ARROW-5554:
--

[https://github.com/apache/arrow/pull/4519]

> Add a python wrapper for arrow::Concatenate
> ---
>
> Key: ARROW-5554
> URL: https://issues.apache.org/jira/browse/ARROW-5554
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Zhuo Peng
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-5554) Add a python wrapper for arrow::Concatenate

2019-06-11 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5554:


 Summary: Add a python wrapper for arrow::Concatenate
 Key: ARROW-5554
 URL: https://issues.apache.org/jira/browse/ARROW-5554
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.14.0
Reporter: Zhuo Peng








[jira] [Created] (ARROW-5528) Concatenate() crashes when concatenating empty binary arrays.

2019-06-07 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5528:


 Summary: Concatenate() crashes when concatenating empty binary 
arrays.
 Key: ARROW-5528
 URL: https://issues.apache.org/jira/browse/ARROW-5528
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Zhuo Peng
 Fix For: 0.14.0


[https://github.com/brills/arrow/commit/42063bb5297f34d9b98e831264c47add2da68591]


