[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395650#comment-16395650 ]

ASF GitHub Bot commented on ARROW-2282:
---

wesm commented on issue #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#issuecomment-372412839

+1, thanks @xhochy!

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> [Python] Create StringArray from buffers
>
> Key: ARROW-2282
> URL: https://issues.apache.org/jira/browse/ARROW-2282
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> While we will add a more general-purpose functionality in
> https://issues.apache.org/jira/browse/ARROW-2281, the interface is more
> complicated than the constructor that explicitly states all arguments:
> {{StringArray(int64_t length, const std::shared_ptr<Buffer>& value_offsets,
> …}}
> Thus I will also expose this explicit constructor.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395652#comment-16395652 ]

ASF GitHub Bot commented on ARROW-2282:
---

wesm closed pull request #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index e785c0ec5c..1e6bc22d39 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -774,8 +774,41 @@ cdef class UnionArray(Array):
         return pyarrow_wrap_array(out)

 cdef class StringArray(Array):
-    pass
+    @staticmethod
+    def from_buffers(int length, Buffer value_offsets, Buffer data,
+                     Buffer null_bitmap=None, int null_count=-1,
+                     int offset=0):
+        """
+        Construct a StringArray from value_offsets and data buffers.
+        If there are nulls in the data, also a null_bitmap and the matching
+        null_count must be passed.
+
+        Parameters
+        ----------
+        length : int
+        value_offsets : Buffer
+        data : Buffer
+        null_bitmap : Buffer, optional
+        null_count : int, default 0
+        offset : int, default 0
+
+        Returns
+        -------
+        string_array : StringArray
+        """
+        cdef shared_ptr[CBuffer] c_null_bitmap
+        cdef shared_ptr[CArray] out
+
+        if null_bitmap is not None:
+            c_null_bitmap = null_bitmap.buffer
+        else:
+            null_count = 0
+
+        out.reset(new CStringArray(
+            length, value_offsets.buffer, data.buffer, c_null_bitmap,
+            null_count, offset))
+        return pyarrow_wrap_array(out)

 cdef class BinaryArray(Array):
     pass
diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd
index 456fcca360..09a6065bcd 100644
--- a/python/pyarrow/includes/libarrow.pxd
+++ b/python/pyarrow/includes/libarrow.pxd
@@ -367,6 +367,12 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil:
         const uint8_t* GetValue(int i, int32_t* length)

     cdef cppclass CStringArray" arrow::StringArray"(CBinaryArray):
+        CStringArray(int64_t length, shared_ptr[CBuffer] value_offsets,
+                     shared_ptr[CBuffer] data,
+                     shared_ptr[CBuffer] null_bitmap,
+                     int64_t null_count,
+                     int64_t offset)
+
         c_string GetString(int i)

     cdef cppclass CStructArray" arrow::StructArray"(CArray):
diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py
index f034d78b39..c710f7cdbe 100644
--- a/python/pyarrow/tests/test_array.py
+++ b/python/pyarrow/tests/test_array.py
@@ -258,6 +258,36 @@ def test_union_from_sparse():
     assert result.to_pylist() == [b'a', 1, b'b', b'c', 2, 3, b'd']


+def test_string_from_buffers():
+    array = pa.array(["a", None, "b", "c"])
+
+    buffers = array.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0], array.null_count,
+        array.offset)
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0])
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    sliced = array[1:]
+    buffers = sliced.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(sliced), buffers[1], buffers[2], buffers[0], -1, sliced.offset)
+    assert copied.to_pylist() == [None, "b", "c"]
+    assert copied.null_count == 1
+
+    # Slice but exclude all null entries so that we don't need to pass
+    # the null bitmap.
+    sliced = array[2:]
+    buffers = sliced.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(sliced), buffers[1], buffers[2], None, -1, sliced.offset)
+    assert copied.to_pylist() == ["b", "c"]
+    assert copied.null_count == 0
+
+
 def _check_cast_case(case, safe=True):
     in_data, in_type, out_data, out_type = case
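
For readers unfamiliar with the three buffers that from_buffers takes, here is a minimal pure-Python sketch of how a validity bitmap, an offsets buffer and a data buffer together describe the test array ["a", None, "b", "c"]. It assumes the usual Arrow variable-length string layout (validity bits stored least-significant-bit first, little-endian int32 offsets, UTF-8 data). The helper decode_string_array is a hypothetical name used only for illustration; it is not part of pyarrow, which does this work in C++.

```python
import struct


def decode_string_array(length, value_offsets, data, null_bitmap=None, offset=0):
    """Decode Arrow-style string buffers into a list of Python values.

    Hypothetical helper for illustration only; not a pyarrow API.
    """
    # The offsets buffer holds at least (offset + length + 1) int32 values.
    count = offset + length + 1
    offsets = struct.unpack_from("<%di" % count, value_offsets)
    result = []
    for i in range(offset, offset + length):
        # Bit i of the bitmap (LSB first) is 1 when slot i is valid.
        if null_bitmap is not None and not (null_bitmap[i // 8] >> (i % 8)) & 1:
            result.append(None)
            continue
        result.append(bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8"))
    return result


# Buffers for ["a", None, "b", "c"]: slots 0, 2 and 3 are valid.
null_bitmap = bytes([0b00001101])
value_offsets = struct.pack("<5i", 0, 1, 1, 2, 3)
data = b"abc"

full = decode_string_array(4, value_offsets, data, null_bitmap)
# Like array[2:] in the test: same buffers, offset 2, no bitmap needed.
sliced = decode_string_array(2, value_offsets, data, None, offset=2)
```

The sliced call reuses the parent buffers and only changes length and offset, which is why the test above can pass None for the bitmap once all null slots are excluded.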
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395459#comment-16395459 ]

ASF GitHub Bot commented on ARROW-2282:
---

xhochy commented on a change in pull request #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#discussion_r173858008

## File path: python/pyarrow/tests/test_array.py ##
@@ -258,6 +258,26 @@ def test_union_from_sparse():
     assert result.to_pylist() == [b'a', 1, b'b', b'c', 2, 3, b'd']


+def test_string_from_buffers():
+    array = pa.array(["a", None, "b", "c"])
+
+    buffers = array.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0], array.null_count,
+        array.offset)
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0])
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    sliced = array[1:]
+    copied = pa.StringArray.from_buffers(
+        len(sliced), buffers[1], buffers[2], buffers[0], -1, sliced.offset)
+    buffers = array.buffers()
+    assert copied.to_pylist() == [None, "b", "c"]

Review comment:
Done and worked out of the box :)
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393510#comment-16393510 ]

ASF GitHub Bot commented on ARROW-2282:
---

wesm commented on a change in pull request #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#discussion_r173561454

## File path: python/pyarrow/tests/test_array.py ##
@@ -258,6 +258,26 @@ def test_union_from_sparse():
     assert result.to_pylist() == [b'a', 1, b'b', b'c', 2, 3, b'd']


+def test_string_from_buffers():
+    array = pa.array(["a", None, "b", "c"])
+
+    buffers = array.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0], array.null_count,
+        array.offset)
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0])
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    sliced = array[1:]
+    copied = pa.StringArray.from_buffers(
+        len(sliced), buffers[1], buffers[2], buffers[0], -1, sliced.offset)
+    buffers = array.buffers()
+    assert copied.to_pylist() == [None, "b", "c"]

Review comment:
We need to add checks for the computed null count, and for the case where the null bitmap is not passed.
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392031#comment-16392031 ]

ASF GitHub Bot commented on ARROW-2282:
---

wesm commented on issue #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#issuecomment-371648673

rebased
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389952#comment-16389952 ]

ASF GitHub Bot commented on ARROW-2282:
---

xhochy commented on a change in pull request #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#discussion_r172944392

## File path: python/pyarrow/array.pxi ##
@@ -761,8 +761,39 @@ cdef class UnionArray(Array):
         return pyarrow_wrap_array(out)

 cdef class StringArray(Array):
-    pass
+    @staticmethod
+    def from_buffers(int length, Buffer value_offsets, Buffer data,
+                     Buffer null_bitmap=None, int null_count=0,
+                     int offset=0):
+        """
+        Construct a StringArray from value_offsets and data buffers.
+        If there are nulls in the data, also a null_bitmap and the matching
+        null_count must be passed.
+
+        Parameters
+        ----------
+        length : int
+        value_offsets : Buffer
+        data : Buffer
+        null_bitmap : Buffer, optional
+        null_count : int, default 0
+        offset : int, default 0
+
+        Returns
+        -------
+        string_array : StringArray
+        """
+        cdef shared_ptr[CBuffer] c_null_bitmap
+        cdef shared_ptr[CArray] out
+
+        if null_bitmap is not None:
+            c_null_bitmap = null_bitmap.buffer

Review comment:
I used the same defaults as we do in C++; we might also want to adjust the behaviour there.
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389795#comment-16389795 ]

ASF GitHub Bot commented on ARROW-2282:
---

wesm commented on a change in pull request #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#discussion_r172912650

## File path: python/pyarrow/array.pxi ##
@@ -761,8 +761,39 @@ cdef class UnionArray(Array):
         return pyarrow_wrap_array(out)

 cdef class StringArray(Array):
-    pass
+    @staticmethod
+    def from_buffers(int length, Buffer value_offsets, Buffer data,
+                     Buffer null_bitmap=None, int null_count=0,
+                     int offset=0):
+        """
+        Construct a StringArray from value_offsets and data buffers.
+        If there are nulls in the data, also a null_bitmap and the matching
+        null_count must be passed.
+
+        Parameters
+        ----------
+        length : int
+        value_offsets : Buffer
+        data : Buffer
+        null_bitmap : Buffer, optional
+        null_count : int, default 0
+        offset : int, default 0
+
+        Returns
+        -------
+        string_array : StringArray
+        """
+        cdef shared_ptr[CBuffer] c_null_bitmap
+        cdef shared_ptr[CArray] out
+
+        if null_bitmap is not None:
+            c_null_bitmap = null_bitmap.buffer

Review comment:
Yes
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389756#comment-16389756 ]

ASF GitHub Bot commented on ARROW-2282:
---

pitrou commented on a change in pull request #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#discussion_r172899796

## File path: python/pyarrow/array.pxi ##
@@ -761,8 +761,39 @@ cdef class UnionArray(Array):
         return pyarrow_wrap_array(out)

 cdef class StringArray(Array):
-    pass
+    @staticmethod
+    def from_buffers(int length, Buffer value_offsets, Buffer data,
+                     Buffer null_bitmap=None, int null_count=0,
+                     int offset=0):
+        """
+        Construct a StringArray from value_offsets and data buffers.
+        If there are nulls in the data, also a null_bitmap and the matching
+        null_count must be passed.
+
+        Parameters
+        ----------
+        length : int
+        value_offsets : Buffer
+        data : Buffer
+        null_bitmap : Buffer, optional
+        null_count : int, default 0
+        offset : int, default 0
+
+        Returns
+        -------
+        string_array : StringArray
+        """
+        cdef shared_ptr[CBuffer] c_null_bitmap
+        cdef shared_ptr[CArray] out
+
+        if null_bitmap is not None:
+            c_null_bitmap = null_bitmap.buffer

Review comment:
Shouldn't null_count default to -1 if not passed explicitly and there's a null bitmap?
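
The -1 discussed in this review comment acts as a sentinel meaning "unknown, compute it from the bitmap". The following is a minimal pure-Python sketch of that convention, mirroring the branch in the merged from_buffers where a missing bitmap implies zero nulls. The name resolve_null_count is hypothetical, and the bit counting only illustrates the idea (assuming least-significant-bit-first validity bits); it is not Arrow's C++ implementation.

```python
def resolve_null_count(length, null_bitmap=None, null_count=-1, offset=0):
    """Return the number of nulls in slots [offset, offset + length).

    Hypothetical helper for illustration only; not a pyarrow API.
    """
    if null_bitmap is None:
        # Without a validity bitmap every slot is valid.
        return 0
    if null_count >= 0:
        # The caller supplied a count; use it as given.
        return null_count
    # Sentinel -1: count the set (valid) bits in the requested range.
    valid = sum(
        (null_bitmap[i // 8] >> (i % 8)) & 1
        for i in range(offset, offset + length)
    )
    return length - valid


bitmap = bytes([0b00001101])  # validity bits for ["a", None, "b", "c"]
whole = resolve_null_count(4, bitmap)           # all four slots
sliced = resolve_null_count(3, bitmap, -1, 1)   # slots 1..3, like array[1:]
no_bitmap = resolve_null_count(2, None, -1, 2)  # like array[2:] without bitmap
```

With a default of 0 instead of -1, a caller who passes a bitmap but no count would silently get an array that reports no nulls, which is the situation the question above points at.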
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388634#comment-16388634 ]

ASF GitHub Bot commented on ARROW-2282:
---

xhochy commented on issue #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#issuecomment-370951959

Depends on https://github.com/apache/arrow/pull/1719
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388629#comment-16388629 ]

ASF GitHub Bot commented on ARROW-2282:
---

xhochy commented on a change in pull request #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#discussion_r172681819

## File path: python/pyarrow/includes/libarrow.pxd ##
@@ -366,6 +366,12 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil:
         const uint8_t* GetValue(int i, int32_t* length)

     cdef cppclass CStringArray" arrow::StringArray"(CBinaryArray):
+        CStringArray(int64_t length, shared_ptr[CBuffer] value_offsets,
+                     shared_ptr[CBuffer] data,

Review comment:
This is a stripped-down version without default arguments or const declarations, as otherwise the compilation would fail with:

```
/home/uwe/Development/arrow-repos-3/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx: In function ‘PyObject* __pyx_pf_7pyarrow_3lib_11StringArray_from_buffers(int, __pyx_obj_7pyarrow_3lib_Buffer*, __pyx_obj_7pyarrow_3lib_Buffer*, __pyx_obj_7pyarrow_3lib_Buffer*, int, int)’:
/home/uwe/Development/arrow-repos-3/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:42804:53: error: aggregate ‘__pyx_pf_7pyarrow_3lib_11StringArray_from_buffers(int, __pyx_obj_7pyarrow_3lib_Buffer*, __pyx_obj_7pyarrow_3lib_Buffer*, __pyx_obj_7pyarrow_3lib_Buffer*, int, int)::__pyx_opt_args_12CStringArray_CStringArray __pyx_t_5’ has incomplete type and cannot be defined
   struct __pyx_opt_args_12CStringArray_CStringArray __pyx_t_5;
```
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388625#comment-16388625 ]

ASF GitHub Bot commented on ARROW-2282:
---

xhochy opened a new pull request #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720