[jira] [Created] (ARROW-17539) Reading a StructArray column with an ExtensionType causes segfault
Jim Pivarski created ARROW-17539: Summary: Reading a StructArray column with an ExtensionType causes segfault Key: ARROW-17539 URL: https://issues.apache.org/jira/browse/ARROW-17539 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 9.0.0 Reporter: Jim Pivarski

We can make nested columns in a Parquet file by putting a {{pa.StructArray}} in a {{pa.Table}} and writing that Table to Parquet. We can selectively read back that nested column by specifying it with dot syntax: {{pq.ParquetFile("f.parquet").read_row_groups([0], ["table_column.struct_field"])}}

But if the Arrow types are ExtensionTypes, then the above causes a segfault. The segfault depends on both the nested struct field and the ExtensionTypes. Here is a minimal reproducer that reads a nested struct field without extension types and does not segfault. (I'm building the {{pa.StructArray}} manually with {{from_buffers}} because I'll have to add the ExtensionTypes in the next example.)

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

one = pa.Array.from_buffers(
    pa.int64(),
    3,
    [None, pa.py_buffer(np.array([10, 20, 30], dtype=np.int64))],
)
two = pa.Array.from_buffers(
    pa.float64(),
    3,
    [None, pa.py_buffer(np.array([1.1, 2.2, 3.3], dtype=np.float64))],
)
record = pa.Array.from_buffers(
    pa.struct([
        pa.field("one", one.type, False),
        pa.field("two", two.type, False),
    ]),
    3,
    [None],
    children=[one, two],
)
assert record.to_pylist() == [
    {"one": 10, "two": 1.1},
    {"one": 20, "two": 2.2},
    {"one": 30, "two": 3.3},
]

table = pa.Table.from_arrays([record], names=["column"])
pq.write_table(table, "record.parquet")

table2 = pq.ParquetFile("record.parquet").read_row_groups([0], ["column.one"])
assert table2.to_pylist() == [
    {"column": {"one": 10}},
    {"column": {"one": 20}},
    {"column": {"one": 30}},
]
{code}

So far, so good; no segfault.

Next, we define and register an ExtensionType,

{code:python}
import json

class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")

    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        print(storage_type, annotation)
        return cls(storage_type, annotation)

    @property
    def num_buffers(self):
        return self.storage_type.num_buffers

    @property
    def num_fields(self):
        return self.storage_type.num_fields

pa.register_extension_type(AnnotatedType(pa.null(), None))
{code}

build the {{pa.StructArray}} again,

{code:python}
one = pa.Array.from_buffers(
    AnnotatedType(pa.int64(), {"annotated": "one"}),
    3,
    [None, pa.py_buffer(np.array([10, 20, 30], dtype=np.int64))],
)
two = pa.Array.from_buffers(
    AnnotatedType(pa.float64(), {"annotated": "two"}),
    3,
    [None, pa.py_buffer(np.array([1.1, 2.2, 3.3], dtype=np.float64))],
)
record = pa.Array.from_buffers(
    AnnotatedType(
        pa.struct([
            pa.field("one", one.type, False),
            pa.field("two", two.type, False),
        ]),
        {"annotated": "record"},
    ),
    3,
    [None],
    children=[one, two],
)
assert record.to_pylist() == [
    {"one": 10, "two": 1.1},
    {"one": 20, "two": 2.2},
    {"one": 30, "two": 3.3},
]
{code}

Now when we write this and read it back, there's a segfault:

{code:python}
table = pa.Table.from_arrays([record], names=["column"])
pq.write_table(table, "record_annotated.parquet")

print("before segfault")
table2 = pq.ParquetFile("record_annotated.parquet").read_row_groups([0], ["column.one"])
print("after segfault")
{code}

The output, which prints each annotation as the ExtensionType is deserialized, is

{code:java}
before segfault
int64 {'annotated': 'one'}
double {'annotated': 'two'}
int64 {'annotated': 'one'}
double {'annotated': 'two'}
struct<one: extension<my:app<AnnotatedType>> not null, two: extension<my:app<AnnotatedType>> not null> {'annotated': 'record'}
Segmentation fault (core dumped)
{code}

Note that if we read back that file, {{record_annotated.parquet}}, without the ExtensionType, everything is fine:

{code:java}
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table2 = pq.ParquetFile("record_annotated.parquet").read_row_groups([0], ["column.one"])
>>> assert table2.to_pylist() == [
...     {"column": {"one": 10}},
...     {"column": {"one": 20}},
...     {"column": {"one": 30}},
... ]
{code}

and if we register the ExtensionType but don't select a column,
[jira] [Created] (ARROW-16348) ParquetWriter use_compliant_nested_type=True does not preserve ExtensionArray when reading back
Jim Pivarski created ARROW-16348: Summary: ParquetWriter use_compliant_nested_type=True does not preserve ExtensionArray when reading back Key: ARROW-16348 URL: https://issues.apache.org/jira/browse/ARROW-16348 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 7.0.0 Environment: pyarrow 7.0.0 installed via pip. Reporter: Jim Pivarski I've been happily making ExtensionArrays, but recently noticed that they aren't preserved by round-trips through Parquet files when {{{}use_compliant_nested_type=True{}}}. Consider this writer.py: {code:java} import json import numpy as np import pyarrow as pa import pyarrow.parquet as pq class AnnotatedType(pa.ExtensionType): def __init__(self, storage_type, annotation): self.annotation = annotation super().__init__(storage_type, "my:app") def __arrow_ext_serialize__(self): return json.dumps(self.annotation).encode() @classmethod def __arrow_ext_deserialize__(cls, storage_type, serialized): annotation = json.loads(serialized.decode()) return cls(storage_type, annotation) @property def num_buffers(self): return self.storage_type.num_buffers @property def num_fields(self): return self.storage_type.num_fields pa.register_extension_type(AnnotatedType(pa.null(), None)) array = pa.Array.from_buffers( AnnotatedType(pa.list_(pa.float64()), {"cool": "beans"}), 3, [None, pa.py_buffer(np.array([0, 3, 3, 5], np.int32))], children=[pa.array([1.1, 2.2, 3.3, 4.4, 5.5])], ) table = pa.table({"": array}) print(table) pq.write_table(table, "tmp.parquet", use_compliant_nested_type=True) {code} And this reader.py: {code:java} import json import numpy as np import pyarrow as pa import pyarrow.parquet as pq class AnnotatedType(pa.ExtensionType): def __init__(self, storage_type, annotation): self.annotation = annotation super().__init__(storage_type, "my:app") def __arrow_ext_serialize__(self): return json.dumps(self.annotation).encode() @classmethod def __arrow_ext_deserialize__(cls, storage_type, serialized): annotation = json.loads(serialized.decode()) return cls(storage_type, annotation) @property def num_buffers(self): return self.storage_type.num_buffers @property def num_fields(self): return self.storage_type.num_fields pa.register_extension_type(AnnotatedType(pa.null(), None)) table = pq.read_table("tmp.parquet") print(table) {code} (The AnnotatedType is the same; I wrote it twice for explicitness.) When the writer.py has {{{}use_compliant_nested_type=False{}}}, the output is {code:java} % python writer.py pyarrow.Table : extension> : [[[1.1,2.2,3.3],[],[4.4,5.5]]] % python reader.py pyarrow.Table : extension> : [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code} In other words, the AnnotatedType is preserved. When {{{}use_compliant_nested_type=True{}}}, however, {code:java} % rm tmp.parquet rm: remove regular file 'tmp.parquet'? y % python writer.py pyarrow.Table : extension> : [[[1.1,2.2,3.3],[],[4.4,5.5]]] % python reader.py pyarrow.Table : list child 0, element: double : [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code} The issue doesn't seem to be in the writing, but in the reading: regardless of whether {{use_compliant_nested_type}} is {{True}} or {{{}False{}}}, I can see the extension metadata in the Parquet → Arrow converted schema. 
{code:java}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<item: double>
  child 0, item: double
-- field metadata --
ARROW:extension:metadata: '{"cool": "beans"}'
ARROW:extension:name: 'my:app'
{code}
versus
{code:java}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<element: double>
  child 0, element: double
-- field metadata --
ARROW:extension:metadata: '{"cool": "beans"}'
ARROW:extension:name: 'my:app'
{code}
Note that the first has "{{item: double}}" and the second has "{{element: double}}". (I'm also rather surprised that {{use_compliant_nested_type=False}} is an option. Wouldn't you want the Parquet files to always be written with compliant lists? I noticed this when I was having trouble getting the data into BigQuery.) -- This message was sent by Atlassian Jira (v8.20.7#820007)
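One way to recover the extension type after reading, shown here only as a sketch: it assumes {{AnnotatedType}} from reader.py above is already defined and registered in the session, and that the file holds a single chunk. The {{ARROW:extension:name}} and {{ARROW:extension:metadata}} field metadata survive in the file schema, so the storage column can be re-wrapped by hand even though the list child has been renamed to "element".

{code:python}
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Assumes AnnotatedType from reader.py is defined and registered in this session.
table = pq.read_table("tmp.parquet")
field = pq.ParquetFile("tmp.parquet").schema.to_arrow_schema().field(0)
meta = field.metadata or {}
if b"ARROW:extension:name" in meta:
    # Rebuild the extension type around the storage type that was actually read
    # (list<element: double>), then re-attach it to the column.
    annotation = json.loads(meta[b"ARROW:extension:metadata"])
    ext_type = AnnotatedType(table.column(0).type, annotation)
    column = pa.ExtensionArray.from_storage(ext_type, table.column(0).chunk(0))
    table = pa.table({field.name: column})
print(table)
{code}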
[jira] [Created] (ARROW-14770) Direct (individualized) access to definition levels, repetition levels, and numeric data of a column
Jim Pivarski created ARROW-14770: Summary: Direct (individualized) access to definition levels, repetition levels, and numeric data of a column Key: ARROW-14770 URL: https://issues.apache.org/jira/browse/ARROW-14770 Project: Apache Arrow Issue Type: New Feature Components: C++, Parquet, Python Reporter: Jim Pivarski It would be useful to have more low-level access to the three components of a Parquet column in Python: the definition levels, the repetition levels, and the numeric data, {_}individually{_}. The particular use-case we have in Awkward Array is that users will sometimes lazily read an array of lists of structs without reading any of the fields of those structs. To build the data structure, we need the lengths of the lists independently of the columns (which users can then use in functions like {{{}ak.num{}}}; the number of structs without their field values is useful information). What we're doing right now is reading a column, converting it to Arrow ({{{}pa.Array{}}}), and getting the list lengths from that Arrow array. We have been using the schema to try to pick the smallest column (booleans are best!), but that's because we really just want the definition and repetition levels without the numeric data. I've heard that the Parquet metadata includes offsets to select just the definition levels, just the repetition levels, or just the numeric data (pre-decompression?). Exposing those in Python as {{pa.Buffer}} objects would be ideal. Beyond our use case, such a feature could also help with wide structs in lists: all of the non-nullable fields of the struct would share the same definition and repetition levels, so they don't need to be re-read. For that use-case, the ability to pick out definition, repetition, and numeric data separately would still be useful, but the purpose would be to read the numeric data without the structural integers (opposite of ours). The desired interface would be like {{{}ParquetFile.read_row_group{}}}, but would return one, two, or three {{pa.Buffer}} objects depending on three boolean arguments, {{{}definition{}}}, {{{}repetition{}}}, and {{{}numeric{}}}. The {{pa.Buffer}} would be unpacked, with all run-length encodings and fixed-width encodings converted into integers of at least one byte each. It may make more sense for the output to be {{{}np.ndarray{}}}, to carry {{dtype}} information if that can depend on the maximum level (though levels larger than 255 are likely rare!). This information must be available at some level in Arrow's C++ code; the request is to expose it to Python. I've labeled this minor because it is for optimizations, but it would be really nice to have! -- This message was sent by Atlassian Jira (v8.20.1#820001)
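The current workaround described above can be made concrete with a short sketch; the file name and column path here are hypothetical placeholders, and only existing pyarrow API is used:

{code:python}
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Sketch of the present-day workaround, not the requested feature: read one narrow
# leaf column of the list-of-structs and keep only the list lengths.
column = pq.ParquetFile("data.parquet").read_row_group(0, ["lists.list.item.flag"])["lists"]
lengths = pc.list_value_length(column)  # length of each list; null lists stay null
{code}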
[jira] [Created] (ARROW-14547) Reading FixedSizeListArray from Parquet with nulls
Jim Pivarski created ARROW-14547: Summary: Reading FixedSizeListArray from Parquet with nulls Key: ARROW-14547 URL: https://issues.apache.org/jira/browse/ARROW-14547 Project: Apache Arrow Issue Type: Bug Components: Parquet, Python Affects Versions: 6.0.0 Reporter: Jim Pivarski

This one is easy to describe: given an array of fixed-size lists, some of which are null,

{code:python}
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> a = pa.FixedSizeListArray.from_arrays(np.arange(10), 5).take([0, None])
>>> a
[ [ 0, 1, 2, 3, 4 ], null ]
{code}

you can write them to a Parquet file, but not read them back:

{code:python}
>>> pa.parquet.write_table(pa.table({"": a}), "tmp.parquet")
>>> pa.parquet.read_table("tmp.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected all lists to be of size=5 but index 2 had size=0
{code}

It could be that, at some level, the second list is considered to be empty. For completeness, this doesn't happen if the fixed-size lists have no nulls:

{code:python}
>>> b = pa.FixedSizeListArray.from_arrays(np.arange(10), 5)
>>> b
[ [ 0, 1, 2, 3, 4 ], [ 5, 6, 7, 8, 9 ] ]
>>> pa.parquet.write_table(pa.table({"": b}), "tmp2.parquet")
>>> pa.parquet.read_table("tmp2.parquet")
pyarrow.Table
: fixed_size_list<item: int64>[5]
  child 0, item: int64
: [[[0,1,2,3,4],[5,6,7,8,9]]]
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
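A possible interim workaround, sketched below on the assumption that giving up the fixed-size layout is acceptable (the file name is a placeholder): rebuild the column as a variable-size list before writing, which does round-trip nulls.

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Sketch of a workaround, not a fix: materialize the fixed-size lists as a
# variable-size ListArray, which survives the Parquet round-trip with nulls.
a = pa.FixedSizeListArray.from_arrays(np.arange(10), 5).take([0, None])
as_list = pa.array(a.to_pylist(), type=pa.list_(pa.int64()))
pq.write_table(pa.table({"": as_list}), "tmp_list.parquet")
assert pq.read_table("tmp_list.parquet")[""].to_pylist() == [[0, 1, 2, 3, 4], None]
{code}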
[jira] [Created] (ARROW-14525) Writing DictionaryArrays with ExtensionType to Parquet
Jim Pivarski created ARROW-14525: Summary: Writing DictionaryArrays with ExtensionType to Parquet Key: ARROW-14525 URL: https://issues.apache.org/jira/browse/ARROW-14525 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 6.0.0 Reporter: Jim Pivarski Thanks to some help I got from [~jorisvandenbossche], I can create DictionaryArrays with ExtensionType (on just the dictionary, the dictionary array itself, or both). However, these extended-DictionaryArrays can't be written to Parquet files. To start, let's set up my minimal reproducer ExtensionType, this time with an explicit ExtensionArray: {code:python} >>> import json >>> import numpy as np >>> import pyarrow as pa >>> import pyarrow.parquet >>> >>> class AnnotatedArray(pa.ExtensionArray): ... pass ... >>> class AnnotatedType(pa.ExtensionType): ... def __init__(self, storage_type, annotation): ... self.annotation = annotation ... super().__init__(storage_type, "my:app") ... def __arrow_ext_serialize__(self): ... return json.dumps(self.annotation).encode() ... @classmethod ... def __arrow_ext_deserialize__(cls, storage_type, serialized): ... annotation = json.loads(serialized.decode()) ... return cls(storage_type, annotation) ... def __arrow_ext_class__(self): ... return AnnotatedArray ... >>> pa.register_extension_type(AnnotatedType(pa.null(), None)) {code} A non-extended DictionaryArray could be built like this: {code:python} >>> dictarray = pa.DictionaryArray.from_arrays( ... np.array([3, 2, 2, 2, 0, 1, 3], np.int32), ... pa.Array.from_buffers( ... pa.float64(), ... 4, ... [ ... None, ... pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])), ... ], ... ), ... ) >>> dictarray -- dictionary: [ 0, 1.1, 2.2, 3.3 ] -- indices: [ 3, 2, 2, 2, 0, 1, 3 ] {code} I can write it to a file and read it back, though the fact that it comes back as a non-DictionaryArray might be part of the problem. Is some decision being made about the array of indices being too short to warrant dictionary encoding? {code:python} >>> pa.parquet.write_table(pa.table({"": dictarray}), "tmp.parquet") >>> pa.parquet.read_table("tmp.parquet") pyarrow.Table : double : [[3.3,2.2,2.2,2.2,0,1.1,3.3]] {code} Anyway, the next step is to make a DictionaryArray with ExtensionTypes. In this example, I'm making both the dictionary and the outer DictionaryArray itself be extended: {code:python} >>> dictionary_type = AnnotatedType(pa.float64(), "inner annotation") >>> dictarray_type = AnnotatedType( ... pa.dictionary(pa.int32(), dictionary_type), "outer annotation" ... ) >>> dictarray_ext = AnnotatedArray.from_storage( ... dictarray_type, ... pa.DictionaryArray.from_arrays( ... np.array([3, 2, 2, 2, 0, 1, 3], np.int32), ... pa.Array.from_buffers( ... dictionary_type, ... 4, ... [ ... None, ... pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])), ... ], ... ), ... ) ... 
) >>> dictarray_ext <__main__.AnnotatedArray object at 0x7f8c71ec7ee0> -- dictionary: [ 0, 1.1, 2.2, 3.3 ] -- indices: [ 3, 2, 2, 2, 0, 1, 3 ] {code} This can't be written to a Parquet file: {code:python} >>> pa.parquet.write_table(pa.table({"": dictarray_ext}), "tmp2.parquet") Traceback (most recent call last): File "", line 1, in File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 2034, in write_table writer.write_table(table, row_group_size=row_group_size) File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 701, in write_table self.writer.write_table(table, row_group_size=row_group_size) File "pyarrow/_parquet.pyx", line 1451, in pyarrow._parquet.ParquetWriter.write_table File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status pyarrow.lib.ArrowNotImplementedError: Unsupported cast from dictionary>, indices=int32, ordered=0> to extension> (no available cast function for target type) {code} My first thought was maybe the data used in the dictionary must be simple (it's usually strings). So how about making the outer DictionaryArray extended, but the inner dictionary not extended? The type definitions are now inline. {code:python} >>> dictarray_partial = AnnotatedArray.from_storage( ... AnnotatedType( # extended, but the content is not ... pa.dictionary(pa.int32(), pa.float64()), "only annotation" ... ), ... pa.DictionaryArray.from_arrays( ... np.array([3, 2, 2, 2, 0, 1, 3], np.int32), ... pa.Array.from_buffers( ... pa.float64(), # not extended ...
[jira] [Created] (ARROW-14522) Can't read empty-but-for-nulls data from Parquet if it has an ExtensionType
Jim Pivarski created ARROW-14522: Summary: Can't read empty-but-for-nulls data from Parquet if it has an ExtensionType Key: ARROW-14522 URL: https://issues.apache.org/jira/browse/ARROW-14522 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 6.0.0 Reporter: Jim Pivarski Here's a corner case: suppose that I have data with type null, but it can have missing values so the whole array consists of nothing but nulls. In real life, this might only happen inside a nested data structure, at some level where an untyped data source (e.g. nested Python lists) had no entries so a type could not be determined. We expect to be able to write and read this data to and from Parquet, and we can—as long as it doesn't have an ExtensionType. Here's an example that works, _without_ ExtensionType: {code:python} >>> import json >>> import numpy as np >>> import pyarrow as pa >>> import pyarrow.parquet >>> >>> validbits = np.packbits(np.ones(14, dtype=np.uint8), bitorder="little") >>> empty_but_for_nulls = pa.Array.from_buffers( ... pa.null(), 14, [pa.py_buffer(validbits)], null_count=14 ... ) >>> empty_but_for_nulls 14 nulls >>> >>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp.parquet") >>> pa.parquet.read_table("tmp.parquet") pyarrow.Table : null : [14 nulls] {code} And here's a continuation of that example, which doesn't work because the type {{pa.null()}} is replaced by {{AnnotatedType(pa.null(), \{"cool": "beans"})}}: {code:python} >>> class AnnotatedType(pa.ExtensionType): ... def __init__(self, storage_type, annotation): ... self.annotation = annotation ... super().__init__(storage_type, "my:app") ... def __arrow_ext_serialize__(self): ... return json.dumps(self.annotation).encode() ... @classmethod ... def __arrow_ext_deserialize__(cls, storage_type, serialized): ... annotation = json.loads(serialized.decode()) ... return cls(storage_type, annotation) ... >>> pa.register_extension_type(AnnotatedType(pa.null(), None)) >>> >>> empty_but_for_nulls = pa.Array.from_buffers( ... AnnotatedType(pa.null(), {"cool": "beans"}), ... 14, ... [pa.py_buffer(validbits)], ... null_count=14, ... ) >>> empty_but_for_nulls 14 nulls >>> >>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp2.parquet") >>> pa.parquet.read_table("tmp2.parquet") Traceback (most recent call last): File "", line 1, in File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read table = self._dataset.to_table( File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Array of type extension> has 14 nulls but no null bitmap {code} If "nullable type null" were outside the set of types that should be writable to Parquet, then it would not work for the non-ExtensionType or it would fail on writing, not reading, so I'm quite sure this is a bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14485) ParquetFile.read_row_group looses struct nullability when selecting one column from a struct
Jim Pivarski created ARROW-14485: Summary: ParquetFile.read_row_group looses struct nullability when selecting one column from a struct Key: ARROW-14485 URL: https://issues.apache.org/jira/browse/ARROW-14485 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 6.0.0 Reporter: Jim Pivarski Attachments: test8.parquet This appeared minutes ago because we have a test suite that saw Arrow 6.0.0 land in PyPI. (Congrats, by the way! I've been looking forward to this one!) Below, you'll see one thing that version 6 fixed (asking for one column in a nested struct returns only that one column) and a new error (it does not preserve nullability of the surrounding struct). Here, I'll write down the steps to reproduce and then explain. {code:python} Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pyarrow.parquet >>> pyarrow.__version__ '5.0.0' >>> file = pyarrow.parquet.ParquetFile("test8.parquet") >>> file.schema required group field_id=-1 schema { required group field_id=-1 x (List) { repeated group field_id=-1 list { required group field_id=-1 item { required int64 field_id=-1 y; required double field_id=-1 z; } } } } >>> file.schema_arrow x: large_list not null> not null child 0, item: struct not null child 0, y: int64 not null child 1, z: double not null >>> file.read_row_group(0, ["x.list.item.y"]).schema x: large_list not null> not null child 0, item: struct not null child 0, y: int64 not null child 1, z: double not null >>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema x: large_list not null> not null child 0, item: struct not null child 0, y: int64 not null child 1, z: double not null >>> file.read_row_group(0).schema x: large_list not null> not null child 0, item: struct not null child 0, y: int64 not null child 1, z: double not null Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pyarrow.parquet >>> pyarrow.__version__ '6.0.0' >>> file = pyarrow.parquet.ParquetFile("test8.parquet") >>> file.schema required group field_id=-1 schema { required group field_id=-1 x (List) { repeated group field_id=-1 list { required group field_id=-1 item { required int64 field_id=-1 y; required double field_id=-1 z; } } } } >>> file.schema_arrow x: large_list not null> not null child 0, item: struct not null child 0, y: int64 not null child 1, z: double not null >>> file.read_row_group(0, ["x.list.item.y"]).schema x: large_list> not null child 0, item: struct child 0, y: int64 not null >>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema x: large_list not null> not null child 0, item: struct not null child 0, y: int64 not null child 1, z: double not null >>> file.read_row_group(0).schema x: large_list not null> not null child 0, item: struct not null child 0, y: int64 not null child 1, z: double not null {code} In Arrow 5, asking for only column {{"x.list.item.y"}} returns a struct of type {{x: large_list not null> not null}}, which was undesirable because it has unnecessarily read the {{"z"}} column, but it got all of the {{"not null"}} types right. In test8.parquet, the data are non-nullable at each level. 
In Arrow 6, asking for only column {{"x.list.item.y"}} returns a struct of type {{x: large_list> not null}}, which is great because it's not reading the {{"z"}} column, but the struct's nullability is wrong: we should see three {{"not nulls"}} here, one for the data in {{y}}, one for the {{struct}}, and one for the {{list}}. It's just missing the middle one. When I ask for two columns specifically or don't specify the columns, the nullability is correct. I think that can help to narrow it down. I've attached the file (test8.parquet). It was the same in both of the above tests (generated by Arrow 5). I labeled this as "Python" because I've only seen the symptom in Python, but I suspect that the actual error is in C++. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13040) to_pandas_dtype values are wrong or unimplemented for date-time types
Jim Pivarski created ARROW-13040: Summary: to_pandas_dtype values are wrong or unimplemented for date-time types Key: ARROW-13040 URL: https://issues.apache.org/jira/browse/ARROW-13040 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 4.0.0 Reporter: Jim Pivarski

Most of them mistakenly assume nanoseconds, but some are not implemented. Here's the complete run-down:

{code:python}
date32/date64/time32/time64
---------------------------
>>> pyarrow.date32()
DataType(date32[day])
>>> pyarrow.date32().to_pandas_dtype()
dtype('<M8[ns]')
>>> pyarrow.date64()
DataType(date64[ms])
>>> pyarrow.date64().to_pandas_dtype()
dtype('<M8[ns]')
>>> pyarrow.time32("s")
Time32Type(time32[s])
>>> pyarrow.time32("s").to_pandas_dtype()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/types.pxi", line 200, in pyarrow.lib.DataType.to_pandas_dtype
NotImplementedError: time32[s]
>>> pyarrow.time32("ms")
Time32Type(time32[ms])
>>> pyarrow.time32("ms").to_pandas_dtype()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/types.pxi", line 200, in pyarrow.lib.DataType.to_pandas_dtype
NotImplementedError: time32[ms]
>>> pyarrow.time64("us")
Time64Type(time64[us])
>>> pyarrow.time64("us").to_pandas_dtype()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/types.pxi", line 200, in pyarrow.lib.DataType.to_pandas_dtype
NotImplementedError: time64[us]
>>> pyarrow.time64("ns")
Time64Type(time64[ns])
>>> pyarrow.time64("ns").to_pandas_dtype()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/types.pxi", line 200, in pyarrow.lib.DataType.to_pandas_dtype
NotImplementedError: time64[ns]

timestamp
---------
>>> pyarrow.timestamp("s")
TimestampType(timestamp[s])
>>> pyarrow.timestamp("s").to_pandas_dtype()
dtype('<M8[ns]')
>>> pyarrow.timestamp("ms")
TimestampType(timestamp[ms])
>>> pyarrow.timestamp("ms").to_pandas_dtype()
dtype('<M8[ns]')
>>> pyarrow.timestamp("us")
TimestampType(timestamp[us])
>>> pyarrow.timestamp("us").to_pandas_dtype()
dtype('<M8[ns]')
>>> pyarrow.timestamp("ns")
TimestampType(timestamp[ns])
>>> pyarrow.timestamp("ns").to_pandas_dtype()
dtype('<M8[ns]')

duration
--------
>>> pyarrow.duration("s")
DurationType(duration[s])
>>> pyarrow.duration("s").to_pandas_dtype()
dtype('<m8[ns]')
>>> pyarrow.duration("ms")
DurationType(duration[ms])
>>> pyarrow.duration("ms").to_pandas_dtype()
dtype('<m8[ns]')
>>> pyarrow.duration("us")
DurationType(duration[us])
>>> pyarrow.duration("us").to_pandas_dtype()
dtype('<m8[ns]')
>>> pyarrow.duration("ns")
DurationType(duration[ns])
>>> pyarrow.duration("ns").to_pandas_dtype()
dtype('<m8[ns]')
{code}
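For comparison, a sketch of a unit-preserving mapping that an application could use until {{to_pandas_dtype()}} covers these cases; this helper is my own and not part of pyarrow, and the "correct" numpy targets for date32/date64 are an assumption.

{code:python}
import numpy as np
import pyarrow as pa

def temporal_numpy_dtype(arrow_type):
    # Hypothetical helper, not a pyarrow API: map Arrow temporal types to numpy
    # dtypes that keep the unit, instead of always reporting nanoseconds.
    if pa.types.is_timestamp(arrow_type):
        return np.dtype("datetime64[{}]".format(arrow_type.unit))
    if pa.types.is_duration(arrow_type):
        return np.dtype("timedelta64[{}]".format(arrow_type.unit))
    if pa.types.is_date32(arrow_type):
        return np.dtype("datetime64[D]")
    if pa.types.is_date64(arrow_type):
        return np.dtype("datetime64[ms]")
    raise NotImplementedError(str(arrow_type))

assert temporal_numpy_dtype(pa.timestamp("s")) == np.dtype("datetime64[s]")
assert temporal_numpy_dtype(pa.duration("ms")) == np.dtype("timedelta64[ms]")
{code}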
[jira] [Created] (ARROW-10930) In pyarrow, LargeListArray doesn't have a value_field
Jim Pivarski created ARROW-10930: Summary: In pyarrow, LargeListArray doesn't have a value_field Key: ARROW-10930 URL: https://issues.apache.org/jira/browse/ARROW-10930 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 2.0.0 Reporter: Jim Pivarski

This one is easy: it looks like the LargeListType is just missing this field. Here it is for a 32-bit list (the reason I want this is to get at the "nullable" field, although the "metadata" would be nice, too):

{code:java}
>>> import pyarrow as pa
>>> small_array = pa.ListArray.from_arrays(pa.array([0, 3, 3, 5]), pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
>>> small_array.type.value_field
pyarrow.Field<item: double>
>>> small_array.type.value_field.nullable
True
{code}

Now with a large list:

{code:java}
>>> large_array = pa.LargeListArray.from_arrays(pa.array([0, 3, 3, 5]), pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
>>> large_array.type.value_field
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.LargeListType' object has no attribute 'value_field'
{code}

Verifying version:

{code:java}
>>> pa.__version__
'2.0.0'
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9801) DictionaryArray with non-unique values are silently corrupted when written to a Parquet file
Jim Pivarski created ARROW-9801: --- Summary: DictionaryArray with non-unique values are silently corrupted when written to a Parquet file Key: ARROW-9801 URL: https://issues.apache.org/jira/browse/ARROW-9801 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 1.0.0 Environment: pyarrow 1.0.0 installed from conda-forge. Reporter: Jim Pivarski

Suppose that you have a DictionaryArray with repeated values in the dictionary:

{code:python}
>>> import pyarrow as pa
>>> pa_array = pa.DictionaryArray.from_arrays(
...     pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),
...     pa.array(["one", "two", "three", "one", "two", "three"])
... )
>>> pa_array
-- dictionary:
  ["one", "two", "three", "one", "two", "three"]
-- indices:
  [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
{code}

According to [the documentation|https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout],
{quote}Dictionary encoding is a data representation technique to represent values by integers referencing a *dictionary* usually consisting of unique values.
{quote}
so a DictionaryArray like the one above is arguably invalid, but if so, then I'd expect some error messages, rather than corrupt data, when I try to write it to a Parquet file.

{code:python}
>>> pa_table = pa.Table.from_batches(
...     [pa.RecordBatch.from_arrays([pa_array], ["column"])]
... )
>>> pa_table
pyarrow.Table
column: dictionary<values=string, indices=int64, ordered=0>
>>> import pyarrow.parquet
>>> pyarrow.parquet.write_table(pa_table, "tmp2.parquet")
{code}

No errors so far. So we try to read it back and view it:

{code:python}
>>> pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")
>>> pa_loaded
pyarrow.Table
column: dictionary<values=string, indices=int32, ordered=0>
>>> pa_loaded.to_pydict()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict
  File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist
  File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist
  File "pyarrow/scalar.pxi", line 712, in pyarrow.lib.DictionaryScalar.as_py
  File "pyarrow/scalar.pxi", line 701, in pyarrow.lib.DictionaryScalar.value.__get__
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status
pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only 3 long
{code}

Looking more closely at this, we see that the dictionary has been minimized to include only unique values, but the indices haven't been correctly translated:

{code:python}
>>> pa_loaded["column"]
-- dictionary:
  ["one", "two", "three"]
-- indices:
  [0, 1, 2, 3, 0, 1, 1, 1, 2, 3, 0, 1]
{code}

It looks like an attempt was made to minimize it, but the indices ought to be [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]. I don't know what your preferred course of action is—adding an error message or fixing the attempted conversion—but this is wrong. On my side, I'm adding code to prevent the creation of non-unique values in DictionaryArrays.
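That prevention step can be sketched as follows; this helper is my own illustration, not a pyarrow API: deduplicate the dictionary and remap the indices before handing the array to the Parquet writer.

{code:python}
import numpy as np
import pyarrow as pa

def deduplicate(dict_array):
    # Hypothetical helper, not part of pyarrow: collapse repeated dictionary
    # values and remap the indices so the encoding uses unique values only.
    unique, inverse = np.unique(dict_array.dictionary.to_pylist(), return_inverse=True)
    new_indices = inverse[dict_array.indices.to_numpy()]
    return pa.DictionaryArray.from_arrays(
        pa.array(new_indices, type=dict_array.indices.type), pa.array(unique)
    )

pa_array = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),
    pa.array(["one", "two", "three", "one", "two", "three"]),
)
assert deduplicate(pa_array).to_pylist() == pa_array.to_pylist()
{code}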
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9577) posix_madvise error on Debian in pyarrow 1.0.0
Jim Pivarski created ARROW-9577: --- Summary: posix_madvise error on Debian in pyarrow 1.0.0 Key: ARROW-9577 URL: https://issues.apache.org/jira/browse/ARROW-9577 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 1.0.0 Environment: Installed with Miniconda (for Debian; used pip for the Ubuntu test) Reporter: Jim Pivarski

The following writes to and reads back from a Parquet file in both pyarrow 0.17.0 and 1.0.0 on Ubuntu 18.04:

{code:java}
>>> import pyarrow.parquet
>>> a = pyarrow.array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> t = pyarrow.Table.from_batches([pyarrow.RecordBatch.from_arrays([a], ["stuff"])])
>>> pyarrow.parquet.write_table(t, "stuff.parquet")
>>> t2 = pyarrow.parquet.read_table("stuff.parquet")
{code}

However, the same thing raises the following exception on Debian 9 (stretch) in pyarrow 1.0.0 but not in pyarrow 0.17.0:

{code:java}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1564, in read_table
    filters=filters,
  File "/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1433, in __init__
    partitioning=partitioning)
  File "/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 667, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 434, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 1451, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: posix_madvise failed. Detail: [errno 0] Success
{code}

It's a little odd that the error says it failed with "detail: success". That suggests to me that an "if" predicate is backward (missing a "not"), which might only be triggered on some OS/distributions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9556) Segfaults in UnionArray with null values
Jim Pivarski created ARROW-9556: --- Summary: Segfaults in UnionArray with null values Key: ARROW-9556 URL: https://issues.apache.org/jira/browse/ARROW-9556 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 1.0.0 Environment: Conda, but pyarrow was installed using pip (in the conda environment) Reporter: Jim Pivarski Extracting null values from a UnionArray containing nulls and constructing a UnionArray with a bitmask in pyarrow.Array.from_buffers causes segfaults in pyarrow 1.0.0. I have an environment with pyarrow 0.17.0 and all of the following run correctly without segfaults in the older version. Here's a UnionArray that works (because there are no nulls): {code:java} # GOOD a = pyarrow.UnionArray.from_sparse( pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()), [ pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]), pyarrow.array([True, True, False, True, False]), ], ) a.to_pylist(){code} Here's one the fails when you try a.to_pylist() or even just a[2], because one of the children has a null at 2: {code:java} # SEGFAULT a = pyarrow.UnionArray.from_sparse( pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()), [ pyarrow.array([0.0, 1.1, None, 3.3, 4.4]), pyarrow.array([True, True, False, True, False]), ], ) a.to_pylist() # also just a[2] causes a segfault{code} Here's another that fails because both children have nulls; the segfault occurs at both positions with nulls: {code:java} # SEGFAULT a = pyarrow.UnionArray.from_sparse( pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()), [ pyarrow.array([0.0, 1.1, None, 3.3, 4.4]), pyarrow.array([True, None, False, True, False]), ], ) a.to_pylist() # also a[1] and a[2] cause segfaults{code} Here's one that succeeds, but it's dense, rather than sparse: {code:java} # GOOD a = pyarrow.UnionArray.from_dense( pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()), pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()), [pyarrow.array([0.0, 1.1, 2.2, 3.3]), pyarrow.array([True, True, False])], ) a.to_pylist(){code} Here's a dense that fails because one child has a null: {code:java} # SEGFAULT a = pyarrow.UnionArray.from_dense( pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()), pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()), [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, True, False])], ) a.to_pylist() # also just a[3] causes a segfault{code} Here's a dense that fails in two positions because both children have a null: {code:java} # SEGFAULT a = pyarrow.UnionArray.from_dense( pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()), pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()), [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, None, False])], ) a.to_pylist() # also a[3] and a[5] cause segfaults{code} In all of the above, we created the UnionArray using its from_dense method. We could instead create it with pyarrow.Array.from_buffers. If created with content0 and content1 that have no nulls, it's fine, but if created with nulls in the content, it segfaults as soon as you view the null value. 
{code:java} # GOOD content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]) content1 = pyarrow.array([True, True, False, True, False]) # SEGFAULT content0 = pyarrow.array([0.0, 1.1, 2.2, None, 4.4]) content1 = pyarrow.array([True, True, False, True, False]) types = pyarrow.union( [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)], "sparse", [0, 1], ) a = pyarrow.Array.from_buffers( types, 5, [ None, pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)), ], children=[content0, content1], ) a.to_pylist() # also just a[3] causes a segfault{code} Similarly for a dense union. {code:java} # GOOD content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3]) content1 = pyarrow.array([True, True, False]) # SEGFAULT content0 = pyarrow.array([0.0, 1.1, None, 3.3]) content1 = pyarrow.array([True, True, False]) types = pyarrow.union( [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)], "dense", [0, 1], ) a = pyarrow.Array.from_buffers( types, 7, [ None, pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 0, 1, 1], numpy.int8)), pyarrow.py_buffer(numpy.array([0, 0, 1, 2, 3, 1, 2], numpy.int32)), ], children=[content0, content1], ) a.to_pylist() # also just a[3] causes a segfault{code} The next segfaults are different: instead of putting the null values in the content, we put the null value in the UnionArray itself. This time, it segfaults when it is being created. It also prints some output (all of the above were silent segfaults). {code:java} # SEGFAULT (even to create) content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]) content1 = pyarrow.array([True, True, False, True, False]) types = pyarrow.union( [pyarrow.field("0", content0.type), pyarrow.field("1",
[jira] [Commented] (ARROW-5870) Development compile instructions need to include "make" and "re2"
[ https://issues.apache.org/jira/browse/ARROW-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879969#comment-16879969 ] Jim Pivarski commented on ARROW-5870: - Actually, only "make" is needed; "re2" comes in the "sudo apt-get install" line that I didn't realize was part of the installation (because it was under the "using pip" section and I was using conda, but it is needed for Boost). But even after installing Boost (and re2) with apt-get and re-running cmake, I'm running into "undefined reference to `boost::system::detail::generic_category_ncx()'" errors. I think this is due to a missing boost_system, but I can't see from the instructions on [https://arrow.apache.org/docs/python/development.html] what's missing. I had thought this was a simple omission from the instructions (and therefore an easy "bug" fix), but it's beginning to look like a long installation struggle. Should I move this to the Arrow developers mailing list? > Development compile instructions need to include "make" and "re2" > - > > Key: ARROW-5870 > URL: https://issues.apache.org/jira/browse/ARROW-5870 > Project: Apache Arrow > Issue Type: Bug > Components: Website >Affects Versions: 0.14.0 >Reporter: Jim Pivarski >Priority: Major > Labels: documentation > > Following the build instructions on > [https://arrow.apache.org/docs/python/development.html] > using conda—I additionally needed to install the "make" and "re2" packages > for cmake to succeed. These are such common packages, it probably didn't come > up in your tests, but I have a minimal system. > (It's not done with "make", but it looks promising so far.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5870) Development compile instructions need to include "make" and "re2"
Jim Pivarski created ARROW-5870: --- Summary: Development compile instructions need to include "make" and "re2" Key: ARROW-5870 URL: https://issues.apache.org/jira/browse/ARROW-5870 Project: Apache Arrow Issue Type: Bug Components: Website Affects Versions: 0.14.0 Reporter: Jim Pivarski Following the build instructions on [https://arrow.apache.org/docs/python/development.html] using conda—I additionally needed to install the "make" and "re2" packages for cmake to succeed. These are such common packages, it probably didn't come up in your tests, but I have a minimal system. (It's not done with "make", but it looks promising so far.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5869) [Python] Need a way to access UnionArray's children as Arrays in pyarrow
[ https://issues.apache.org/jira/browse/ARROW-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879965#comment-16879965 ] Jim Pivarski commented on ARROW-5869: - I'm not currently in a position to do that, never having built pyarrow before. I could look for instructions and try it out, but not immediately. > [Python] Need a way to access UnionArray's children as Arrays in pyarrow > > > Key: ARROW-5869 > URL: https://issues.apache.org/jira/browse/ARROW-5869 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 >Reporter: Jim Pivarski >Priority: Major > > > There doesn't seem to be a way to get to the children of sparse or dense > UnionArrays. For other types, there's > * ListType: array.flatten() > * StructType: array.field("fieldname") > * DictionaryType: array.indices and now array.dictionary (in 0.14.0) > * (other types have no children, I think...) > The reason this comes up now is that I have a downstream library that does a > zero-copy view of Arrow by recursively walking over its types and > interpreting the list of buffers for each type. In the past, I didn't need > the _array_ children of each array—I popped the right number of buffers off > the list depending on the type—but now the dictionary for DictionaryType has > been moved from the type object to the array object (in 0.14.0). Since it's > neither in the buffers list, nor in the type tree, I need to walk the tree of > arrays in tandem with the tree of types. > That would be okay, except that I don't see how to descend from a UnionArray > to its children. > This is the function where I do the walk down types (tpe), and now arrays > (array), while interpreting the right number of buffers at each step. > [https://github.com/scikit-hep/awkward-array/blob/7c5961405cc39bbf2b489fad171652019c8de41b/awkward/arrow.py#L228-L364] > Simply exposing the std::vector named "children" as a Python sequence or a > child(int i) method would provide a way to descend UnionTypes and make this > kind of access uniform across all types. > Alternatively, putting the array.dictionary in the list of buffers would also > do it (and make it unnecessary for me to walk over the arrays), but in > general it seems like a good idea to make arrays accessible. It seems like it > belongs in the buffers, but that would probably be a big change, not to be > undertaken for minor reasons. > Thanks! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5869) Need a way to access UnionArray's children as Arrays in pyarrow
Jim Pivarski created ARROW-5869: --- Summary: Need a way to access UnionArray's children as Arrays in pyarrow Key: ARROW-5869 URL: https://issues.apache.org/jira/browse/ARROW-5869 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.0 Reporter: Jim Pivarski There doesn't seem to be a way to get to the children of sparse or dense UnionArrays. For other types, there's * ListType: array.flatten() * StructType: array.field("fieldname") * DictionaryType: array.indices and now array.dictionary (in 0.14.0) * (other types have no children, I think...) The reason this comes up now is that I have a downstream library that does a zero-copy view of Arrow by recursively walking over its types and interpreting the list of buffers for each type. In the past, I didn't need the _array_ children of each array—I popped the right number of buffers off the list depending on the type—but now the dictionary for DictionaryType has been moved from the type object to the array object (in 0.14.0). Since it's neither in the buffers list, nor in the type tree, I need to walk the tree of arrays in tandem with the tree of types. That would be okay, except that I don't see how to descend from a UnionArray to its children. This is the function where I do the walk down types (tpe), and now arrays (array), while interpreting the right number of buffers at each step. [https://github.com/scikit-hep/awkward-array/blob/7c5961405cc39bbf2b489fad171652019c8de41b/awkward/arrow.py#L228-L364] Simply exposing the std::vector named "children" as a Python sequence or a child(int i) method would provide a way to descend UnionTypes and make this kind of access uniform across all types. Alternatively, putting the array.dictionary in the list of buffers would also do it (and make it unnecessary for me to walk over the arrays), but in general it seems like a good idea to make arrays accessible. It seems like it belongs in the buffers, but that would probably be a big change, not to be undertaken for minor reasons. Thanks! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
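For illustration, a sketch of what the requested child access would look like; the {{field(i)}} accessor shown here is an assumption (it is not in pyarrow 0.14.0, which this report is about), so this is the shape of the requested API rather than a workaround.

{code:python}
import pyarrow as pa

# Sketch of the requested access pattern; UnionArray.field(i) is assumed here,
# not available in pyarrow 0.14.0.
a = pa.UnionArray.from_sparse(
    pa.array([0, 1, 0], type=pa.int8()),
    [pa.array([1.1, 2.2, 3.3]), pa.array([True, False, True])],
)
first = a.field(0)   # child array of doubles
second = a.field(1)  # child array of booleans
{code}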
[jira] [Commented] (ARROW-2295) Add to_numpy functions
[ https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394223#comment-16394223 ] Jim Pivarski commented on ARROW-2295: - Array.buffers() must be a new feature, after 0.8.0. I'll look for it in the next release. Thanks! > Add to_numpy functions > -- > > Key: ARROW-2295 > URL: https://issues.apache.org/jira/browse/ARROW-2295 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Lawrence Chan >Priority: Minor > > There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to > propose that we include both. > Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho > is very confusing :). I think it would be more intuitive for the > `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` > objects, and the `to_numpy()` functions to return `numpy.ndarray` and either > a ordered dict of `numpy.ndarray` or a structured `numpy.ndarray` depending > on a flag, for example. The `to_pandas()` function is of course welcome to > use the `to_numpy()` func to avoid the additional index and whatnot of the > `pandas.Series`. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2295) Add to_numpy functions
[ https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394196#comment-16394196 ] Jim Pivarski commented on ARROW-2295: - I second this and would like to request that the Numpy interface has more low-level access to Arrow structures. For instance, ListArray is internally represented as two arrays: offsets and contents, and there are applications where we'd want to get a zero-copy view of these arrays. The to_pandas() function constructs a Numpy object array of subarrays, which is a performance bottleneck if you really do want the original offsets and contents. This function could be an inverse of pyarrow.ListArray.from_arrays, something that returns the offsets and contents as Numpy arrays for a List and something more complex for general cases (a dict from strings representing a place in the hierarchy to Numpy arrays?). A simpler interface that could be implemented immediately would be one that returns the raw bytes of the Arrow buffer, to let us identify its contents using [the Arrow spec|[https://github.com/apache/arrow/blob/master/format/Layout.md].] But that doesn't make use of the dtype (probably just set it to uint8) and would probably make more sense as a raw __buffer__. (Should that be a separate JIRA ticket?) > Add to_numpy functions > -- > > Key: ARROW-2295 > URL: https://issues.apache.org/jira/browse/ARROW-2295 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Lawrence Chan >Priority: Minor > > There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to > propose that we include both. > Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho > is very confusing :). I think it would be more intuitive for the > `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` > objects, and the `to_numpy()` functions to return `numpy.ndarray` and either > a ordered dict of `numpy.ndarray` or a structured `numpy.ndarray` depending > on a flag, for example. The `to_pandas()` function is of course welcome to > use the `to_numpy()` func to avoid the additional index and whatnot of the > `pandas.Series`. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
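To make the request concrete, here is a sketch of the kind of zero-copy access described above, using {{Array.buffers()}} from later pyarrow releases (so it is not something 0.8.0 could do); the slicing assumes a ListArray with no validity offset.

{code:python}
import numpy as np
import pyarrow as pa

arr = pa.ListArray.from_arrays(
    pa.array([0, 3, 3, 5]), pa.array([1.1, 2.2, 3.3, 4.4, 5.5])
)
# buffers() lists the validity and offsets buffers first; view the offsets as int32
# without copying, then take the flattened contents as a numpy view.
validity, offsets_buffer = arr.buffers()[:2]
offsets = np.frombuffer(offsets_buffer, dtype=np.int32)[: len(arr) + 1]
contents = arr.flatten().to_numpy(zero_copy_only=True)
assert offsets.tolist() == [0, 3, 3, 5]
assert contents.tolist() == [1.1, 2.2, 3.3, 4.4, 5.5]
{code}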
[jira] [Commented] (ARROW-230) Python: Do not name modules like native ones (i.e. rename pyarrow.io)
[ https://issues.apache.org/jira/browse/ARROW-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589997#comment-15589997 ] Jim Pivarski commented on ARROW-230: I didn't have any luck with that, but it's a different issue, so I created it here: [https://issues.apache.org/jira/browse/ARROW-344]. > Python: Do not name modules like native ones (i.e. rename pyarrow.io) > - > > Key: ARROW-230 > URL: https://issues.apache.org/jira/browse/ARROW-230 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Uwe L. Korn > > Although you can rename it in imports, still weird stuff can happen. > E.g. if you re-run make in the build directory (only happens probably if you > change pyarrow's CMakeLists.txt and do not call it via setup.py) you will get > the following error: > {noformat} > -- Found Python lib /usr/lib/x86_64-linux-gnu/libpython2.7.so > CMake Error at cmake_modules/FindNumPy.cmake:62 (message): > NumPy import failure: > Traceback (most recent call last): > File "", line 1, in > File > "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/__init__.py", > line 180, in > from . import add_newdocs > File > "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/add_newdocs.py", > line 13, in > from numpy.lib import add_newdoc > File > "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/lib/__init__.py", > line 8, in > from .type_check import * > File > "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/lib/type_check.py", > line 11, in > import numpy.core.numeric as _nx > File > "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/core/__init__.py", > line 58, in > from numpy.testing import Tester > File > "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/testing/__init__.py", > line 14, in > from .utils import * > File > "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/testing/utils.py", > line 15, in > from tempfile import mkdtemp > File "/usr/lib/python2.7/tempfile.py", line 32, in > import io as _io > ImportError: > > /home/uwe/Development/arrow/python/build/temp.linux-x86_64-2.7/./libpyarrow.so: > undefined symbol: pyarrow_ARRAY_API > Call Stack (most recent call first): > CMakeLists.txt:223 (find_package) > {noformat} > The actual error message here is confusing but the basic problem is that here > the wrong io module is imported. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (ARROW-344) Instructions for building with conda
Jim Pivarski created ARROW-344: -- Summary: Instructions for building with conda Key: ARROW-344 URL: https://issues.apache.org/jira/browse/ARROW-344 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.1.0 Reporter: Jim Pivarski According to [this comment|https://issues.apache.org/jira/browse/ARROW-230?focusedCommentId=15588846=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15588846], Arrow 0.1.0 for Python can be installed with Conda. {{arrow-cpp}} is a dependency of the Python version, and I can install {{arrow-cpp}} locally with {noformat} conda install --channel conda-forge parquet-cpp numpy pandas pytest cd apache-arrow-0.1.0/cpp conda-build conda.recipe --channel conda-forge conda install -c conda-forge --use-local arrow-cpp cd ../python {noformat} but I can't build and locally install the {{conda.recipe}} in the Python directory because conda keeps trying to get the {{arrow-cpp}} on {{conda-forge}}, rather than the one in the 0.1.0 release. Those versions are incompatible due to a changed API: {noformat} [ 24%] Building CXX object CMakeFiles/pyarrow.dir/src/pyarrow/adapters/builtin.cc.o /usr/bin/c++ -Dpyarrow_EXPORTS -isystem /opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/lib/python2.7/site-packages/numpy/core/include -isystem /opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include/python2.7 -isystem /opt/apache-arrow-0.1.0/python/src -isystem /opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include -std=c++11 -Wall -ggdb -O0 -g -fPIC -fPIC -o CMakeFiles/pyarrow.dir/src/pyarrow/adapters/builtin.cc.o -c /opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc /opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc: In function 'pyarrow::Status pyarrow::ConvertPySequence(PyObject*, std::shared_ptr*)': /opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc:434:26: error: no matching function for call to 'arrow::ArrayBuilder::Finish()' *out = builder->Finish(); ^ /opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc:434:26: note: candidate is: In file included from /opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include/arrow/api.h:24:0, from /opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc:23: /opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include/arrow/builder.h:96:18: note: virtual arrow::Status arrow::ArrayBuilder::Finish(std::shared_ptr*) virtual Status Finish(std::shared_ptr* out) = 0; ^ /opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include/arrow/builder.h:96:18: note: candidate expects 1 argument, 0 provided make[2]: *** [CMakeFiles/pyarrow.dir/src/pyarrow/adapters/builtin.cc.o] Error 1 make[2]: Leaving directory `/opt/apache-arrow-0.1.0/python/build/temp.linux-x86_64-2.7' make[1]: *** [CMakeFiles/pyarrow.dir/all] Error 2 make[1]: Leaving directory `/opt/apache-arrow-0.1.0/python/build/temp.linux-x86_64-2.7' make: *** [all] Error 2 error: command 'make' failed with exit status 2 {noformat} If I do {{conda-build --channel local --channel conda-forge --override-channels}}, it can't find some of the packages I've installed. If I don't {{--override-channels}}, it tries to use {{arrow-cpp 0.1.post-1}} from {{conda-forge}} as the dependency and I get the compilation error above. 
Note: my {{conda list}} is {noformat} # packages in environment at /opt/miniconda2: # conda-build 2.0.6py27_0 blas 1.1openblasconda-forge conda 4.1.12 py27_0conda-forge conda-env 2.5.2py27_0conda-forge numpy 1.11.2 py27_blas_openblas_200 [blas_openblas] conda-forge openblas 0.2.185conda-forge pandas0.19.0 np111py27_0conda-forge parquet-cpp 0.1.pre 3conda-forge pytest3.0.3py27_0conda-forge thrift-cpp0.9.3 3conda-forge enum341.1.6py27_0 filelock 2.0.6py27_0 jinja22.8 py27_1 libgfortran 3.0.0 1 arrow-cpp 0.1 0local markupsafe0.23 py27_2 mkl 11.3.30 openssl 1.0.2h1 patchelf
[jira] [Commented] (ARROW-230) Python: Do not name modules like native ones (i.e. rename pyarrow.io)
[ https://issues.apache.org/jira/browse/ARROW-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15588659#comment-15588659 ] Jim Pivarski commented on ARROW-230: I made sure my PYTHONPATH and LD_LIBRARY_PATH were blank, installed a new directory, compiled the C++ library from source, and then attempted to compile the Python library. I'm including a log of that process below with some of the early steps truncated (...) and the Python compilation completely untruncated. My prompt is a single percent (%). {noformat} % echo $PYTHONPATH % echo $LD_LIBRARY_PATH % export ARROW_HOME=/opt/apache-arrow-0.1.0/cpp/dist % cd /opt % tar -xzvf /tmp/downloads/apache-arrow-0.1.0.tar.gz apache-arrow-0.1.0/ apache-arrow-0.1.0/.travis.yml apache-arrow-0.1.0/LICENSE.txt apache-arrow-0.1.0/NOTICE.txt apache-arrow-0.1.0/README.md ... % cd apache-arrow-0.1.0/cpp % source setup_build_env.sh + set -e +++ dirname ./thirdparty/download_thirdparty.sh ++ cd ./thirdparty ++ pwd + TP_DIR=/opt/apache-arrow-0.1.0/cpp/thirdparty + source /opt/apache-arrow-0.1.0/cpp/thirdparty/versions.sh ++ GTEST_VERSION=1.7.0 ... % mkdir release % cd release % cmake .. -DCMAKE_INSTALL_PREFIX:PATH=$ARROW_HOME clang-tidy not found clang-format not found Configured for DEBUG build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...}) -- Build Type: DEBUG INFO Using built-in specs. COLLECT_GCC=/usr/bin/c++ COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper Target: x86_64-linux-gnu ... -- Flatbuffers static library: /opt/apache-arrow-0.1.0/cpp/thirdparty/installed/lib/libflatbuffers.a -- Flatbuffers compiler: /opt/apache-arrow-0.1.0/cpp/thirdparty/installed/bin/flatc -- Configuring done -- Generating done -- Build files have been written to: /opt/apache-arrow-0.1.0/cpp/release % make unittest Scanning dependencies of target metadata_fbs [ 2%] Running flatc compiler on /opt/apache-arrow-0.1.0/format/Message.fbs;/opt/apache-arrow-0.1.0/format/File.fbs [ 2%] Built target metadata_fbs Scanning dependencies of target arrow_objlib [ 4%] Building CXX object CMakeFiles/arrow_objlib.dir/src/arrow/array.cc.o [ 6%] Building CXX object CMakeFiles/arrow_objlib.dir/src/arrow/builder.cc.o ... 17/18 Test #17: ipc-file-test Passed0.12 sec Start 18: ipc-metadata-test 18/18 Test #18: ipc-metadata-test Passed0.12 sec 100% tests passed, 0 tests failed out of 18 Label Time Summary: unittest= 2.29 sec Total Test time (real) = 2.31 sec [100%] Built target unittest % make install [ 2%] Built target metadata_fbs [ 42%] Built target arrow_objlib [ 42%] Built target arrow_shared [ 42%] Built target arrow_static [ 44%] Built target arrow_test_main [ 46%] Built target array-test [ 48%] Built target column-test ... 
-- Installing: /opt/apache-arrow-0.1.0/cpp/dist/include/arrow/types/union.h -- Installing: /opt/apache-arrow-0.1.0/cpp/dist/include/arrow/ipc/adapter.h -- Installing: /opt/apache-arrow-0.1.0/cpp/dist/include/arrow/ipc/file.h -- Installing: /opt/apache-arrow-0.1.0/cpp/dist/include/arrow/ipc/metadata.h -- Installing: /opt/apache-arrow-0.1.0/cpp/dist/lib/libarrow_ipc.so -- Removed runtime path from "/opt/apache-arrow-0.1.0/cpp/dist/lib/libarrow_ipc.so" % cd ../../python % tree $ARROW_HOME /opt/apache-arrow-0.1.0/cpp/dist |-- include | `-- arrow | |-- api.h | |-- array.h | |-- builder.h | |-- column.h | |-- io | | |-- file.h | | |-- hdfs.h | | |-- interfaces.h | | `-- memory.h | |-- ipc | | |-- adapter.h | | |-- file.h | | `-- metadata.h | |-- schema.h | |-- table.h | |-- test-util.h | |-- type.h | |-- types | | |-- collection.h | | |-- construct.h | | |-- datetime.h | | |-- decimal.h | | |-- json.h | | |-- list.h | | |-- primitive.h | | |-- string.h | | |-- struct.h | | `-- union.h | `-- util | |-- bit-util.h | |-- buffer.h | |-- logging.h | |-- macros.h | |-- memory-pool.h | |-- random.h | |-- status.h | `-- visibility.h `-- lib |-- libarrow.a |-- libarrow.so |-- libarrow_io.so `-- libarrow_ipc.so 7 directories, 37 files % python setup.py build_ext --inplace /home/pivarski/.local/lib/python2.7/site-packages/setuptools/dist.py:331: UserWarning: Normalizing '0.1.0dev' to '0.1.0.dev0' normalized_version, running build_ext creating build creating build/temp.linux-x86_64-2.7 cmake -DPYTHON_EXECUTABLE=/usr/bin/python /opt/apache-arrow-0.1.0/python -- The C compiler identification is GNU 4.8.4 -- The CXX compiler identification is GNU 4.8.4 -- Check for working C compiler: /usr/bin/cc -- Check for working C compiler: /usr/bin/cc