[jira] [Created] (ARROW-17539) Reading a StructArray column with an ExtensionType causes segfault

2022-08-26 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-17539:


 Summary: Reading a StructArray column with an ExtensionType causes 
segfault
 Key: ARROW-17539
 URL: https://issues.apache.org/jira/browse/ARROW-17539
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 9.0.0
Reporter: Jim Pivarski


We can make nested columns in a Parquet file by putting a {{pa.StructArray}} in 
a {{pa.Table}} and writing that Table to Parquet. We can selectively read back 
that nested column by specifying it with dot syntax:

{{pq.ParquetFile("f.parquet").read_row_groups([0], 
["table_column.struct_field"])}}

But if the Arrow types are ExtensionTypes, then the above causes a segfault. 
The segfault depends both on the nested struct field and the ExtensionTypes.

Here is a minimal example that reproduces reading a nested struct field 
without extension types; it does not segfault. (I'm building the 
{{pa.StructArray}} manually with {{from_buffers}} because I'll have to add the 
ExtensionTypes in the next example.)
{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

one = pa.Array.from_buffers(
    pa.int64(),
    3,
    [None, pa.py_buffer(np.array([10, 20, 30], dtype=np.int64))],
)
two = pa.Array.from_buffers(
    pa.float64(),
    3,
    [None, pa.py_buffer(np.array([1.1, 2.2, 3.3], dtype=np.float64))],
)
record = pa.Array.from_buffers(
    pa.struct([
        pa.field("one", one.type, False),
        pa.field("two", two.type, False),
    ]),
    3,
    [None],
    children=[one, two],
)
assert record.to_pylist() == [
    {"one": 10, "two": 1.1},
    {"one": 20, "two": 2.2},
    {"one": 30, "two": 3.3},
]

table = pa.Table.from_arrays([record], names=["column"])
pq.write_table(table, "record.parquet")
table2 = pq.ParquetFile("record.parquet").read_row_groups([0], ["column.one"])
assert table2.to_pylist() == [
    {"column": {"one": 10}},
    {"column": {"one": 20}},
    {"column": {"one": 30}},
]
{code}
So far, so good; no segfault. Next, we define and register an ExtensionType,
{code:python}
import json

class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")

    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        print(storage_type, annotation)
        return cls(storage_type, annotation)

    @property
    def num_buffers(self):
        return self.storage_type.num_buffers

    @property
    def num_fields(self):
        return self.storage_type.num_fields

pa.register_extension_type(AnnotatedType(pa.null(), None))
{code}
build the {{pa.StructArray}} again,
{code:python}
one = pa.Array.from_buffers(
    AnnotatedType(pa.int64(), {"annotated": "one"}),
    3,
    [None, pa.py_buffer(np.array([10, 20, 30], dtype=np.int64))],
)
two = pa.Array.from_buffers(
    AnnotatedType(pa.float64(), {"annotated": "two"}),
    3,
    [None, pa.py_buffer(np.array([1.1, 2.2, 3.3], dtype=np.float64))],
)
record = pa.Array.from_buffers(
    AnnotatedType(
        pa.struct([
            pa.field("one", one.type, False),
            pa.field("two", two.type, False),
        ]),
        {"annotated": "record"},
    ),
    3,
    [None],
    children=[one, two],
)
assert record.to_pylist() == [
    {"one": 10, "two": 1.1},
    {"one": 20, "two": 2.2},
    {"one": 30, "two": 3.3},
]
{code}
Now when we write and read this back, there's a segfault:
{code:python}
table = pa.Table.from_arrays([record], names=["column"])
pq.write_table(table, "record_annotated.parquet")

print("before segfault")

table2 = pq.ParquetFile("record_annotated.parquet").read_row_groups([0], ["column.one"])

print("after segfault")
{code}
The output, which prints each annotation as the ExtensionType is deserialized, 
is
{code:java}
before segfault
int64 {'annotated': 'one'}
double {'annotated': 'two'}
int64 {'annotated': 'one'}
double {'annotated': 'two'}
struct<one: extension<my:app<int64>> not null, two: extension<my:app<double>> not null> {'annotated': 'record'}
Segmentation fault (core dumped)
{code}
Note that if we read back that file, {{{}record_annotated.parquet{}}}, without 
the ExtensionType, everything is fine:
{code:java}
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table2 = pq.ParquetFile("record_annotated.parquet").read_row_groups([0], ["column.one"])
>>> assert table2.to_pylist() == [
...     {"column": {"one": 10}},
...     {"column": {"one": 20}},
...     {"column": {"one": 30}},
... ]
{code}
and if we register the ExtensionType but don't select a column, there is no 
segfault either (the crash needs both the ExtensionTypes and the nested column 
selection).

[jira] [Created] (ARROW-16348) ParquetWriter use_compliant_nested_type=True does not preserve ExtensionArray when reading back

2022-04-26 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-16348:


 Summary: ParquetWriter use_compliant_nested_type=True does not 
preserve ExtensionArray when reading back
 Key: ARROW-16348
 URL: https://issues.apache.org/jira/browse/ARROW-16348
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 7.0.0
 Environment: pyarrow 7.0.0 installed via pip.
Reporter: Jim Pivarski


I've been happily making ExtensionArrays, but recently noticed that they aren't 
preserved by round-trips through Parquet files when 
{{{}use_compliant_nested_type=True{}}}.

Consider this writer.py:

 
{code:java}
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")
    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()
    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        return cls(storage_type, annotation)
    @property
    def num_buffers(self):
        return self.storage_type.num_buffers
    @property
    def num_fields(self):
        return self.storage_type.num_fields
pa.register_extension_type(AnnotatedType(pa.null(), None))
array = pa.Array.from_buffers(
    AnnotatedType(pa.list_(pa.float64()), {"cool": "beans"}),
    3,
    [None, pa.py_buffer(np.array([0, 3, 3, 5], np.int32))],
    children=[pa.array([1.1, 2.2, 3.3, 4.4, 5.5])],
)
table = pa.table({"": array})
print(table)
pq.write_table(table, "tmp.parquet", use_compliant_nested_type=True)
{code}
And this reader.py:

 
{code:java}
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")
    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()
    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        return cls(storage_type, annotation)
    @property
    def num_buffers(self):
        return self.storage_type.num_buffers
    @property
    def num_fields(self):
        return self.storage_type.num_fields
pa.register_extension_type(AnnotatedType(pa.null(), None))
table = pq.read_table("tmp.parquet")
print(table)
{code}
(The AnnotatedType is the same; I wrote it twice for explicitness.)

When the writer.py has {{{}use_compliant_nested_type=False{}}}, the output is
{code:java}
% python writer.py 
pyarrow.Table
: extension<my:app<list<item: double>>>

: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py 
pyarrow.Table
: extension<my:app<list<item: double>>>

: [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
In other words, the AnnotatedType is preserved. When 
{{{}use_compliant_nested_type=True{}}}, however,
{code:java}
% rm tmp.parquet
rm: remove regular file 'tmp.parquet'? y
% python writer.py 
pyarrow.Table
: extension<my:app<list<item: double>>>

: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py 
pyarrow.Table
: list<element: double>
  child 0, element: double

: [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
The issue doesn't seem to be in the writing, but in the reading: regardless of 
whether {{use_compliant_nested_type}} is {{True}} or {{{}False{}}}, I can see 
the extension metadata in the Parquet → Arrow converted schema.
{code:java}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<item: double>
  child 0, item: double
  -- field metadata --
  ARROW:extension:metadata: '{"cool": "beans"}'
  ARROW:extension:name: 'my:app'{code}
versus
{code:java}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<element: double>
  child 0, element: double
  -- field metadata --
  ARROW:extension:metadata: '{"cool": "beans"}'
  ARROW:extension:name: 'my:app'{code}
Note that the first has "{{{}item: double{}}}" and the second has "{{{}element: 
double{}}}".

(I'm also rather surprised that {{use_compliant_nested_type=False}} is an 
option. Wouldn't you want the Parquet files to always be written with compliant 
lists? I noticed this when I was having trouble getting the data into BigQuery.)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-14770) Direct (individualized) access to definition levels, repetition levels, and numeric data of a column

2021-11-18 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-14770:


 Summary: Direct (individualized) access to definition levels, 
repetition levels, and numeric data of a column
 Key: ARROW-14770
 URL: https://issues.apache.org/jira/browse/ARROW-14770
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Parquet, Python
Reporter: Jim Pivarski


It would be useful to have more low-level access to the three components of a 
Parquet column in Python: the definition levels, the repetition levels, and the 
numeric data, {_}individually{_}.

The particular use-case we have in Awkward Array is that users will sometimes 
lazily read an array of lists of structs without reading any of the fields of 
those structs. To build the data structure, we need the lengths of the lists 
independently of the columns (which users can then use in functions like 
{{{}ak.num{}}}; the number of structs without their field values is useful 
information).

What we're doing right now is reading a column, converting it to Arrow 
({{{}pa.Array{}}}), and getting the list lengths from that Arrow array. We have 
been using the schema to try to pick the smallest column (booleans are best!), 
but that's because we really just want the definition and repetition levels 
without the numeric data.
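
For concreteness, a minimal sketch of that workaround (the file name and column path here are hypothetical):
{code:python}
import numpy as np
import pyarrow.parquet as pq

# Read one (ideally cheap) column only to recover the list structure.
table = pq.ParquetFile("events.parquet").read_row_group(0, ["x.list.item.y"])
chunk = table.column(0).chunk(0)   # a ListArray / LargeListArray
lengths = np.diff(chunk.offsets)   # the list lengths (what ak.num needs)
# chunk.values -- the decoded numeric data -- is thrown away immediately.
{code}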

I've heard that the Parquet metadata includes offsets to select just the 
definition levels, just the repetition levels, or just the numeric data 
(pre-decompression?). Exposing those in Python as {{pa.Buffer}} objects would 
be ideal.
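
For what it's worth, the page-level offsets already exposed through the Parquet column-chunk metadata in Python look like this (they locate whole pages, not the separate definition/repetition/data regions requested here; the file name is hypothetical):
{code:python}
import pyarrow.parquet as pq

meta = pq.ParquetFile("events.parquet").metadata
col = meta.row_group(0).column(0)
print(col.dictionary_page_offset, col.data_page_offset, col.total_compressed_size)
{code}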

Beyond our use case, such a feature could also help with wide structs in lists: 
all of the non-nullable fields of the struct would share the same definition 
and repetition levels, so they don't need to be re-read. For that use-case, the 
ability to pick out definition, repetition, and numeric data separately would 
still be useful, but the purpose would be to read the numeric data without the 
structural integers (opposite of ours).

The desired interface would be like {{{}ParquetFile.read_row_group{}}}, but 
would return one, two, or three {{pa.Buffer}} objects depending on three 
boolean arguments, {{{}definition{}}}, {{{}repetition{}}}, and {{{}numeric{}}}. 
The {{pa.Buffer}} would be unpacked, with all run-length encodings and 
fixed-width encodings converted into integers of at least one byte each. It may 
make more sense for the output to be {{{}np.ndarray{}}}, to carry {{dtype}} 
information if that can depend on the maximum level (though levels larger than 
255 are likely rare!). This information must be available at some level in 
Arrow's C++ code; the request is to expose it to Python.
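
A purely hypothetical sketch of what such a call could look like (the method and keyword names below are illustrative only, not an existing pyarrow API):
{code:python}
import pyarrow.parquet as pq

# Hypothetical API: return the unpacked structural integers for one column
# chunk without materializing the numeric values.
definition, repetition = pq.ParquetFile("events.parquet").read_levels(
    row_group=0,
    column="x.list.item.y",
    definition=True,    # unpacked definition levels, at least one byte each
    repetition=True,    # unpacked repetition levels
    numeric=False,      # skip the values entirely
)
{code}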

I've labeled this minor because it is for optimizations, but it would be really 
nice to have!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14547) Reading FixedSizeListArray from Parquet with nulls

2021-11-01 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-14547:


 Summary: Reading FixedSizeListArray from Parquet with nulls
 Key: ARROW-14547
 URL: https://issues.apache.org/jira/browse/ARROW-14547
 Project: Apache Arrow
  Issue Type: Bug
  Components: Parquet, Python
Affects Versions: 6.0.0
Reporter: Jim Pivarski


This one is easy to describe: given an array of fixed-sized lists, in which 
some are null,
{code:python}
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> a = pa.FixedSizeListArray.from_arrays(np.arange(10), 5).take([0, None])
>>> a
<pyarrow.lib.FixedSizeListArray object at 0x...>
[
  [
0,
1,
2,
3,
4
  ],
  null
]
{code}
you can write them to a Parquet file, but not read them back:
{code:python}
>>> pa.parquet.write_table(pa.table({"": a}), "tmp.parquet")
>>> pa.parquet.read_table("tmp.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
line 1941, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
line 1776, in read
table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected all lists to be of size=5 but index 2 had 
size=0
{code}
It could be that, at some level, the second list is considered to be empty.

For completeness, this doesn't happen if the fixed-sized lists have no nulls:
{code:python}
>>> b = pa.FixedSizeListArray.from_arrays(np.arange(10), 5)
>>> b
<pyarrow.lib.FixedSizeListArray object at 0x...>
[
  [
0,
1,
2,
3,
4
  ],
  [
5,
6,
7,
8,
9
  ]
]
>>> pa.parquet.write_table(pa.table({"": b}), "tmp2.parquet")
>>> pa.parquet.read_table("tmp2.parquet")
pyarrow.Table
: fixed_size_list<item: int64>[5]
  child 0, item: int64

: [[[0,1,2,3,4],[5,6,7,8,9]]]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14525) Writing DictionaryArrays with ExtensionType to Parquet

2021-10-29 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-14525:


 Summary: Writing DictionaryArrays with ExtensionType to Parquet
 Key: ARROW-14525
 URL: https://issues.apache.org/jira/browse/ARROW-14525
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 6.0.0
Reporter: Jim Pivarski


Thanks to some help I got from [~jorisvandenbossche], I can create 
DictionaryArrays with ExtensionType (on just the dictionary, the dictionary 
array itself, or both). However, these extended-DictionaryArrays can't be 
written to Parquet files.

To start, let's set up my minimal reproducer ExtensionType, this time with an 
explicit ExtensionArray:
{code:python}
>>> import json
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> 
>>> class AnnotatedArray(pa.ExtensionArray):
...     pass
... 
>>> class AnnotatedType(pa.ExtensionType):
...     def __init__(self, storage_type, annotation):
...         self.annotation = annotation
...         super().__init__(storage_type, "my:app")
...     def __arrow_ext_serialize__(self):
...         return json.dumps(self.annotation).encode()
...     @classmethod
...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
...         annotation = json.loads(serialized.decode())
...         return cls(storage_type, annotation)
...     def __arrow_ext_class__(self):
...         return AnnotatedArray
... 
>>> pa.register_extension_type(AnnotatedType(pa.null(), None))
{code}
A non-extended DictionaryArray could be built like this:
{code:python}
>>> dictarray = pa.DictionaryArray.from_arrays(
...     np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
...     pa.Array.from_buffers(
...         pa.float64(),
...         4,
...         [
...             None,
...             pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
...         ],
...     ),
... )
>>> dictarray
<pyarrow.lib.DictionaryArray object at 0x...>

-- dictionary:
  [
0,
1.1,
2.2,
3.3
  ]
-- indices:
  [
3,
2,
2,
2,
0,
1,
3
  ]
{code}
I can write it to a file and read it back, though the fact that it comes back 
as a non-DictionaryArray might be part of the problem. Is some decision being 
made about the array of indices being too short to warrant dictionary encoding?
{code:python}
>>> pa.parquet.write_table(pa.table({"": dictarray}), "tmp.parquet")
>>> pa.parquet.read_table("tmp.parquet")
pyarrow.Table
: double

: [[3.3,2.2,2.2,2.2,0,1.1,3.3]]
{code}
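(As an aside, and not a fix for the bug below: {{pq.read_table}} also has a {{read_dictionary}} argument that asks for the listed columns to come back dictionary-encoded; whether it behaves well with the empty column name used here is untested.)
{code:python}
import pyarrow.parquet as pq

# Ask for the column (named "" above) to stay dictionary-encoded on read.
table = pq.read_table("tmp.parquet", read_dictionary=[""])
print(table.schema)
{code}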
Anyway, the next step is to make a DictionaryArray with ExtensionTypes. In this 
example, I'm making both the dictionary and the outer DictionaryArray itself be 
extended:
{code:python}
>>> dictionary_type = AnnotatedType(pa.float64(), "inner annotation")
>>> dictarray_type = AnnotatedType(
...     pa.dictionary(pa.int32(), dictionary_type), "outer annotation"
... )
>>> dictarray_ext = AnnotatedArray.from_storage(
...     dictarray_type,
...     pa.DictionaryArray.from_arrays(
...         np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
...         pa.Array.from_buffers(
...             dictionary_type,
...             4,
...             [
...                 None,
...                 pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
...             ],
...         ),
...     )
... )
>>> dictarray_ext
<__main__.AnnotatedArray object at 0x7f8c71ec7ee0>

-- dictionary:
  [
0,
1.1,
2.2,
3.3
  ]
-- indices:
  [
3,
2,
2,
2,
0,
1,
3
  ]
{code}
This can't be written to a Parquet file:
{code:python}
>>> pa.parquet.write_table(pa.table({"": dictarray_ext}), "tmp2.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
line 2034, in write_table
writer.write_table(table, row_group_size=row_group_size)
  File 
"/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
line 701, in write_table
self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1451, in 
pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from 
dictionary<values=extension<my:app<double>>, indices=int32, ordered=0> 
to extension<my:app<double>> (no available cast function for target type)
{code}
My first thought was maybe the data used in the dictionary must be simple (it's 
usually strings). So how about making the outer DictionaryArray extended, but 
the inner dictionary not extended? The type definitions are now inline.
{code:python}
>>> dictarray_partial = AnnotatedArray.from_storage(
...     AnnotatedType(  # extended, but the content is not
...         pa.dictionary(pa.int32(), pa.float64()), "only annotation"
...     ),
...     pa.DictionaryArray.from_arrays(
...         np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
...         pa.Array.from_buffers(
...             pa.float64(),   # not extended
...  

[jira] [Created] (ARROW-14522) Can't read empty-but-for-nulls data from Parquet if it has an ExtensionType

2021-10-29 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-14522:


 Summary: Can't read empty-but-for-nulls data from Parquet if it 
has an ExtensionType
 Key: ARROW-14522
 URL: https://issues.apache.org/jira/browse/ARROW-14522
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 6.0.0
Reporter: Jim Pivarski


Here's a corner case: suppose that I have data with type null, but it can have 
missing values so the whole array consists of nothing but nulls. In real life, 
this might only happen inside a nested data structure, at some level where an 
untyped data source (e.g. nested Python lists) had no entries so a type could 
not be determined. We expect to be able to write and read this data to and from 
Parquet, and we can—as long as it doesn't have an ExtensionType.
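
For example, type inference over nested Python lists with no innermost entries produces exactly this kind of type:
{code:python}
import pyarrow as pa

print(pa.array([[], [], []]).type)   # list<item: null>: the inner type could not be determined
{code}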

Here's an example that works, _without_ ExtensionType:
{code:python}
>>> import json
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> 
>>> validbits = np.packbits(np.ones(14, dtype=np.uint8), bitorder="little")
>>> empty_but_for_nulls = pa.Array.from_buffers(
...     pa.null(), 14, [pa.py_buffer(validbits)], null_count=14
... )
>>> empty_but_for_nulls
<pyarrow.lib.NullArray object at 0x...>
14 nulls
>>> 
>>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp.parquet")
>>> pa.parquet.read_table("tmp.parquet")
pyarrow.Table
: null

: [14 nulls]
{code}
And here's a continuation of that example, which doesn't work because the type 
{{pa.null()}} is replaced by {{AnnotatedType(pa.null(), \{"cool": "beans"})}}:
{code:python}
>>> class AnnotatedType(pa.ExtensionType):
...     def __init__(self, storage_type, annotation):
...         self.annotation = annotation
...         super().__init__(storage_type, "my:app")
...     def __arrow_ext_serialize__(self):
...         return json.dumps(self.annotation).encode()
...     @classmethod
...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
...         annotation = json.loads(serialized.decode())
...         return cls(storage_type, annotation)
... 
>>> pa.register_extension_type(AnnotatedType(pa.null(), None))
>>> 
>>> empty_but_for_nulls = pa.Array.from_buffers(
...     AnnotatedType(pa.null(), {"cool": "beans"}),
...     14,
...     [pa.py_buffer(validbits)],
...     null_count=14,
... )
>>> empty_but_for_nulls
<pyarrow.lib.ExtensionArray object at 0x...>
14 nulls
>>> 
>>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp2.parquet")
>>> pa.parquet.read_table("tmp2.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
line 1941, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
line 1776, in read
table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Array of type extension<my:app<null>> has 14 
nulls but no null bitmap
{code}
If "nullable type null" were outside the set of types that can be written to 
Parquet, then the non-ExtensionType case would also fail, or the failure would 
happen on writing rather than reading, so I'm quite sure this is a bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14485) ParquetFile.read_row_group loses struct nullability when selecting one column from a struct

2021-10-26 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-14485:


 Summary: ParquetFile.read_row_group loses struct nullability when 
selecting one column from a struct
 Key: ARROW-14485
 URL: https://issues.apache.org/jira/browse/ARROW-14485
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 6.0.0
Reporter: Jim Pivarski
 Attachments: test8.parquet

This appeared minutes ago because we have a test suite that saw Arrow 6.0.0 
land in PyPI. (Congrats, by the way! I've been looking forward to this one!)

Below, you'll see one thing that version 6 fixed (asking for one column in a 
nested struct returns only that one column) and a new error (it does not 
preserve nullability of the surrounding struct). Here, I'll write down the 
steps to reproduce and then explain.
{code:python}
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet
>>> pyarrow.__version__
'5.0.0'
>>> file = pyarrow.parquet.ParquetFile("test8.parquet")
>>> file.schema
<pyarrow._parquet.ParquetSchema object at 0x...>
required group field_id=-1 schema {
  required group field_id=-1 x (List) {
repeated group field_id=-1 list {
  required group field_id=-1 item {
required int64 field_id=-1 y;
required double field_id=-1 z;
  }
}
  }
}

>>> file.schema_arrow
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
  child 0, y: int64 not null
  child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
  child 0, y: int64 not null
  child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
  child 0, y: int64 not null
  child 1, z: double not null
>>> file.read_row_group(0).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
  child 0, y: int64 not null
  child 1, z: double not null

Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet
>>> pyarrow.__version__
'6.0.0'
>>> file = pyarrow.parquet.ParquetFile("test8.parquet")
>>> file.schema
<pyarrow._parquet.ParquetSchema object at 0x...>
required group field_id=-1 schema {
  required group field_id=-1 x (List) {
repeated group field_id=-1 list {
  required group field_id=-1 item {
required int64 field_id=-1 y;
required double field_id=-1 z;
  }
}
  }
}

>>> file.schema_arrow
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
  child 0, y: int64 not null
  child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y"]).schema
x: large_list<item: struct<y: int64 not null>> not null
  child 0, item: struct<y: int64 not null>
  child 0, y: int64 not null
>>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
  child 0, y: int64 not null
  child 1, z: double not null
>>> file.read_row_group(0).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
  child 0, y: int64 not null
  child 1, z: double not null
{code}
In Arrow 5, asking for only column {{"x.list.item.y"}} returns a struct of 
type {{x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null}}, 
which was undesirable because it has unnecessarily read the {{"z"}} column, but 
it got all of the {{"not null"}} types right. In test8.parquet, the data are 
non-nullable at each level.

In Arrow 6, asking for only column {{"x.list.item.y"}} returns a struct of 
type {{x: large_list<item: struct<y: int64 not null>> not null}}, which is 
great because it's not reading the {{"z"}} column, but the struct's nullability 
is wrong: we should see three {{"not nulls"}} here, one for the data in {{y}}, 
one for the {{struct}}, and one for the {{list}}. It's just missing the middle 
one.

When I ask for two columns specifically or don't specify the columns, the 
nullability is correct. I think that can help to narrow it down.

I've attached the file (test8.parquet). It was the same in both of the above 
tests (generated by Arrow 5).

I labeled this as "Python" because I've only seen the symptom in Python, but I 
suspect that the actual error is in C++.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13040) to_pandas_dtype values are wrong or unimplemented for date-time types

2021-06-10 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-13040:


 Summary: to_pandas_dtype values are wrong or unimplemented for 
date-time types
 Key: ARROW-13040
 URL: https://issues.apache.org/jira/browse/ARROW-13040
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 4.0.0
Reporter: Jim Pivarski


Most of them mistakenly assume nanoseconds, but some are not implemented.

Here's the complete run-down:

{{date32/date64/time32/time64}}
{{---}}

{{>>> pyarrow.date32()}}
{{DataType(date32[day])}}
{{>>> pyarrow.date32().to_pandas_dtype()}}
{{dtype('<M8[ns]')}}

{{>>> pyarrow.date64()}}
{{DataType(date64[ms])}}
{{>>> pyarrow.date64().to_pandas_dtype()}}
{{dtype('<M8[ns]')}}

{{>>> pyarrow.time32("s")}}
{{Time32Type(time32[s])}}
{{>>> pyarrow.time32("s").to_pandas_dtype()}}
{{Traceback (most recent call last):}}
{{ File "<stdin>", line 1, in <module>}}
{{ File "pyarrow/types.pxi", line 200, in pyarrow.lib.DataType.to_pandas_dtype}}
{{NotImplementedError: time32[s]}}

{{>>> pyarrow.time32("ms")}}
{{Time32Type(time32[ms])}}
{{>>> pyarrow.time32("ms").to_pandas_dtype()}}
{{Traceback (most recent call last):}}
{{ File "<stdin>", line 1, in <module>}}
{{ File "pyarrow/types.pxi", line 200, in pyarrow.lib.DataType.to_pandas_dtype}}
{{NotImplementedError: time32[ms]}}

{{>>> pyarrow.time64("us")}}
{{Time64Type(time64[us])}}
{{>>> pyarrow.time64("us").to_pandas_dtype()}}
{{Traceback (most recent call last):}}
{{ File "<stdin>", line 1, in <module>}}
{{ File "pyarrow/types.pxi", line 200, in pyarrow.lib.DataType.to_pandas_dtype}}
{{NotImplementedError: time64[us]}}

{{>>> pyarrow.time64("ns")}}
{{Time64Type(time64[ns])}}
{{>>> pyarrow.time64("ns").to_pandas_dtype()}}
{{Traceback (most recent call last):}}
{{ File "<stdin>", line 1, in <module>}}
{{ File "pyarrow/types.pxi", line 200, in pyarrow.lib.DataType.to_pandas_dtype}}
{{NotImplementedError: time64[ns]}}

{{timestamp}}
{{-}}

{{>>> pyarrow.timestamp("s")}}
{{TimestampType(timestamp[s])}}
{{>>> pyarrow.timestamp("s").to_pandas_dtype()}}
{{dtype('<M8[ns]')}}

{{>>> pyarrow.timestamp("ms")}}
{{TimestampType(timestamp[ms])}}
{{>>> pyarrow.timestamp("ms").to_pandas_dtype()}}
{{dtype('<M8[ns]')}}

{{>>> pyarrow.timestamp("us")}}
{{TimestampType(timestamp[us])}}
{{>>> pyarrow.timestamp("us").to_pandas_dtype()}}
{{dtype('<M8[ns]')}}

{{>>> pyarrow.timestamp("ns")}}
{{TimestampType(timestamp[ns])}}
{{>>> pyarrow.timestamp("ns").to_pandas_dtype()}}
{{dtype('<M8[ns]')}}

{{duration}}
{{-}}

{{>>> pyarrow.duration("s")}}
{{DurationType(duration[s])}}
{{>>> pyarrow.duration("s").to_pandas_dtype()}}
{{dtype('<m8[ns]')}}

{{>>> pyarrow.duration("ms")}}
{{DurationType(duration[ms])}}
{{>>> pyarrow.duration("ms").to_pandas_dtype()}}
{{dtype('<m8[ns]')}}

{{>>> pyarrow.duration("us")}}
{{DurationType(duration[us])}}
{{>>> pyarrow.duration("us").to_pandas_dtype()}}
{{dtype('<m8[ns]')}}

{{>>> pyarrow.duration("ns")}}
{{DurationType(duration[ns])}}
{{>>> pyarrow.duration("ns").to_pandas_dtype()}}
{{dtype('<m8[ns]')}}

[jira] [Created] (ARROW-10930) In pyarrow, LargeListArray doesn't have a value_field

2020-12-15 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-10930:


 Summary: In pyarrow, LargeListArray doesn't have a value_field
 Key: ARROW-10930
 URL: https://issues.apache.org/jira/browse/ARROW-10930
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
Reporter: Jim Pivarski


This one is easy: it looks like the LargeListType is just missing this field. 
Here it is for a 32-bit list (the reason I want this is to get at the 
"nullable" field, although the "metadata" would be nice, too):
{code:java}
>>> import pyarrow as pa
>>> small_array = pa.ListArray.from_arrays(pa.array([0, 3, 3, 5]), pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
>>> small_array.type.value_field
pyarrow.Field<item: double>
>>> small_array.type.value_field.nullable
True{code}
Now with a large list:
{code:java}
>>> large_array = pa.LargeListArray.from_arrays(pa.array([0, 3, 3, 5]), pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
>>> large_array.type.value_field
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.LargeListType' object has no attribute 
'value_field'{code}
Verifying version:
{code:java}
>>> pa.__version__
'2.0.0'{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9801) DictionaryArray with non-unique values are silently corrupted when written to a Parquet file

2020-08-19 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-9801:
---

 Summary: DictionaryArray with non-unique values are silently 
corrupted when written to a Parquet file
 Key: ARROW-9801
 URL: https://issues.apache.org/jira/browse/ARROW-9801
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 1.0.0
 Environment: pyarrow 1.0.0 installed from conda-forge.
Reporter: Jim Pivarski


Suppose that you have a DictionaryArray with repeated values in the dictionary:

{{>>> import pyarrow as pa}}
{{>>> pa_array = pa.DictionaryArray.from_arrays(}}
{{...     pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),}}
{{...     pa.array(["one", "two", "three", "one", "two", "three"])}}
{{... )}}
{{>>> pa_array}}
{{<pyarrow.lib.DictionaryArray object at 0x...>}}
{{-- dictionary:}}
{{ [}}
{{    "one",}}
{{    "two",}}
{{    "three",}}
{{    "one",}}
{{    "two",}}
{{    "three"}}
{{ ]}}
{{-- indices:}}
{{ [}}
{{    0,}}
{{    1,}}
{{    2,}}
{{    3,}}
{{    4,}}
{{    5,}}
{{    0,}}
{{    1,}}
{{    2,}}
{{    3,}}
{{    4,}}
{{    5}}
{{ ]}}

According to [the 
documentation|https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout],
{quote}Dictionary encoding is a data representation technique to represent 
values by integers referencing a *dictionary* usually consisting of unique 
values.
{quote}
so a DictionaryArray like the one above is arguably invalid, but if so, then 
I'd expect some error messages, rather than corrupt data, when I try to write 
it to a Parquet file.

{{>>> pa_table = pa.Table.from_batches(}}
{{...     [pa.RecordBatch.from_arrays([pa_array], ["column"])]}}
{{... )}}
{{>>> pa_table}}
{{pyarrow.Table}}
{{column: dictionary<values=string, indices=int64, ordered=0>}}
{{>>> import pyarrow.parquet}}
{{>>> pyarrow.parquet.write_table(pa_table, "tmp2.parquet")}}

No errors so far. So we try to read it back and view it:

{{>>> pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")}}
{{>>> pa_loaded}}
{{pyarrow.Table}}
{{column: dictionary<values=string, indices=int32, ordered=0>}}
{{>>> pa_loaded.to_pydict()}}
{{Traceback (most recent call last):}}
{{ File "<stdin>", line 1, in <module>}}
{{ File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict}}
{{ File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist}}
{{ File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist}}
{{ File "pyarrow/scalar.pxi", line 712, in pyarrow.lib.DictionaryScalar.as_py}}
{{ File "pyarrow/scalar.pxi", line 701, in 
pyarrow.lib.DictionaryScalar.value.__get__}}
{{ File "pyarrow/error.pxi", line 122, in 
pyarrow.lib.pyarrow_internal_check_status}}
{{ File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status}}
{{pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only 3 
long}}

Looking more closely at this, we see that the dictionary has been minimized to 
include only unique values, but the indices haven't been correctly translated:

{{>>> pa_loaded["column"]}}
{{<pyarrow.lib.ChunkedArray object at 0x...>}}
{{[}}
{{  -- dictionary:}}
{{    [}}
{{      "one",}}
{{      "two",}}
{{      "three"}}
{{    ]}}
{{  -- indices:}}
{{    [}}
{{      0,}}
{{      1,}}
{{      2,}}
{{      3,}}
{{      0,}}
{{      1,}}
{{      1,}}
{{      1,}}
{{      2,}}
{{      3,}}
{{      0,}}
{{      1}}
{{    ]}}
{{]}}

It looks like an attempt was made to minimize it, but the indices ought to be

[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]

I don't know what your preferred course of action is—adding an error message or 
fixing the attempted conversion—but this is wrong. On my side, I'm adding code 
to prevent the creation of non-unique values in DictionaryArrays.
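
For reference, a minimal sketch of that kind of guard (assuming string dictionary values and no nulls in the indices): remap duplicate dictionary values onto their first occurrence before writing.
{code:python}
import numpy as np
import pyarrow as pa

def dedup_dictionary(arr):
    # Map each dictionary value to the index of its first occurrence.
    values = arr.dictionary.to_pylist()
    first, unique = {}, []
    remap = np.empty(len(values), dtype=np.int64)
    for i, v in enumerate(values):
        if v not in first:
            first[v] = len(unique)
            unique.append(v)
        remap[i] = first[v]
    # Rewrite the indices through the remapping and rebuild the array.
    indices = pa.array(remap[np.asarray(arr.indices)])
    return pa.DictionaryArray.from_arrays(indices, pa.array(unique, type=arr.dictionary.type))
{code}
Applied to the {{pa_array}} above, this yields the indices {{[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]}}.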



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9577) posix_madvise error on Debian in pyarrow 1.0.0

2020-07-27 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-9577:
---

 Summary: posix_madvise error on Debian in pyarrow 1.0.0
 Key: ARROW-9577
 URL: https://issues.apache.org/jira/browse/ARROW-9577
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 1.0.0
 Environment: Installed with Miniconda (for Debian; used pip for the 
Ubuntu test)
Reporter: Jim Pivarski


The following writes and reads back from a Parquet file in both pyarrow 0.17.0 
and 1.0.0 on Ubuntu 18.04:
 
{code:java}
>>> import pyarrow.parquet
>>> a = pyarrow.array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> t = pyarrow.Table.from_batches([pyarrow.RecordBatch.from_arrays([a], ["stuff"])])
>>> pyarrow.parquet.write_table(t, "stuff.parquet")
>>> t2 = pyarrow.parquet.read_table("stuff.parquet") {code}
 
However, the same thing raises the following exception on Debian 9 (stretch) in 
pyarrow 1.0.0 but not in pyarrow 0.17.0:
{code:java}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py", 
line 1564, in read_table
filters=filters,
  File 
"/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py", 
line 1433, in __init__
partitioning=partitioning)
  File 
"/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/dataset.py", 
line 667, in dataset
return _filesystem_dataset(source, **kwargs)
  File 
"/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/dataset.py", 
line 434, in _filesystem_dataset
return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 1451, in 
pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 122, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: posix_madvise failed. Detail: [errno 0] Success{code}
It's a little odd that the error says that it failed with "detail: success". 
That suggests to me that an "if" predicate is backward (missing "not"), which 
might only be triggered on some OS/distributions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9556) Segfaults in UnionArray with null values

2020-07-24 Thread Jim Pivarski (Jira)
Jim Pivarski created ARROW-9556:
---

 Summary: Segfaults in UnionArray with null values
 Key: ARROW-9556
 URL: https://issues.apache.org/jira/browse/ARROW-9556
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 1.0.0
 Environment: Conda, but pyarrow was installed using pip (in the conda 
environment)
Reporter: Jim Pivarski


Extracting null values from a UnionArray containing nulls and constructing a 
UnionArray with a bitmask in pyarrow.Array.from_buffers causes segfaults in 
pyarrow 1.0.0. I have an environment with pyarrow 0.17.0 and all of the 
following run correctly without segfaults in the older version.

Here's a UnionArray that works (because there are no nulls):

 
{code:java}
# GOOD
a = pyarrow.UnionArray.from_sparse(
 pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
 [
 pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]),
 pyarrow.array([True, True, False, True, False]),
 ],
)
a.to_pylist(){code}
 

Here's one that fails when you try a.to_pylist() or even just a[2], because one 
of the children has a null at 2:

 
{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_sparse(
 pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
 [
 pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
 pyarrow.array([True, True, False, True, False]),
 ],
)
a.to_pylist() # also just a[2] causes a segfault{code}
 

Here's another that fails because both children have nulls; the segfault occurs 
at both positions with nulls:

 
{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_sparse(
 pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
 [
 pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
 pyarrow.array([True, None, False, True, False]),
 ],
)
a.to_pylist() # also a[1] and a[2] cause segfaults{code}
 

Here's one that succeeds, but it's dense, rather than sparse:

 
{code:java}
# GOOD
a = pyarrow.UnionArray.from_dense(
 pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
 pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
 [pyarrow.array([0.0, 1.1, 2.2, 3.3]), pyarrow.array([True, True, False])],
)
a.to_pylist(){code}
 

Here's a dense that fails because one child has a null:

 
{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_dense(
 pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
 pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
 [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, True, False])],
)
a.to_pylist() # also just a[3] causes a segfault{code}
 

Here's a dense that fails in two positions because both children have a null:

 
{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_dense(
 pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
 pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
 [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, None, False])],
)
a.to_pylist() # also a[3] and a[5] cause segfaults{code}
 

In all of the above, we created the UnionArray with its from_sparse or from_dense method. We 
could instead create it with pyarrow.Array.from_buffers. If created with 
content0 and content1 that have no nulls, it's fine, but if created with nulls 
in the content, it segfaults as soon as you view the null value.

 
{code:java}
# GOOD
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4])
content1 = pyarrow.array([True, True, False, True, False])
# SEGFAULT
content0 = pyarrow.array([0.0, 1.1, 2.2, None, 4.4])
content1 = pyarrow.array([True, True, False, True, False])
types = pyarrow.union(
 [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
 "sparse",
 [0, 1],
)
a = pyarrow.Array.from_buffers(
 types,
 5,
 [
 None,
 pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)),
 ],
 children=[content0, content1],
)
a.to_pylist() # also just a[3] causes a segfault{code}
 

Similarly for a dense union.

 
{code:java}
# GOOD
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3])
content1 = pyarrow.array([True, True, False])
# SEGFAULT
content0 = pyarrow.array([0.0, 1.1, None, 3.3])
content1 = pyarrow.array([True, True, False])
types = pyarrow.union(
 [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
 "dense",
 [0, 1],
)
a = pyarrow.Array.from_buffers(
 types,
 7,
 [
 None,
 pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 0, 1, 1], numpy.int8)),
 pyarrow.py_buffer(numpy.array([0, 0, 1, 2, 3, 1, 2], numpy.int32)),
 ],
 children=[content0, content1],
)
a.to_pylist() # also just a[3] causes a segfault{code}
 

The next segfaults are different: instead of putting the null values in the 
content, we put the null value in the UnionArray itself. This time, it 
segfaults when it is being created. It also prints some output (all of the 
above were silent segfaults).

 
{code:java}
# SEGFAULT (even to create)
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4])
content1 = pyarrow.array([True, True, False, True, False])
types = pyarrow.union(
 [pyarrow.field("0", content0.type), pyarrow.field("1", 

[jira] [Commented] (ARROW-5870) Development compile instructions need to include "make" and "re2"

2019-07-07 Thread Jim Pivarski (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879969#comment-16879969
 ] 

Jim Pivarski commented on ARROW-5870:
-

Actually, only "make" is needed; "re2" comes from the "sudo apt-get install" line 
that I didn't realize was part of the installation (it was under the "using 
pip" section and I was using conda, but it is needed for Boost).

But even after installing Boost (and re2) with apt-get and re-running cmake, 
I'm running into "undefined reference to 
`boost::system::detail::generic_category_ncx()'" errors. I think this is due to 
a missing boost_system, but I can't see from the instructions on

[https://arrow.apache.org/docs/python/development.html]

what's missing.

I had thought this was a simple omission from the instructions (and therefore 
an easy "bug" fix), but it's beginning to look like a long installation 
struggle. Should I move this to the Arrow developers mailing list?

> Development compile instructions need to include "make" and "re2"
> -
>
> Key: ARROW-5870
> URL: https://issues.apache.org/jira/browse/ARROW-5870
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Affects Versions: 0.14.0
>Reporter: Jim Pivarski
>Priority: Major
>  Labels: documentation
>
> Following the build instructions on
> [https://arrow.apache.org/docs/python/development.html]
> using conda—I additionally needed to install the "make" and "re2" packages 
> for cmake to succeed. These are such common packages, it probably didn't come 
> up in your tests, but I have a minimal system.
> (It's not done with "make", but it looks promising so far.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5870) Development compile instructions need to include "make" and "re2"

2019-07-07 Thread Jim Pivarski (JIRA)
Jim Pivarski created ARROW-5870:
---

 Summary: Development compile instructions need to include "make" 
and "re2"
 Key: ARROW-5870
 URL: https://issues.apache.org/jira/browse/ARROW-5870
 Project: Apache Arrow
  Issue Type: Bug
  Components: Website
Affects Versions: 0.14.0
Reporter: Jim Pivarski


Following the build instructions on

[https://arrow.apache.org/docs/python/development.html]

using conda—I additionally needed to install the "make" and "re2" packages for 
cmake to succeed. These are such common packages, it probably didn't come up in 
your tests, but I have a minimal system.

(It's not done with "make", but it looks promising so far.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5869) [Python] Need a way to access UnionArray's children as Arrays in pyarrow

2019-07-07 Thread Jim Pivarski (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879965#comment-16879965
 ] 

Jim Pivarski commented on ARROW-5869:
-

I'm not currently in a position to do that, never having built pyarrow before. 
I could look for instructions and try it out, but not immediately.

> [Python] Need a way to access UnionArray's children as Arrays in pyarrow
> 
>
> Key: ARROW-5869
> URL: https://issues.apache.org/jira/browse/ARROW-5869
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Jim Pivarski
>Priority: Major
>
>  
> There doesn't seem to be a way to get to the children of sparse or dense 
> UnionArrays. For other types, there's
>  * ListType: array.flatten()
>  * StructType: array.field("fieldname")
>  * DictionaryType: array.indices and now array.dictionary (in 0.14.0)
>  * (other types have no children, I think...)
> The reason this comes up now is that I have a downstream library that does a 
> zero-copy view of Arrow by recursively walking over its types and 
> interpreting the list of buffers for each type. In the past, I didn't need 
> the _array_ children of each array—I popped the right number of buffers off 
> the list depending on the type—but now the dictionary for DictionaryType has 
> been moved from the type object to the array object (in 0.14.0). Since it's 
> neither in the buffers list, nor in the type tree, I need to walk the tree of 
> arrays in tandem with the tree of types.
> That would be okay, except that I don't see how to descend from a UnionArray 
> to its children.
> This is the function where I do the walk down types (tpe), and now arrays 
> (array), while interpreting the right number of buffers at each step.
> [https://github.com/scikit-hep/awkward-array/blob/7c5961405cc39bbf2b489fad171652019c8de41b/awkward/arrow.py#L228-L364]
> Simply exposing the std::vector named "children" as a Python sequence or a 
> child(int i) method would provide a way to descend UnionTypes and make this 
> kind of access uniform across all types.
> Alternatively, putting the array.dictionary in the list of buffers would also 
> do it (and make it unnecessary for me to walk over the arrays), but in 
> general it seems like a good idea to make arrays accessible. It seems like it 
> belongs in the buffers, but that would probably be a big change, not to be 
> undertaken for minor reasons.
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5869) Need a way to access UnionArray's children as Arrays in pyarrow

2019-07-06 Thread Jim Pivarski (JIRA)
Jim Pivarski created ARROW-5869:
---

 Summary: Need a way to access UnionArray's children as Arrays in 
pyarrow
 Key: ARROW-5869
 URL: https://issues.apache.org/jira/browse/ARROW-5869
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
Reporter: Jim Pivarski


 

There doesn't seem to be a way to get to the children of sparse or dense 
UnionArrays. For other types, there's
 * ListType: array.flatten()
 * StructType: array.field("fieldname")
 * DictionaryType: array.indices and now array.dictionary (in 0.14.0)
 * (other types have no children, I think...)

The reason this comes up now is that I have a downstream library that does a 
zero-copy view of Arrow by recursively walking over its types and interpreting 
the list of buffers for each type. In the past, I didn't need the _array_ 
children of each array—I popped the right number of buffers off the list 
depending on the type—but now the dictionary for DictionaryType has been moved 
from the type object to the array object (in 0.14.0). Since it's neither in the 
buffers list, nor in the type tree, I need to walk the tree of arrays in tandem 
with the tree of types.

That would be okay, except that I don't see how to descend from a UnionArray to 
its children.

This is the function where I do the walk down types (tpe), and now arrays 
(array), while interpreting the right number of buffers at each step.

[https://github.com/scikit-hep/awkward-array/blob/7c5961405cc39bbf2b489fad171652019c8de41b/awkward/arrow.py#L228-L364]

Simply exposing the std::vector named "children" as a Python sequence or a 
child(int i) method would provide a way to descend UnionTypes and make this 
kind of access uniform across all types.
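
A hypothetical sketch of the requested access pattern ({{child(i)}} is the proposed accessor, not the pyarrow 0.14.0 API):
{code:python}
import pyarrow as pa

a = pa.UnionArray.from_sparse(
    pa.array([0, 1, 0], type=pa.int8()),
    [pa.array([1.1, 2.2, 3.3]), pa.array([True, False, True])],
)
# Proposed: walk the union's children as Arrays, in tandem with the type tree.
for i in range(a.type.num_children):
    child = a.child(i)   # hypothetical accessor returning a pyarrow.Array
{code}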

Alternatively, putting the array.dictionary in the list of buffers would also 
do it (and make it unnecessary for me to walk over the arrays), but in general 
it seems like a good idea to make arrays accessible. It seems like it belongs 
in the buffers, but that would probably be a big change, not to be undertaken 
for minor reasons.

Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2295) Add to_numpy functions

2018-03-10 Thread Jim Pivarski (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394223#comment-16394223
 ] 

Jim Pivarski commented on ARROW-2295:
-

Array.buffers() must be a new feature, after 0.8.0. I'll look for it in the 
next release. Thanks!

> Add to_numpy functions
> --
>
> Key: ARROW-2295
> URL: https://issues.apache.org/jira/browse/ARROW-2295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Lawrence Chan
>Priority: Minor
>
> There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to 
> propose that we include both.
> Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho 
> is very confusing :). I think it would be more intuitive for the 
> `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` 
> objects, and the `to_numpy()` functions to return `numpy.ndarray` and either 
> a ordered dict of `numpy.ndarray` or a structured `numpy.ndarray` depending 
> on a flag, for example. The `to_pandas()` function is of course welcome to 
> use the `to_numpy()` func to avoid the additional index and whatnot of the 
> `pandas.Series`.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2295) Add to_numpy functions

2018-03-10 Thread Jim Pivarski (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394196#comment-16394196
 ] 

Jim Pivarski commented on ARROW-2295:
-

I second this and would like to request that the Numpy interface has more 
low-level access to Arrow structures. For instance, ListArray is internally 
represented as two arrays: offsets and contents, and there are applications 
where we'd want to get a zero-copy view of these arrays. The to_pandas() 
function constructs a Numpy object array of subarrays, which is a performance 
bottleneck if you really do want the original offsets and contents.

This function could be an inverse of pyarrow.ListArray.from_arrays, something 
that returns the offsets and contents as Numpy arrays for a List and 
something more complex for general cases (a dict from strings representing a 
place in the hierarchy to Numpy arrays?).

A simpler interface that could be implemented immediately would be one that 
returns the raw bytes of the Arrow buffer, to let us identify its contents 
using [the Arrow 
spec|https://github.com/apache/arrow/blob/master/format/Layout.md]. But that 
doesn't make use of the dtype (probably just set it to uint8) and would 
probably make more sense as a raw __buffer__. (Should that be a separate JIRA 
ticket?)
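
A sketch of the kind of zero-copy access being described, using {{Array.buffers()}} from later pyarrow versions (see the other ARROW-2295 comment above) together with NumPy's buffer protocol; the buffer positions assume a {{list<double>}} array with no nulls:
{code:python}
import numpy as np
import pyarrow as pa

arr = pa.array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])   # list<double>, no nulls
bufs = arr.buffers()   # flattened: [validity, offsets, child validity, child values]
offsets = np.frombuffer(bufs[1], dtype=np.int32)[:len(arr) + 1]
contents = np.frombuffer(bufs[3], dtype=np.float64)[:offsets[-1]]
{code}
Both views share memory with the Arrow buffers; nothing is copied.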

 

> Add to_numpy functions
> --
>
> Key: ARROW-2295
> URL: https://issues.apache.org/jira/browse/ARROW-2295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Lawrence Chan
>Priority: Minor
>
> There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to 
> propose that we include both.
> Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho 
> is very confusing :). I think it would be more intuitive for the 
> `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` 
> objects, and the `to_numpy()` functions to return `numpy.ndarray` and either 
> a ordered dict of `numpy.ndarray` or a structured `numpy.ndarray` depending 
> on a flag, for example. The `to_pandas()` function is of course welcome to 
> use the `to_numpy()` func to avoid the additional index and whatnot of the 
> `pandas.Series`.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-230) Python: Do not name modules like native ones (i.e. rename pyarrow.io)

2016-10-19 Thread Jim Pivarski (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589997#comment-15589997
 ] 

Jim Pivarski commented on ARROW-230:


I didn't have any luck with that, but it's a different issue, so I created it 
here: [https://issues.apache.org/jira/browse/ARROW-344].

> Python: Do not name modules like native ones (i.e. rename pyarrow.io)
> -
>
> Key: ARROW-230
> URL: https://issues.apache.org/jira/browse/ARROW-230
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>
> Although you can rename it in imports, still weird stuff can happen.
> E.g. if you re-run make in the build directory (only happens probably if you 
> change pyarrow's CMakeLists.txt and do not call it via setup.py) you will get 
> the following error:
> {noformat}
> -- Found Python lib /usr/lib/x86_64-linux-gnu/libpython2.7.so
> CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
> File "", line 1, in 
> File 
> "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/__init__.py",
>  line 180, in 
>   from . import add_newdocs
> File 
> "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/add_newdocs.py",
>  line 13, in 
>   from numpy.lib import add_newdoc
> File 
> "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/lib/__init__.py",
>  line 8, in 
>   from .type_check import *
> File 
> "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/lib/type_check.py",
>  line 11, in 
>   import numpy.core.numeric as _nx
> File 
> "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/core/__init__.py",
>  line 58, in 
>   from numpy.testing import Tester
> File 
> "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/testing/__init__.py",
>  line 14, in 
>   from .utils import *
> File 
> "/home/uwe/.virtualenvs/pyarrow/local/lib/python2.7/site-packages/numpy/testing/utils.py",
>  line 15, in 
>   from tempfile import mkdtemp
> File "/usr/lib/python2.7/tempfile.py", line 32, in 
>   import io as _io
>   ImportError:
>   
> /home/uwe/Development/arrow/python/build/temp.linux-x86_64-2.7/./libpyarrow.so:
>   undefined symbol: pyarrow_ARRAY_API
> Call Stack (most recent call first):
>   CMakeLists.txt:223 (find_package)
> {noformat}
> The actual error message here is confusing but the basic problem is that here 
> the wrong io module is imported. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (ARROW-344) Instructions for building with conda

2016-10-19 Thread Jim Pivarski (JIRA)
Jim Pivarski created ARROW-344:
--

 Summary: Instructions for building with conda
 Key: ARROW-344
 URL: https://issues.apache.org/jira/browse/ARROW-344
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.1.0
Reporter: Jim Pivarski


According to [this 
comment|https://issues.apache.org/jira/browse/ARROW-230?focusedCommentId=15588846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15588846],
 Arrow 0.1.0 for Python can be installed with Conda. {{arrow-cpp}} is a 
dependency of the Python version, and I can install {{arrow-cpp}} locally with

{noformat}
conda install --channel conda-forge parquet-cpp numpy pandas pytest
cd apache-arrow-0.1.0/cpp
conda-build conda.recipe --channel conda-forge
conda install -c conda-forge --use-local arrow-cpp
cd ../python
{noformat}

but I can't build and locally install the {{conda.recipe}} in the Python 
directory because conda keeps trying to use the {{arrow-cpp}} from 
{{conda-forge}} rather than the one built from the 0.1.0 release. Those 
versions are incompatible because the API changed:

{noformat}
[ 24%] Building CXX object 
CMakeFiles/pyarrow.dir/src/pyarrow/adapters/builtin.cc.o
/usr/bin/c++   -Dpyarrow_EXPORTS -isystem 
/opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/lib/python2.7/site-packages/numpy/core/include
 -isystem 
/opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include/python2.7
 -isystem /opt/apache-arrow-0.1.0/python/src -isystem 
/opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include
  -std=c++11 -Wall -ggdb -O0 -g -fPIC   -fPIC -o 
CMakeFiles/pyarrow.dir/src/pyarrow/adapters/builtin.cc.o -c 
/opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc
/opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc: In function 
'pyarrow::Status pyarrow::ConvertPySequence(PyObject*, 
std::shared_ptr<arrow::Array>*)':
/opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc:434:26: error: 
no matching function for call to 'arrow::ArrayBuilder::Finish()'
   *out = builder->Finish();
  ^
/opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc:434:26: note: 
candidate is:
In file included from 
/opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include/arrow/api.h:24:0,
 from 
/opt/apache-arrow-0.1.0/python/src/pyarrow/adapters/builtin.cc:23:
/opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include/arrow/builder.h:96:18:
 note: virtual arrow::Status 
arrow::ArrayBuilder::Finish(std::shared_ptr<arrow::Array>*)
   virtual Status Finish(std::shared_ptr<Array>* out) = 0;
  ^
/opt/miniconda2/conda-bld/conda.recipe_1476908391204/_b_env_placehold_placehold_/include/arrow/builder.h:96:18:
 note:   candidate expects 1 argument, 0 provided
make[2]: *** [CMakeFiles/pyarrow.dir/src/pyarrow/adapters/builtin.cc.o] Error 1
make[2]: Leaving directory 
`/opt/apache-arrow-0.1.0/python/build/temp.linux-x86_64-2.7'
make[1]: *** [CMakeFiles/pyarrow.dir/all] Error 2
make[1]: Leaving directory 
`/opt/apache-arrow-0.1.0/python/build/temp.linux-x86_64-2.7'
make: *** [all] Error 2
error: command 'make' failed with exit status 2
{noformat}

If I do {{conda-build --channel local --channel conda-forge 
--override-channels}}, it can't find some of the packages I've installed. If I 
don't pass {{--override-channels}}, it tries to use {{arrow-cpp 0.1.post-1}} 
from {{conda-forge}} as the dependency and I get the compilation error above.

Note: my {{conda list}} is

{noformat}
# packages in environment at /opt/miniconda2:
#
conda-build               2.0.6                    py27_0
blas                      1.1                    openblas    conda-forge
conda                     4.1.12                   py27_0    conda-forge
conda-env                 2.5.2                    py27_0    conda-forge
numpy                     1.11.2    py27_blas_openblas_200    [blas_openblas]    conda-forge
openblas                  0.2.18                        5    conda-forge
pandas                    0.19.0              np111py27_0    conda-forge
parquet-cpp               0.1.pre                       3    conda-forge
pytest                    3.0.3                    py27_0    conda-forge
thrift-cpp                0.9.3                         3    conda-forge
enum34                    1.1.6                    py27_0
filelock                  2.0.6                    py27_0
jinja2                    2.8                      py27_1
libgfortran               3.0.0                         1
arrow-cpp                 0.1                           0    local
markupsafe                0.23                     py27_2
mkl                       11.3.3                        0
openssl                   1.0.2h                        1
patchelf

[jira] [Commented] (ARROW-230) Python: Do not name modules like native ones (i.e. rename pyarrow.io)

2016-10-19 Thread Jim Pivarski (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15588659#comment-15588659
 ] 

Jim Pivarski commented on ARROW-230:


I made sure my PYTHONPATH and LD_LIBRARY_PATH were blank, unpacked the release 
into a new directory, compiled the C++ library from source, and then attempted 
to compile the Python library. I'm including a log of that process below, with 
some of the early steps truncated (...) and the Python compilation left 
completely untruncated. My prompt is a single percent (%).

{noformat}
% echo $PYTHONPATH  

% echo $LD_LIBRARY_PATH  

% export ARROW_HOME=/opt/apache-arrow-0.1.0/cpp/dist
% cd /opt

% tar -xzvf /tmp/downloads/apache-arrow-0.1.0.tar.gz
apache-arrow-0.1.0/
apache-arrow-0.1.0/.travis.yml
apache-arrow-0.1.0/LICENSE.txt
apache-arrow-0.1.0/NOTICE.txt
apache-arrow-0.1.0/README.md
...

% cd apache-arrow-0.1.0/cpp
% source setup_build_env.sh
+ set -e
+++ dirname ./thirdparty/download_thirdparty.sh
++ cd ./thirdparty
++ pwd
+ TP_DIR=/opt/apache-arrow-0.1.0/cpp/thirdparty
+ source /opt/apache-arrow-0.1.0/cpp/thirdparty/versions.sh
++ GTEST_VERSION=1.7.0
...

% mkdir release
% cd release
% cmake .. -DCMAKE_INSTALL_PREFIX:PATH=$ARROW_HOME
clang-tidy not found
   
clang-format not found
Configured for DEBUG build (set with cmake 
-DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: DEBUG
INFO Using built-in specs.
COLLECT_GCC=/usr/bin/c++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
...
-- Flatbuffers static library: 
/opt/apache-arrow-0.1.0/cpp/thirdparty/installed/lib/libflatbuffers.a
-- Flatbuffers compiler: 
/opt/apache-arrow-0.1.0/cpp/thirdparty/installed/bin/flatc
-- Configuring done
-- Generating done
-- Build files have been written to: /opt/apache-arrow-0.1.0/cpp/release

% make unittest
Scanning dependencies of target metadata_fbs
[  2%] Running flatc compiler on 
/opt/apache-arrow-0.1.0/format/Message.fbs;/opt/apache-arrow-0.1.0/format/File.fbs
[  2%] Built target metadata_fbs
Scanning dependencies of target arrow_objlib
[  4%] Building CXX object CMakeFiles/arrow_objlib.dir/src/arrow/array.cc.o
[  6%] Building CXX object CMakeFiles/arrow_objlib.dir/src/arrow/builder.cc.o
...
17/18 Test #17: ipc-file-test    Passed    0.12 sec
  Start 18: ipc-metadata-test
18/18 Test #18: ipc-metadata-test    Passed    0.12 sec

100% tests passed, 0 tests failed out of 18

Label Time Summary:
unittest=   2.29 sec

Total Test time (real) =   2.31 sec
[100%] Built target unittest

% make install
[  2%] Built target metadata_fbs
[ 42%] Built target arrow_objlib
[ 42%] Built target arrow_shared
[ 42%] Built target arrow_static
[ 44%] Built target arrow_test_main
[ 46%] Built target array-test
[ 48%] Built target column-test
...
-- Installing: /opt/apache-arrow-0.1.0/cpp/dist/include/arrow/types/union.h
-- Installing: /opt/apache-arrow-0.1.0/cpp/dist/include/arrow/ipc/adapter.h
-- Installing: /opt/apache-arrow-0.1.0/cpp/dist/include/arrow/ipc/file.h
-- Installing: /opt/apache-arrow-0.1.0/cpp/dist/include/arrow/ipc/metadata.h
-- Installing: /opt/apache-arrow-0.1.0/cpp/dist/lib/libarrow_ipc.so
-- Removed runtime path from 
"/opt/apache-arrow-0.1.0/cpp/dist/lib/libarrow_ipc.so"

% cd ../../python
% tree $ARROW_HOME
/opt/apache-arrow-0.1.0/cpp/dist
|-- include
|   `-- arrow
|   |-- api.h
|   |-- array.h
|   |-- builder.h
|   |-- column.h
|   |-- io
|   |   |-- file.h
|   |   |-- hdfs.h
|   |   |-- interfaces.h
|   |   `-- memory.h
|   |-- ipc
|   |   |-- adapter.h
|   |   |-- file.h
|   |   `-- metadata.h
|   |-- schema.h
|   |-- table.h
|   |-- test-util.h
|   |-- type.h
|   |-- types
|   |   |-- collection.h
|   |   |-- construct.h
|   |   |-- datetime.h
|   |   |-- decimal.h
|   |   |-- json.h
|   |   |-- list.h
|   |   |-- primitive.h
|   |   |-- string.h
|   |   |-- struct.h
|   |   `-- union.h
|   `-- util
|   |-- bit-util.h
|   |-- buffer.h
|   |-- logging.h
|   |-- macros.h
|   |-- memory-pool.h
|   |-- random.h
|   |-- status.h
|   `-- visibility.h
`-- lib
|-- libarrow.a
|-- libarrow.so
|-- libarrow_io.so
`-- libarrow_ipc.so

7 directories, 37 files

% python setup.py build_ext --inplace
/home/pivarski/.local/lib/python2.7/site-packages/setuptools/dist.py:331: 
UserWarning: Normalizing '0.1.0dev' to '0.1.0.dev0'
  normalized_version,
running build_ext
creating build
creating build/temp.linux-x86_64-2.7
cmake  -DPYTHON_EXECUTABLE=/usr/bin/python   /opt/apache-arrow-0.1.0/python
-- The C compiler identification is GNU 4.8.4
-- The CXX compiler identification is GNU 4.8.4
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc