[jira] [Created] (ARROW-18428) [Website] Enable github issues on arrow-site repo

2022-12-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18428:
-

 Summary: [Website] Enable github issues on arrow-site repo
 Key: ARROW-18428
 URL: https://issues.apache.org/jira/browse/ARROW-18428
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Joris Van den Bossche


Now that we are moving to GitHub issues, it probably makes sense to open issues 
about the website in the arrow-site repo itself, instead of keeping them in the 
main arrow repo.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18363) [Docs] Include warning when viewing old contributing docs (redirecting to dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18363:
-

 Summary: [Docs] Include warning when viewing old contributing docs 
(redirecting to dev docs)
 Key: ARROW-18363
 URL: https://issues.apache.org/jira/browse/ARROW-18363
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Joris Van den Bossche


Now that we have versioned docs, we also have old versions of the developers 
docs (eg 
https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
might be outdated (eg regarding communication channels, build instructions, 
etc), and when contributing to / developing with the latest arrow, one 
should _always_ check the latest dev version of the contributing docs.

We could add a warning box pointing this out and linking to the dev docs. 

For example, similar to how some projects warn about viewing old docs in general 
and point to the stable docs (eg https://mne.tools/1.1/index.html or 
https://scikit-learn.org/1.0/user_guide.html). In this case we could show a 
custom box on pages under /developers that points to the dev docs instead of 
the stable docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18340) [Python] PyArrow C++ header files no longer always included in installed pyarrow

2022-11-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18340:
-

 Summary: [Python] PyArrow C++ header files no longer always 
included in installed pyarrow
 Key: ARROW-18340
 URL: https://issues.apache.org/jira/browse/ARROW-18340
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche
Assignee: Alenka Frim
 Fix For: 10.0.1


We have a python build env var to control whether the Arrow C++ header files 
are included in the python package or not 
({{PYARROW_BUNDLE_ARROW_CPP_HEADERS}}). This is set to True by default, and 
set to False only in the conda recipe.

After the cmake refactor, the Python C++ header files no longer live in the 
Arrow C++ package, and so should _always_ be included in the python package, 
regardless of how arrow-cpp is installed. 
Initially this was the case, but it seems that 
https://github.com/apache/arrow/pull/13892 removed this unconditional copy of 
the PyArrow header files to {{pyarrow/include}}. Now they are only copied if 
{{PYARROW_BUNDLE_ARROW_CPP_HEADERS}} is enabled.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18329) [Python][CI] Support ORC in Windows wheels

2022-11-15 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18329:
-

 Summary: [Python][CI] Support ORC in Windows wheels
 Key: ARROW-18329
 URL: https://issues.apache.org/jira/browse/ARROW-18329
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Now that we support building with ORC enabled on Windows (ARROW-17817), we could 
also enable it in the Python wheel packages for Windows (vcpkg seems to have an 
ORC port for Windows as well).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18293) [C++] Proxy memory pool crashes with Dataset scanning

2022-11-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18293:
-

 Summary: [C++] Proxy memory pool crashes with Dataset scanning
 Key: ARROW-18293
 URL: https://issues.apache.org/jira/browse/ARROW-18293
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Discovered while trying to use the proxy memory pool for testing ARROW-18164

See https://github.com/apache/arrow/pull/14516#discussion_r1005433867

This test segfaults (using the fixture in {{test_dataset.py}}):

{code:python}
@pytest.mark.parquet
def test_scanner_proxy_memory_pool(dataset):
    proxy_pool = pa.proxy_memory_pool(pa.default_memory_pool())
    _ = dataset.to_table(memory_pool=proxy_pool)
{code}

Response of [~westonpace]:

{quote}My guess is that the problem is that the scanner erroneously returns 
before all work is completely finished. Changing the thread pool or the memory 
pool too quickly after a scan can lead to this kind of error. The new scanner 
was created specifically to avoid this problem but it isn't the default yet 
(still working through some follow-up PRs to make sure we have the same 
functionality).{quote}

So once the new scanner becomes the default, we can check whether this is fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18164) [C++][Python] Dataset scanner does not follow default memory pool setting

2022-10-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18164:
-

 Summary: [C++][Python] Dataset scanner does not follow default 
memory pool setting
 Key: ARROW-18164
 URL: https://issues.apache.org/jira/browse/ARROW-18164
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche


Even if I set the system memory pool as default, it still uses the jemalloc one 
(running this on Ubuntu where jemalloc is the default if not set by the user):

{code}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
pq.write_table(pa.table({'a': [1, 2, 3]}), "test.parquet")

In [2]: pa.set_memory_pool(pa.system_memory_pool())

In [3]: pa.total_allocated_bytes()
Out[3]: 0

In [4]: table = ds.dataset("test.parquet").to_table()

In [5]: pa.total_allocated_bytes()
Out[5]: 0

In [6]: pa.set_memory_pool(pa.jemalloc_memory_pool())

In [7]: pa.total_allocated_bytes()
Out[7]: 128
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18127) [CI][Python] Have a way to reproduce hypothesis failures from CI

2022-10-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18127:
-

 Summary: [CI][Python] Have a way to reproduce hypothesis failures 
from CI
 Key: ARROW-18127
 URL: https://issues.apache.org/jira/browse/ARROW-18127
 Project: Apache Arrow
  Issue Type: Test
  Components: Continuous Integration, Python
Reporter: Joris Van den Bossche


We have a nightly test build with hypothesis enabled, and those tests fail / 
crash from time to time, eg 
https://github.com/ursacomputing/crossbow/actions/runs/3286024804/jobs/5413689973

Ideally, if there is such a failure, we should actually fix that test case. But 
that requires us to be able to reproduce the failure locally. 
If it's an actual test failure, hypothesis should print some information to 
re-run it locally with the same input 
(https://hypothesis.readthedocs.io/en/latest/reproducing.html#reproducing-an-example-with-reproduce-failure).
 
But if it is segfaulting, this information is not printed by default.
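
For reference, when hypothesis does print that information, it is a decorator to copy 
onto the failing test to replay the exact same input locally (a sketch; the version 
string and base64 blob below are placeholders, not taken from an actual failure):

{code:python}
from hypothesis import given, reproduce_failure, strategies as st

# the version and blob are placeholders; hypothesis prints the real ones on failure
@reproduce_failure("6.54.1", b"AXicY2BgAAAABAAB")
@given(st.integers())
def test_example(value):
    ...
{code}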

Another idea might be to save the ./hypothesis/examples directory as an artifact of 
the CI build, to use it locally, but that probably has the same issue of 
not having the information we need in case of a crash.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18126) [Python] ARROW_BUILD_DIR might be ignored for building pyarrow?

2022-10-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18126:
-

 Summary: [Python] ARROW_BUILD_DIR might be ignored for building 
pyarrow?
 Key: ARROW-18126
 URL: https://issues.apache.org/jira/browse/ARROW-18126
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


When building pyarrow, I see the following warning:

{code}
CMake Warning:
  Manually-specified variables were not used by the project:

ARROW_BUILD_DIR
{code}

While we have a note in our docs 
(https://arrow.apache.org/docs/dev/developers/python.html#build-and-test) that 
says:

bq. If you used a different directory name for building Arrow C++ (by default 
it is named “build”), then you should also set the environment variable 
{{ARROW_BUILD_DIR='name_of_build_dir'}}. This way PyArrow can find the Arrow 
C++ built files.

I see in the setup.py code that we check for this env variable and pass it to 
CMake, but it's not actually used in any of the CMakeLists.txt files for 
pyarrow.

This might have been accidentally changed in one of the recent cmake refactors? 
(cc [~kou] [~alenka])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18125) [Python] Handle pytest 8 deprecations about pytest.warns(None)

2022-10-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18125:
-

 Summary: [Python] Handle pytest 8 deprecations about 
pytest.warns(None) 
 Key: ARROW-18125
 URL: https://issues.apache.org/jira/browse/ARROW-18125
 Project: Apache Arrow
  Issue Type: Test
Reporter: Joris Van den Bossche
 Fix For: 11.0.0


We have a few warnings about that when running the tests, for example:

{code}
pyarrow/tests/test_pandas.py::TestConvertMetadata::test_rangeindex_doesnt_warn
pyarrow/tests/test_pandas.py::TestConvertMetadata::test_multiindex_doesnt_warn
  
/home/joris/miniconda3/envs/arrow-dev/lib/python3.10/site-packages/_pytest/python.py:192:
 PytestRemovedIn8Warning: Passing None has been deprecated.
  See 
https://docs.pytest.org/en/latest/how-to/capture-warnings.html#additional-use-cases-of-warnings-in-tests
 for alternatives in common use cases.
result = testfunction(**testargs)

{code}
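
One possible replacement pattern (a sketch; the test name just mirrors the warnings 
above and the test body is elided):

{code:python}
import warnings

def test_rangeindex_doesnt_warn():
    # instead of `with pytest.warns(None) as record:` to assert "no warning raised",
    # record warnings explicitly and check that nothing was captured
    with warnings.catch_warnings(record=True) as record:
        warnings.simplefilter("always")
        ...  # code under test (elided)
    assert len(record) == 0
{code}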



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18124) [Python] Support converting to non-nano datetime64 for pandas >= 2.0

2022-10-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18124:
-

 Summary: [Python] Support converting to non-nano datetime64 for 
pandas >= 2.0
 Key: ARROW-18124
 URL: https://issues.apache.org/jira/browse/ARROW-18124
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 11.0.0


Pandas is adding capabilities to store non-nanosecond datetime64 data. At the 
moment, however, we always convert to nanoseconds, regardless of the timestamp 
resolution of the arrow table (and regardless of the pandas metadata).

Using the development version of pandas:

{code}
In [1]: df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")})

In [2]: df.dtypes
Out[2]: 
col    datetime64[s]
dtype: object

In [3]: table = pa.table(df)

In [4]: table.schema
Out[4]: 
col: timestamp[s]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 423

In [6]: table.to_pandas().dtypes
Out[6]: 
col    datetime64[ns]
dtype: object
{code}

This is because we have a {{coerce_temporal_nanoseconds}} conversion option 
which we hardcode to True (for top-level columns, we hardcode it to False for 
nested data). 

When users have pandas >= 2, we should support converting while preserving the 
resolution. We should certainly do so if the pandas metadata indicates which 
resolution was originally used (to ensure correct roundtrip). 
We _could_ (and at some point also _should_) do that by default as well if there 
is no pandas metadata (but maybe only later, depending on how stable this new 
feature is in pandas, as it is potentially a breaking change for users who eg 
use pyarrow to read a parquet file).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18107) [C++] Provide more informative error when (CSV/JSON) parsing fails

2022-10-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18107:
-

 Summary: [C++] Provide more informative error when (CSV/JSON) 
parsing fails
 Key: ARROW-18107
 URL: https://issues.apache.org/jira/browse/ARROW-18107
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Related to ARROW-18106 (and derived from 
https://stackoverflow.com/questions/74138746/why-i-cant-parse-timestamp-in-pyarrow).
 

Assume you have the following code to read a JSON file with timestamps. The 
timestamps have a sub-second part in their string, which fails parsing if you 
specify it as second resolution timestamp:

{code:python}
import io
import pyarrow as pa
from pyarrow import json

s_json = """{"column":"2022-09-05T08:08:46.000"}"""

opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]), unexpected_field_behavior="ignore")
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
{code}

gives:

{code}
ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
parse:2022-09-05T08:08:46.000
{code}

This error is expected, but I think it could be more informative about the 
reason why it failed parsing (because at first sight it looks like a proper 
timestamp string, so you might be left wondering why this is failing). 

(this might not be that straightforward, though, since there can be many 
reasons why the parsing is failing)







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

2022-10-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18106:
-

 Summary: [C++] JSON reader ignores explicit schema with default 
unexpected_field_behavior="infer"
 Key: ARROW-18106
 URL: https://issues.apache.org/jira/browse/ARROW-18106
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
between two options.

By default, when reading json, we _infer_ the data type of columns, and when 
specifying an explicit schema, we _also_ by default infer the type of columns 
that are not specified in the explicit schema. The docs for 
{{unexpected_field_behavior}}:

> How JSON fields outside of explicit_schema (if given) are treated

But it seems that if you specify a schema, and the parsing of one of the 
columns fails according to that schema, we still fall back to this default of 
inferring the data type (while I would have expected an error, since we should 
only infer for columns _not_ in the schema).

Example code using pyarrow:

{code:python}
import io
import pyarrow as pa
from pyarrow import json

s_json = """{"column":"2022-09-05T08:08:46.000"}"""

opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]))
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
{code}

The parsing fails here because there are milliseconds and the type is "s", but 
the explicit schema is ignored, and we get a result with a string column 
instead:

{code}
pyarrow.Table
column: string

column: [["2022-09-05T08:08:46.000"]]
{code}

But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
expected parse error:

{code:python}
opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]), unexpected_field_behavior="ignore")
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
{code}

gives

{code}
ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
parse:2022-09-05T08:08:46.000
{code}


It might be that this is specific to timestamps; I don't directly see a similar 
issue with eg {{"column": "A"}} and setting the schema to "column" being int64.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18098) [C++] Vector kernel for "intersecting" two arrays (all common elements)

2022-10-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18098:
-

 Summary: [C++] Vector kernel for "intersecting" two arrays (all 
common elements)
 Key: ARROW-18098
 URL: https://issues.apache.org/jira/browse/ARROW-18098
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Joris Van den Bossche


This would be similar to numpy's {{intersect1d}} 
(https://numpy.org/doc/stable/reference/generated/numpy.intersect1d.html)
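
Until such a kernel exists, a rough equivalent can be pieced together from existing 
kernels (a sketch, assuming flat arrays of the same type; null handling would need 
extra care):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, 2, 3, 4])
b = pa.array([3, 4, 5])

# keep the elements of `a` that also occur in `b`, then deduplicate and sort
common = pc.unique(pc.filter(a, pc.is_in(a, value_set=b)))
common = common.take(pc.sort_indices(common))
# -> [3, 4]
{code}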



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18097) [C++] Add a "list_contains" kernel

2022-10-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18097:
-

 Summary: [C++] Add a "list_contains" kernel
 Key: ARROW-18097
 URL: https://issues.apache.org/jira/browse/ARROW-18097
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Joris Van den Bossche


Assume you have a list array:

{code}
arr = pa.array([["a", "b"], ["a", "c"], ["b", "c", "d"]])
{code}

And you want to know for each list if it contains a certain value (of the same 
type as the list's values). A "list_contains" function (or other name) would be 
useful for that:

{code}
pc.list_contains(arr, "a")
# -> True, True, False
{code}

The current workaround that I found was flattening, checking equality, and then 
reducing again with groupby, but this is quite tedious:

{code}
>>> temp = pa.table({'index': pc.list_parent_indices(arr), 'contains_value': pc.equal(pc.list_flatten(arr), "a")})
>>> temp.group_by('index').aggregate([('contains_value', 'any')])['contains_value_any'].chunk(0)

[
  true,
  true,
  false
]
{code}

But this also only works if there are no empty or missing list values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18096) [Dev] Remove github user names from merge commit message

2022-10-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18096:
-

 Summary: [Dev] Remove github user names from merge commit message
 Key: ARROW-18096
 URL: https://issues.apache.org/jira/browse/ARROW-18096
 Project: Apache Arrow
  Issue Type: Task
  Components: Developer Tools
Reporter: Joris Van den Bossche


We currently use the top post comment body of a github PR as the body of the 
commit message. It is not uncommon to tag someone when opening a PR, but 
retaining those github usernames in the commit message is annoying as that can 
generate additional notifications for the people that were tagged.

It should be straightforward to remove the github user names from the message 
body (for example, just remove the @, so it no longer works as a user name 
link).
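
A minimal sketch of the kind of substitution meant (a hypothetical helper, not the 
actual merge script; a real version would need to avoid touching e.g. email addresses 
or code blocks):

{code:python}
import re

def defuse_mentions(body: str) -> str:
    # drop the "@" in front of GitHub user names so they no longer trigger
    # notifications, e.g. "thanks @someuser!" -> "thanks someuser!"
    return re.sub(r"@([A-Za-z0-9][A-Za-z0-9-]*)", r"\1", body)
{code}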



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18088) [Python][CI] Build with pandas master/nightly failure related to timedelta64 resolution

2022-10-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18088:
-

 Summary: [Python][CI] Build with pandas master/nightly failure 
related to timedelta64 resolution
 Key: ARROW-18088
 URL: https://issues.apache.org/jira/browse/ARROW-18088
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche


The nightly python builds using the pandas development version are failing: 
https://github.com/ursacomputing/crossbow/actions/runs/3269767207/jobs/5377649455

Example failure:

{code}
_______________________ test_parquet_2_0_roundtrip[None-True] _______________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_parquet_2_0_roundtrip_Non0')
chunk_size = None, use_legacy_dataset = True

@pytest.mark.pandas
@parametrize_legacy_dataset
@pytest.mark.parametrize('chunk_size', [None, 1000])
def test_parquet_2_0_roundtrip(tempdir, chunk_size, use_legacy_dataset):
    df = alltypes_sample(size=1, categorical=True)

    filename = tempdir / 'pandas_roundtrip.parquet'
    arrow_table = pa.Table.from_pandas(df)
    assert arrow_table.schema.pandas_metadata is not None

    _write_table(arrow_table, filename, version='2.6',
                 coerce_timestamps='ms', chunk_size=chunk_size)
    table_read = pq.read_pandas(
        filename, use_legacy_dataset=use_legacy_dataset)
    assert table_read.schema.pandas_metadata is not None

    read_metadata = table_read.schema.metadata
    assert arrow_table.schema.metadata == read_metadata

    df_read = table_read.to_pandas()
>   tm.assert_frame_equal(df, df_read)
E   AssertionError: Attributes of DataFrame.iloc[:, 12] (column 
name="timedelta") are different
E   
E   Attribute "dtype" are different
E   [left]:  timedelta64[s]
E   [right]: timedelta64[ns]

opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/tests/parquet/test_data_types.py:76:
 AssertionError
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18087) [C++] RecordBatch::Equals ignores field names

2022-10-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18087:
-

 Summary: [C++] RecordBatch::Equals ignores field names
 Key: ARROW-18087
 URL: https://issues.apache.org/jira/browse/ARROW-18087
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


The {{RecordBatch::Equals}} method only checks the equality of the schema of 
both batches if {{check_metadata=True}}, with the result that it doesn't actually 
check the schema (eg field names) by default.

Python illustration:

{code}
In [3]: batch1 = pa.record_batch(pd.DataFrame({'a': [1, 2, 3]}))

In [4]: batch2 = pa.record_batch(pd.DataFrame({'b': [1, 2, 3]}))

In [5]: batch1.equals(batch2)
Out[5]: True

In [6]: batch1.equals(batch2, check_metadata=True)
Out[6]: False
{code}

My expectation is that RecordBatch equality always requires equal field names 
(as Table::Equals does). And the {{check_metadata}} keyword should only control 
whether the metadata of the schema is considered (as the documentation also 
says), not whether the schema is checked at all.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17989) [C++] Enable struct_field kernel to accept string field names

2022-10-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17989:
-

 Summary: [C++] Enable struct_field kernel to accept string field 
names
 Key: ARROW-17989
 URL: https://issues.apache.org/jira/browse/ARROW-17989
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently the "struct_field" kernel only works for integer indices for the 
child fields. From the StructFieldOption class 
(https://github.com/apache/arrow/blob/3d7f2f22a0fc441a41b8fa971e11c0f4290ebb24/cpp/src/arrow/compute/api_scalar.h#L283-L285):

{code}
  /// The child indices to extract. For instance, to get the 2nd child
  /// of the 1st child of a struct or union, this would be {0, 1}.
  std::vector<int> indices;
{code}

It would be nice if you could also refer to fields by name in addition to by 
position.
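
In pyarrow terms, a sketch of the current usage next to the proposed one (the by-name 
form is the proposal and does not work yet):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])

pc.struct_field(arr, indices=[0])   # works today: child selected by position
# proposed: also accept field names, e.g. something like
# pc.struct_field(arr, ["a"])
{code}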



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17960) [C++] Add kernel for slicing list values

2022-10-07 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17960:
-

 Summary: [C++] Add kernel for slicing list values
 Key: ARROW-17960
 URL: https://issues.apache.org/jira/browse/ARROW-17960
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


This would be a scalar kernel "List -> List" (or to fixed size list?), 
where you can subset the values in each list element. 

So for example, giving the list array:

{code}
arr = pa.array([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
{code}

we could do something like the following to get the first two elements of each 
list:

{code}
pc.list_slice(arr, start=0, stop=2)
->  pa.array([[1, 2], [4, 5], [8, 9]])
{code}

This would supplement the existing {{list_element}} kernel.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17959) [C++][Dataset]

2022-10-07 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17959:
-

 Summary: [C++][Dataset] 
 Key: ARROW-17959
 URL: https://issues.apache.org/jira/browse/ARROW-17959
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently, when reading a subfield of a nested column of a Parquet file using 
the Dataset API, we read the full parent column instead of only the requested 
field. This should be optimized to only read the field itself.

This was left as a TODO in ARROW-14658 
(https://github.com/apache/arrow/pull/11704) which added the initial support 
for nested field refs in dataset scanning 
(https://github.com/apache/arrow/blob/c29ca51f44eaf41c3a2f6f72e3e23a7b428211c2/cpp/src/arrow/dataset/file_parquet.cc#L240-L246):

{code}
  if (field) {
    // TODO(ARROW-1888): support fine-grained column projection. We should be
    // able to materialize only the child fields requested, and not the entire
    // top-level field.
    // Right now, if enabled, projection/filtering will fail when they cast the
    // physical schema to the dataset schema.
    AddColumnIndices(*toplevel, columns_selection);
{code}

Some relevant comments at 
https://github.com/apache/arrow/pull/11704#discussion_r749733765. ARROW-1888 
was mentioned as a blocker back then, but it has been resolved in the meantime.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17925) [Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas?

2022-10-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17925:
-

 Summary: [Python] Use ExtensionScalar.as_py() as fallback in 
ExtensionArray to_pandas?
 Key: ARROW-17925
 URL: https://issues.apache.org/jira/browse/ARROW-17925
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


This was raised in ARROW-17813 by [~changhiskhan]:

{quote}*ExtensionArray => pandas*

Just for discussion, I was curious whether you had any thoughts around using 
the extension scalar as a fallback mechanism. It's a lot simpler to define an 
ExtensionScalar with `as_py` than a pandas extension dtype. So if an 
ExtensionArray doesn't have an equivalent pandas dtype, would it make sense to 
convert it to just an object series whose elements are the result of `as_py`? 
{quote}

and I also mentioned this in ARROW-17535:

{quote}That actually brings up a question: if an ExtensionType defines an 
ExtensionScalar (but not an associated pandas dtype, or custom to_numpy 
conversion), should we use this scalar's {{as_py()}} for the to_numpy/to_pandas 
conversion as well for plain extension arrays? (not the nested case) 

Because currently, if you have an ExtensionArray like that (for example using 
the example from the docs: 
https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion),
 we still use the storage type conversion for to_numpy/to_pandas, and only use 
the scalar's conversion in {{to_pylist}}.{quote}
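
For reference, a condensed version of the docs example referenced above (a sketch; 
the type, name, and values are illustrative):

{code:python}
import pyarrow as pa

class Point3DScalar(pa.ExtensionScalar):
    def as_py(self):
        # self.value is the storage scalar (a fixed-size list here)
        return tuple(self.value.as_py())

class Point3DType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.list_(pa.float64(), 3), "example.point3d")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar

storage = pa.array([[1.0, 2.0, 3.0]], pa.list_(pa.float64(), 3))
arr = pa.ExtensionArray.from_storage(Point3DType(), storage)

arr.to_pylist()   # uses Point3DScalar.as_py() -> [(1.0, 2.0, 3.0)]
arr.to_pandas()   # currently ignores as_py() and converts the storage values
{code}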



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17924) [Docs] Clarify immutability assumption in the C Data Interface documentation

2022-10-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17924:
-

 Summary: [Docs] Clarify immutability assumption in the C Data 
Interface documentation
 Key: ARROW-17924
 URL: https://issues.apache.org/jira/browse/ARROW-17924
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation, Format
Reporter: Joris Van den Bossche


The current documentation 
(https://arrow.apache.org/docs/dev/format/CDataInterface.html) is not explicit 
about whether there are any guarantees about (im)mutability. 

My assumption is that the _consumer_ of C Data Interface structs should 
_assume_ the data to be immutable by default (unless they would know that the 
producer is fine with mutating the data). But it would be good to document this.

(as a reference, the DLPack Python docs mention this: 
https://dmlc.github.io/dlpack/latest/python_spec.html#semantics)




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17890) [C++][Python] Allow an ExtensionType to register or implement custom casts

2022-09-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17890:
-

 Summary: [C++][Python] Allow an ExtensionType to register or 
implement custom casts
 Key: ARROW-17890
 URL: https://issues.apache.org/jira/browse/ARROW-17890
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Joris Van den Bossche


With ARROW-14500 and ARROW-15545 (https://github.com/apache/arrow/pull/14106), 
we allow casting "storage_type" -> "extension" (the cast in the other 
direction already worked as well). 

Initially, that PR allowed any cast from "any" -> "extension", as 
long as the input type could be cast to the storage type (so deferring to the 
"any" -> "storage_type" cast). However, because whether a certain cast makes 
sense or not depends on the semantics of the extension type, it was restricted 
to exactly matching storage_type. 

One idea could be to still allow the other casts behind a cast option flag, 
like {{allow_non_storage_extension_casts}} (or a better name), so the user can 
explicitly allow casting to/from any type (as long as the cast from/to the 
storage type works).

That could help the user, but for certain casts, the ExtensionType might 
also want to control _how_ such a cast is done. For example, for casting 
to/from a string type (which would be useful for reading/writing CSV files, or 
for repr), you will typically want to do something different than casting your 
storage array to string. 

A more general solution could thus be to have a mechanism for the ExtensionType 
to implement a certain cast kernel itself, and register this to the C++ cast 
dispatching.








--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17834) [Python] Allow creating ExtensionArray through pa.array(..) constructor

2022-09-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17834:
-

 Summary: [Python] Allow creating ExtensionArray through 
pa.array(..) constructor
 Key: ARROW-17834
 URL: https://issues.apache.org/jira/browse/ARROW-17834
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently, creating an ExtensionArray from a python sequence (or numpy array, 
..) requires the following:

{code:python}
from pyarrow.tests.test_extension_type import IntegerType

storage_array = pa.array([1, 2, 3])
ext_arr = pa.ExtensionArray.from_storage(IntegerType(), storage_array)
{code}

While doing this directly in {{pa.array(..)}} doesn't work:

{code:python}
>>> pa.array([1, 2, 3], type=IntegerType())
ArrowNotImplementedError: extension
{code}

I think it should be possible to basically do the ExtensionArray.from_storage 
under the hood in {{pa.array(..)}} when the specified type is an extension type?

I think this should also enable converting from a pandas DataFrame (with a 
column with matching storage values) to a Table with a specified schema that 
includes an extension type. Like:

{code}
df = pd.DataFrame({'a': [1, 2, 3]})
pa.table(df, schema=pa.schema([('a', IntegerType())]))
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17832) [Python] Construct MapArray from sequence of dicts (instead of list of tuples)

2022-09-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17832:
-

 Summary: [Python] Construct MapArray from sequence of dicts 
(instead of list of tuples)
 Key: ARROW-17832
 URL: https://issues.apache.org/jira/browse/ARROW-17832
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


From https://github.com/apache/arrow/issues/14116

Creating a MapArray from a python sequence currently requires lists of tuples 
as values:

{code:python}
arr = pa.array([[('a', 1), ('b', 2)], [('c', 3)]], pa.map_(pa.string(), pa.int64()))
{code}

While I think it makes sense that the following could also work (using dicts 
instead):

{code:python}
arr = pa.array([{'a': 1, 'b': 2}, {'c': 3}], pa.map_(pa.string(), pa.int64()))
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17831) [Python][Docs] PyArrow Architecture page outdated after moving pyarrow C++ code

2022-09-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17831:
-

 Summary: [Python][Docs] PyArrow Architecture page outdated after 
moving pyarrow C++ code
 Key: ARROW-17831
 URL: https://issues.apache.org/jira/browse/ARROW-17831
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


This section is no longer up to date: 
https://arrow.apache.org/docs/dev/python/getting_involved.html#pyarrow-architecture

(it still mentions cpp/src/arrow/python)

cc [~alenka]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17829) [Python] Avoid pandas groupby deprecation warning write_to_dataset

2022-09-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17829:
-

 Summary: [Python] Avoid pandas groupby deprecation warning 
write_to_dataset
 Key: ARROW-17829
 URL: https://issues.apache.org/jira/browse/ARROW-17829
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


I noticed the following warnings in our test builds:

{code}
opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py::test_make_fragment
  
/opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py:197:
 FutureWarning: In a future version of pandas, a length 1 tuple will be 
returned when iterating over a groupby with a grouper equal to a list of length 
1. Don't supply a list with a single grouper to avoid this warning.
for part, chunk in df_d.groupby(["color"]):

opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py::test_legacy_write_to_dataset_drops_null
opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/parquet/test_pandas.py::test_write_to_dataset_pandas_preserve_extensiondtypes[True]
opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/parquet/test_pandas.py::test_write_to_dataset_pandas_preserve_index[True]
  
/opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/parquet/core.py:3326:
 FutureWarning: In a future version of pandas, a length 1 tuple will be 
returned when iterating over a groupby with a grouper equal to a list of length 
1. Don't supply a list with a single grouper to avoid this warning.
for keys, subgroup in data_df.groupby(partition_keys):
{code}

I suppose those are coming from pandas 1.5.0. We should investigate whether 
this is something to fix in our code (or just in the tests).
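
One possible fix on our side (a sketch of the idea, not the actual 
{{write_to_dataset}} code):

{code:python}
import pandas as pd

data_df = pd.DataFrame({"color": ["red", "blue", "red"], "value": [1, 2, 3]})
partition_keys = ["color"]

# only pass a list to groupby when there is more than one partition key,
# so pandas does not warn about length-1 list groupers
keys = partition_keys[0] if len(partition_keys) == 1 else partition_keys
for group_keys, subgroup in data_df.groupby(keys):
    print(group_keys, len(subgroup))
{code}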



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17827) [Python] Allow calling UDF kernels with field/scalar expressions

2022-09-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17827:
-

 Summary: [Python] Allow calling UDF kernels with field/scalar 
expressions
 Key: ARROW-17827
 URL: https://issues.apache.org/jira/browse/ARROW-17827
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


From https://github.com/apache/arrow/pull/13687#issuecomment-1240399112, where 
it came up while adding documentation on how to use UDFs in Python. When just 
wanting to invoke a UDF with arrays, you can do {{pc.call_function("my_udf", 
[pc.field("a")])}}. 

But if you want to use your UDF in a context that needs an expression (eg a 
dataset projection), you need to be able to call the UDF with expressions as 
argument. And currently, the {{pc.call_function}} doesn't work that way (it 
expects actual, materialized arrays/scalars as arguments). As a workaround, you 
can use the private {{Expression._call}}:

{code:python}
# doesn't work with expressions
>>> pc.call_function("my_udf", [pc.field("col")])
...
TypeError: Got unexpected argument type  for compute function
# workaround
>>> pc.Expression._call("my_udf", [pc.field("col")])

{code}

So we should try to improve the usability here. Some options:

* See if we can change {{pc.call_function}} to also accept Expressions as 
arguments
* Make the {{_call}} public, so one can do {{pc.Expression.call("my_udf", 
[..])}}

cc [~westonpace] [~vibhatha]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17826) [Python] Allow scalars when creating expression from compute kernels

2022-09-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17826:
-

 Summary: [Python] Allow scalars when creating expression from 
compute kernels
 Key: ARROW-17826
 URL: https://issues.apache.org/jira/browse/ARROW-17826
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


We can create an expression (eg for a projection) using the compute kernels and 
passing expressions as arguments. But currently, all other arguments need to be 
expressions:

{code:python}
>>> pc.add(pc.field("a"), pc.field("b"))# this works

>>> pc.add(pc.field("a"), 1)   # this fails when passing scalar (same for 
>>> pa.scalar(1))
...
TypeError: only other expressions allowed as arguments
{code}

You can still pass a scalar expression ({{pc.scalar(1)}}, note {{pc.}} not 
{{pa.}}), but I think for scalars it would be a nice usability improvement if 
you didn't need to manually convert your python or pyarrow scalar to a scalar 
expression. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17651) [Python] ResourceWarnings raised by s3 related tests

2022-09-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17651:
-

 Summary: [Python] ResourceWarnings raised by s3 related tests
 Key: ARROW-17651
 URL: https://issues.apache.org/jira/browse/ARROW-17651
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


Running the python tests give a lot of the following warnings:

{code}
opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/tests/test_fs.py::test_s3fs_limited_permissions_create_bucket
  /opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/tests/util.py:439: 
ResourceWarning: unclosed file <_io.TextIOWrapper name=29 encoding='utf-8'>
_run_mc_command(mcdir, 'admin', 'policy', 'add',
  Enable tracemalloc to get traceback where the object was allocated.
  See 
https://docs.pytest.org/en/stable/how-to/capture-warnings.html#resource-warnings
 for more info.
{code}

Ideally we should ensure the tests don't give such warnings (they also make 
other warnings that we should notice less visible).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17650) [Dev][CI] Add overview of all tasks (including passing) on crossbow dashboard

2022-09-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17650:
-

 Summary: [Dev][CI] Add overview of all tasks (including passing) 
on crossbow dashboard
 Key: ARROW-17650
 URL: https://issues.apache.org/jira/browse/ARROW-17650
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Joris Van den Bossche


https://crossbow.voltrondata.com/ currently shows the failing tasks, but it 
would be useful to also have an overview of all tasks, including the passing 
builds (+ their logs). 

cc [~raulcd] [~assignUser]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17649) [Python] Remove remaining deprecated APIs from <= 1.0.0

2022-09-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17649:
-

 Summary: [Python] Remove remaining deprecated APIs from <= 1.0.0
 Key: ARROW-17649
 URL: https://issues.apache.org/jira/browse/ARROW-17649
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Joris Van den Bossche


Not all deprecations from <=1.0.0 were already removed in ARROW-17010; the 
remaining ones:

- Ignoring mismatch between {{ordered}} flag of values and type in {{array(..)}}
- RecordBatchReader {{get_next_batch}} method
- {{DictionaryScalar.index/dictionary_value}} attributes (deprecated since 
1.0.0)
- {{num_children}} field of DataType
- {{add_metadata}} method of Field



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17139) [Python] Add field() method to get field from StructType

2022-07-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17139:
-

 Summary: [Python] Add field() method to get field from StructType
 Key: ARROW-17139
 URL: https://issues.apache.org/jira/browse/ARROW-17139
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


From ARROW-17047:

We could also add a {{field()}} method to {{StructType}} that returns a 
field? That would be more discoverable than [], and consistent with Schema and 
with StructArray (to get the child array for that field).
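
A sketch of what that could look like (the {{field()}} call is the proposal, not 
existing API):

{code:python}
import pyarrow as pa

t = pa.struct([("a", pa.int64()), ("b", pa.string())])

t[0]           # works today: child field by position
# proposed, mirroring Schema.field():
# t.field("a")  -> pyarrow.Field<a: int64>
{code}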



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17010) [Python] Remove deprecated APIs from <= 1.0.0

2022-07-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-17010:
-

 Summary: [Python] Remove deprecated APIs from <= 1.0.0
 Key: ARROW-17010
 URL: https://issues.apache.org/jira/browse/ARROW-17010
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 9.0.0


Some of the APIs listed in ARROW-13555 were deprecated in 1.0.0 or before, and 
are relatively easy to remove:





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16728) [Python] Switch default and deprecate use_legacy_dataset=True in ParquetDataset

2022-06-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16728:
-

 Summary: [Python] Switch default and deprecate 
use_legacy_dataset=True in ParquetDataset
 Key: ARROW-16728
 URL: https://issues.apache.org/jira/browse/ARROW-16728
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 9.0.0


The ParquetDataset() constructor itself still defaults to 
{{use_legacy_dataset=True}} (although using specific attributes or keywords 
related to that will raise a warning). So a next step will be to actually 
deprecate passing that and switch the default, and only afterwards can we 
remove the code.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16719) [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

2022-06-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16719:
-

 Summary: [Python] Add path/URI /+ filesystem handling to 
parquet.read_metadata
 Key: ARROW-16719
 URL: https://issues.apache.org/jira/browse/ARROW-16719
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently you can pass a local file path or file-like object, or a URI (eg 
"s3://...") or path+filesystem combo to {{parquet.read_table}}. 
But the {{parquet.read_metadata}} and {{parquet.read_schema}} methods (being a 
small wrapper around {{ParquetFile}}) only accept the local file path or 
file-like object. I would propose to add the same path+filesystem handling to 
those functions as happens in {{read_table}}, to make the capabilities of those 
consistent.

(I ran into this in geopandas, where we use {{read_table}} to read the actual 
data, but also need {{read_metadata}} to inspect the actual Parquet 
FileMetaData for metadata)
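
A sketch of what the consistent behaviour would look like (the {{filesystem}} keyword 
on the metadata helpers is the proposal; the S3 setup and paths are illustrative):

{code:python}
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # illustrative

# works today: read_table accepts a URI or a path + filesystem
table = pq.read_table("my-bucket/data.parquet", filesystem=s3)

# proposed: the same handling for the small metadata wrappers
# pq.read_metadata("my-bucket/data.parquet", filesystem=s3)
# pq.read_schema("my-bucket/data.parquet", filesystem=s3)
{code}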



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16652) [Python][C++] Cast compute kernel segfaults when called with a Table

2022-05-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16652:
-

 Summary: [Python][C++] Cast compute kernel segfaults when called 
with a Table
 Key: ARROW-16652
 URL: https://issues.apache.org/jira/browse/ARROW-16652
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche


Passing a Table to {{pyarrow.compute.cast}} with a scalar type gives a 
segfault:

{code}
In [1]: table = pa.table({'a': [1, 2]})

In [2]: import pyarrow.compute as pc

In [3]: pc.cast(table, pa.int64())
Segmentation fault (core dumped)
{code}

Backtrace with gdb gives:

{code}
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x7fba01685ada in arrow::DataType::id (this=0x0) at ../src/arrow/type.h:172
172   Type::type id() const { return id_; }
(gdb) bt
#0  0x7fba01685ada in arrow::DataType::id (this=0x0) at 
../src/arrow/type.h:172
#1  0x7fba019e150e in arrow::TypeEquals (left=..., right=..., 
check_metadata=false) at ../src/arrow/compare.cc:1304
#2  0x7fba01b3484a in arrow::DataType::Equals (this=0x0, other=..., 
check_metadata=false) at ../src/arrow/type.cc:374
#3  0x7fba01f31678 in arrow::compute::internal::(anonymous 
namespace)::CastMetaFunction::ExecuteImpl (this=0x55b6ebe63860, args=..., 
options=0x55b6ec377080, ctx=0x7ffcd8cd43a0)
at ../src/arrow/compute/cast.cc:116
#4  0x7fba020d9f39 in arrow::compute::MetaFunction::Execute 
(this=0x55b6ebe63860, args=..., options=0x55b6ec377080, ctx=0x7ffcd8cd43a0) at 
../src/arrow/compute/function.cc:388
#5  0x7fb9ba95c8d9 in __pyx_pf_7pyarrow_8_compute_8Function_6call 
(__pyx_v_self=0x7fb9b7c19af0, __pyx_v_args=[], __pyx_v_options=0x7fb9b7c1c310, 
__pyx_v_memory_pool=0x55b6ea466d60 <_Py_NoneStruct>) at 
/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.8/_compute.cpp:11292
#6  0x7fb9ba95c3d5 in __pyx_pw_7pyarrow_8_compute_8Function_7call 
(__pyx_v_self=, 
__pyx_args=([],), 
__pyx_kwds={'options': , 
'memory_pool': None}) at 
/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.8/_compute.cpp:11165
#7  0x55b6ea1fb814 in cfunction_call_varargs (kwargs=, 
args=, func=)
at 
/home/conda/feedstock_root/build_artifacts/python-split_1606502903469/work/Objects/call.c:772
#8  PyCFunction_Call (func=, args=, kwargs=)
at 
/home/conda/feedstock_root/build_artifacts/python-split_1606502903469/work/Objects/call.c:772
#9  0x7fb9ba9e84e2 in __Pyx_PyObject_Call (func=, 
arg=([],), 
kw={'options': , 'memory_pool': 
None}) at 
/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.8/_compute.cpp:57961
#10 0x7fb9ba961add in __pyx_pf_7pyarrow_8_compute_6call_function 
(__pyx_self=0x0, __pyx_v_name='cast', __pyx_v_args=[], 
__pyx_v_options=, 
__pyx_v_memory_pool=None) at 
/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.8/_compute.cpp:13408
#11 0x7fb9ba961676 in __pyx_pw_7pyarrow_8_compute_7call_function 
(__pyx_self=0x0, __pyx_args=('cast', [], ), __pyx_kwds=0x0)
...

{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16651) [Python] Casting Table to new schema ignores nullability of fields

2022-05-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16651:
-

 Summary: [Python] Casting Table to new schema ignores nullability 
of fields
 Key: ARROW-16651
 URL: https://issues.apache.org/jira/browse/ARROW-16651
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Similar to ARROW-15478, but for casting a full Table rather than nested fields 
(in theory that could be the same code, but currently the Table.cast 
logic is implemented in cython). 

So currently, when casting a Table to a new schema, the nullability of the 
fields in the schema is ignored (and as a result you get an "invalid" schema 
indicating a field is non-nullable while it actually contains nulls):

{code}
>>> table = pa.table({'a': [None, 1]})
>>> table
pyarrow.Table
a: int64

a: [[null,1]]

>>> new_schema = pa.schema([pa.field("a", "int64", nullable=False)])
>>> table.cast(new_schema)
pyarrow.Table
a: int64 not null

a: [[null,1]]
{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16460) [Python] Some dataset tests using PyFileSystem are failing on Windows

2022-05-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16460:
-

 Summary: [Python] Some dataset tests using PyFileSystem are 
failing on Windows
 Key: ARROW-16460
 URL: https://issues.apache.org/jira/browse/ARROW-16460
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


We have some dataset tests that are skipped on Windows, because they are 
failing with FileNotFound errors.

* 
https://github.com/apache/arrow/blob/3c3e68c194ca6ac07086ddc1bb44fe153970213e/python/pyarrow/tests/test_dataset.py#L3261-L3264
* https://github.com/apache/arrow/blob/893faa741f34ee450070503566dafb7291e24d9f/python/pyarrow/tests/test_dataset.py#L3124-L3145 
(and see https://github.com/apache/arrow/pull/13033#issuecomment-1116180259 
for some analysis)

In the second case, it seems that for some reason, the file paths of the 
fragments are relative paths to the root of the dataset (while locally for me 
this gives absolute paths). 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16458) [Python] Run S3 tests in the nightly dask integration build

2022-05-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16458:
-

 Summary: [Python] Run S3 tests in the nightly dask integration 
build
 Key: ARROW-16458
 URL: https://issues.apache.org/jira/browse/ARROW-16458
 Project: Apache Arrow
  Issue Type: Test
  Components: Continuous Integration, Python
Reporter: Joris Van den Bossche


As a follow-up on https://github.com/apache/arrow/pull/13033 (ARROW-16413), we 
should update the {{integration_dask.sh}} script to also run the S3 tests from 
the dask test suite. 

See 
https://github.com/apache/arrow/pull/13033/commits/1bca56e932434d6b0dc947dd51915d83f9dd3a43
 (in that commit I removed that again, because it was still failing due to some 
moto timeout)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16442) [Python] The fragments for ORC dataset return base Fragment instead of FileFragment

2022-05-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16442:
-

 Summary: [Python] The fragments for ORC dataset return base 
Fragment instead of FileFragment
 Key: ARROW-16442
 URL: https://issues.apache.org/jira/browse/ARROW-16442
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 9.0.0


From https://github.com/dask/dask/pull/8944#issuecomment-1112620037

For the ORC file format, we return base {{Fragment}} objects instead of the 
{{FileFragment}} subclass (which has more functionality):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import orc

table = pa.table({'a': [1, 2, 3]})
orc.write_table(table, "test.orc")
dataset = ds.dataset("test.orc", format="orc")
fragment = list(dataset.get_fragments())[0]
{code}

{code}
In [9]: fragment
Out[9]: 

In [10]: fragment.path
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
 in 
----> 1 fragment.path

AttributeError: 'pyarrow._dataset.Fragment' object has no attribute 'path'
{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16413) [C++][Python] FileFormat::GetReaderAsync hangs with an fsspec filesystem

2022-04-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16413:
-

 Summary: [C++][Python] FileFormat::GetReaderAsync hangs with an 
fsspec filesystem
 Key: ARROW-16413
 URL: https://issues.apache.org/jira/browse/ARROW-16413
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


See https://github.com/dask/dask/pull/8993 for details. 

When using an fsspec filesystem (or maybe more generally a PyFileSystem), 
inspecting a file through FileFormat.inspect hangs (this eg happens in 
ParquetDatasetFactory).
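
A sketch of the kind of setup where this shows up (local fsspec filesystem and path 
shown only for illustration; the original report used a remote filesystem):

{code:python}
import fsspec
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

# wrap an fsspec filesystem so the C++ layer goes through Python callbacks
py_fs = PyFileSystem(FSSpecHandler(fsspec.filesystem("file")))

# ParquetDatasetFactory path where the inspect/GetReaderAsync call hangs
dataset = ds.parquet_dataset("/path/to/_metadata", filesystem=py_fs)
{code}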



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16339) [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata

2022-04-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16339:
-

 Summary: [C++][Parquet] Parquet FileMetaData key_value_metadata 
not always mapped to Arrow Schema metadata
 Key: ARROW-16339
 URL: https://issues.apache.org/jira/browse/ARROW-16339
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Parquet, Python
Reporter: Joris Van den Bossche


Context: I ran into this issue when reading Parquet files created by GDAL 
(using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which 
writes files that have custom key_value_metadata, but without storing 
ARROW:schema in those metadata (cc [~paleolimbot]).

—

Both in reading and writing files, I expected that we would map Arrow 
{{Schema::metadata}} to Parquet {{FileMetaData::key_value_metadata}}. But 
apparently this doesn't (always) happen out of the box, and only happens 
through the "ARROW:schema" field (which stores the original Arrow schema, and 
thus the metadata stored in this schema).

For example, when writing a Table with schema metadata, this is not stored 
directly in the Parquet FileMetaData (code below is using branch from 
ARROW-16337 to have the {{store_schema}} keyword):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
pq.write_table(table, "test_metadata_without_arrow_schema.parquet", 
store_schema=False)

# original schema has metadata
>>> table.schema
a: int64
-- schema metadata --
key: 'value'

# reading back only has the metadata in case we stored ARROW:schema
>>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
a: int64
-- schema metadata --
key: 'value'
# and not if ARROW:schema is absent
>>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
a: int64
{code}
It seems that if we store the ARROW:schema, we _also_ store the schema metadata 
separately. But if {{store_schema}} is False, we also stop writing those 
metadata (not fully sure if this is the intended behaviour, and that's the 
reason for the above output):
{code:python}
# when storing the ARROW:schema, we ALSO store key:value metadata
>>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
{b'ARROW:schema': b'/7AQAAAKAA4ABgAFAA...',
 b'key': b'value'}
# when not storing the schema, we also don't store the key:value
>>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is 
>>> None
True
{code}
On the reading side, it seems that we generally do read custom key/value 
metadata into schema metadata. We don't have the pyarrow APIs at the moment to 
create such a file (given the above), but with a small patch I could create 
such a file:
{code:python}
# a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
>>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
{b'key': b'value'}

# this metadata is now correctly mapped to the Arrow schema metadata
>>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
a: int64
-- schema metadata --
key: 'value'
{code}
But if you have a file that has both custom key/value metadata and an 
"ARROW:schema" key, we actually ignore the custom keys, and only look at the 
"ARROW:schema" one. 
This was the case that I ran into with GDAL, where I have a file with both 
keys, but where the custom "geo" key is not also included in the serialized 
arrow schema in the "ARROW:schema" key:
{code:python}
# includes both keys in the Parquet file
>>> pq.read_metadata("test_gdal.parquet").metadata
{b'geo': b'{"version":"0.1.0","...',
 b'ARROW:schema': b'/3gBAAAQ...'}
# the "geo" key is lost in the Arrow schema
>>> pq.read_table("test_gdal.parquet").schema.metadata is None
True
{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16337) [Python] Expose parameter that determines to store Arrow schema in Parquet metadata in Python

2022-04-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16337:
-

 Summary: [Python] Expose parameter that determines to store Arrow 
schema in Parquet metadata in Python
 Key: ARROW-16337
 URL: https://issues.apache.org/jira/browse/ARROW-16337
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 9.0.0


There is a {{store_schema}} flag that determines whether we store the Arrow 
schema in the Parquet metadata (under the {{ARROW:schema}} key) or not. This is 
exposed in the C++, but not in the Python interface. It would be good to also 
expose this in the Python layer, to more easily experiment with this (eg to 
check the impact of having the schema available or not when reading a file)
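
As a rough sketch of the intended usage once this is exposed (the keyword name follows the C++ option; treat it as illustrative until the Python API lands):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3]})

# default: the serialized Arrow schema is stored under the ARROW:schema key
pq.write_table(table, "with_schema.parquet", store_schema=True)

# opt out, to check the impact of not having the schema available when reading
pq.write_table(table, "without_schema.parquet", store_schema=False)
{code}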



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16336) [Python] Hide internal (common_)metadata related warnings from the user (ParquetDataset)

2022-04-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16336:
-

 Summary: [Python] Hide internal (common_)metadata related warnings 
from the user (ParquetDataset)
 Key: ARROW-16336
 URL: https://issues.apache.org/jira/browse/ARROW-16336
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


Small follow-up on ARROW-16121, we missed a few cases where we are internally 
using those attributes (in the {{equals}} method)
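
One way those internal usages could avoid surfacing the warning (a minimal sketch of the idea, not the actual ParquetDataset code):

{code:python}
import warnings

def _get_deprecated_attr(obj, name):
    # read a deprecated attribute internally without the warning reaching the user
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        warnings.simplefilter("ignore", FutureWarning)
        return getattr(obj, name)
{code}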



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16262) [CI] Kartothek nightly integration build is failing because of Parquet statistics date change

2022-04-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16262:
-

 Summary: [CI] Kartothek nightly integration build is failing 
because of Parquet statistics date change
 Key: ARROW-16262
 URL: https://issues.apache.org/jira/browse/ARROW-16262
 Project: Apache Arrow
  Issue Type: Test
  Components: Continuous Integration, Python
Reporter: Joris Van den Bossche


Caused by ARROW-7350, see discussion at 
https://github.com/apache/arrow/pull/12902#issuecomment-1102750381

Upstream issue at https://github.com/JDASoftwareGroup/kartothek/issues/515

In the short term, we should also fix our nightly builds (either by temporarily 
disabling them altogether, or ideally by skipping those failing tests)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16231) [C++][Python] IPC failure for dictionary with extension type with struct storage type

2022-04-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16231:
-

 Summary: [C++][Python] IPC failure for dictionary with extension 
type with struct storage type
 Key: ARROW-16231
 URL: https://issues.apache.org/jira/browse/ARROW-16231
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Joris Van den Bossche


Report from [https://github.com/apache/arrow/issues/12899]

Roundtripping through IPC/Feather using a dictionary type where the dictionary 
is an extension type with a nested storage type fails. Writing seems to work 
(but no idea if the written file is "correct", as trying to read the schema 
gives an error), but reading it back fails with {_}"ArrowInvalid: Ran out of 
field metadata, likely malformed"{_}.

The original use case was from a pandas extension type (the pandas interval 
dtype is mapped to an arrow extension type with a struct type as storage, and 
in this case this interval type was further wrapped in a categorical 
(dictionary) type). A pandas-based test that reproduces this can be added like 
this in {{test_feather.py}}:
{code:python}
@pytest.mark.pandas
def test_dictionary_interval():
    df = pd.DataFrame({'a': pd.cut(range(1, 10, 3), [-1, 5, 10])})
    _check_pandas_roundtrip(df, version=2)
{code}
this gives:
{code:java}
$ pytest python/pyarrow/tests/test_feather.py::test_dictionary_interval

= FAILURES =
 test_dictionary_interval ___

pyarrow/_feather.pyx:88: in pyarrow._feather.FeatherReader.read

E   pyarrow.lib.ArrowInvalid: Ran out of field metadata, likely malformed
E   ../src/arrow/ipc/reader.cc:266  GetFieldMetadata(field_index_++, out_)
E   ../src/arrow/ipc/reader.cc:283  LoadCommon(type_id)
E   ../src/arrow/ipc/reader.cc:324  Load(child_fields[i].get(), 
parent->child_data[i].get())
E   ../src/arrow/ipc/reader.cc:529  loader.Load(, column.get())
E   ../src/arrow/ipc/reader.cc:1188  ReadRecordBatchInternal( 
*message->metadata(), schema_, field_inclusion_mask_, context, reader.get())
E   ../src/arrow/ipc/feather.cc:730  reader->ReadRecordBatch(i)

pyarrow/error.pxi:100: ArrowInvalid
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16204) [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores "part-{i}.ext" files

2022-04-15 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16204:
-

 Summary: [C++][Dataset] Default error existing_data_behaviour for 
writing dataset ignores "part-{i}.ext" files 
 Key: ARROW-16204
 URL: https://issues.apache.org/jira/browse/ARROW-16204
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


While trying to understand a failing test in 
https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed 
that the {{write_dataset}} function does not actually always raise an error by 
default if there is already existing data in the target location.

The documentation says it will raise "if any data exists in the destination" 
(which is also what I would expect), but in practice it seems that it does 
ignore certain file names:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})

# write a first time to new directory: OK
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write a second time to the same directory: passes, but should raise?
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write another time to the same directory with a different name: still passes
>>> ds.write_dataset(table, "test_overwrite", format="parquet", basename_template="data-{i}.parquet")
>>> !ls test_overwrite
data-0.parquet  part-0.parquet

# now writing again finally raises an error
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
...
ArrowInvalid: Could not write to test_overwrite as the directory is not empty 
and existing_data_behavior is to error
{code}

So it seems that, when checking whether existing data is present, any files 
that match the basename template pattern are ignored.

cc [~westonpace] do you know if this was intentional? (I would find that a 
strange corner case, and in any case it is also not documented)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16140) [Python] zoneinfo timezones failing during type inference

2022-04-07 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16140:
-

 Summary: [Python] zoneinfo timezones failing during type inference
 Key: ARROW-16140
 URL: https://issues.apache.org/jira/browse/ARROW-16140
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


The conversion itself works fine (eg when specifying {{type=pa.timestamp("us", 
tz="America/New_York")}} in the below example), but inferring the type and 
timezone from the first value fails if it has a zoneinfo timezone:

{code}
In [51]: import datetime, zoneinfo

In [52]: import pyarrow as pa

In [53]: tz = zoneinfo.ZoneInfo(key='America/New_York')

In [54]: dt = datetime.datetime(2013, 11, 3, 10, 3, 14, tzinfo=tz)

In [55]: pa.array([dt])

ArrowInvalid: Object returned by tzinfo.utcoffset(None) is not an instance of 
datetime.timedelta
{code}

cc [~alenkaf]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16123) [Python] Do not include __init__ in the API documentation

2022-04-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16123:
-

 Summary: [Python] Do not include __init__ in the API documentation
 Key: ARROW-16123
 URL: https://issues.apache.org/jira/browse/ARROW-16123
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Reporter: Joris Van den Bossche


From https://github.com/apache/arrow/pull/12698#discussion_r836484176

We should try to instruct sphinx/autodoc/numpydoc to not include 
{{\_\_init\_\_}} functions in the reference docs, as I don't think we have any 
case where this adds value (compared to the class docstring). See eg 
https://arrow.apache.org/docs/dev/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset.__init__
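
One possible approach, assuming the reference docs go through sphinx autodoc (a sketch, not necessarily the change that will be made):

{code:python}
# conf.py
def skip_init(app, what, name, obj, skip, options):
    # never document __init__ separately; the class docstring covers construction
    if name == "__init__":
        return True
    return skip

def setup(app):
    app.connect("autodoc-skip-member", skip_init)
{code}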

cc [~alenkaf]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16122) [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

2022-04-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16122:
-

 Summary: [Python] Deprecate no-longer supported keywords in 
parquet.write_to_dataset
 Key: ARROW-16122
 URL: https://issues.apache.org/jira/browse/ARROW-16122
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


Currently, the {{pq.write_to_dataset}} function also has a 
{{use_legacy_dataset}} keyword, but we should:

1) in case of {{use_legacy_dataset=True}}, ensure we raise deprecation warnings 
for all keywords that won't be supported in the new implementation (eg 
{{partition_filename_cb}})
2) raise a deprecation warning for {{use_legacy_dataset=True}}, and/or already 
switch the default?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16121) [Python] Deprecate the (common_)metadata(_path) attributes of ParquetDataset

2022-04-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16121:
-

 Summary: [Python] Deprecate the (common_)metadata(_path) 
attributes of ParquetDataset
 Key: ARROW-16121
 URL: https://issues.apache.org/jira/browse/ARROW-16121
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


The custom python ParquetDataset implementation exposes the {{metadata}}, 
{{metadata_path}}, {{common_metadata}} and {{common_metadata_path}} attributes, 
something for which we didn't add an equivalent to the new dataset API. 

Unless we still want to add something for this, we should deprecate those 
attributes in the legacy ParquetDataset. 

In addition, we should also deprecate passing the {{metadata}} keyword in the 
ParquetDataset constructor. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16120) [Python] ParquetDataset deprecation: change Deprecation to FutureWarnings

2022-04-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16120:
-

 Summary: [Python] ParquetDataset deprecation: change Deprecation 
to FutureWarnings
 Key: ARROW-16120
 URL: https://issues.apache.org/jira/browse/ARROW-16120
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


We are currently using DeprecationWarning for the deprecations, but now that 
they have been in place for some time, we can change this to the more 
user-visible FutureWarning.
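
The change itself is mechanical; at each deprecated call site it amounts to something like this (the message text is illustrative):

{code:python}
import warnings

warnings.warn(
    "'use_legacy_dataset' is deprecated and will be removed in a future version.",
    FutureWarning,  # previously DeprecationWarning, which most users never see
    stacklevel=2,
)
{code}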



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16119) [Python] Deprecate the legacy ParquetDataset custom python-based implementation

2022-04-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16119:
-

 Summary: [Python] Deprecate the legacy ParquetDataset custom 
python-based implementation
 Key: ARROW-16119
 URL: https://issues.apache.org/jira/browse/ARROW-16119
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Joris Van den Bossche


To be able to remove the custom python implementation (ARROW-15868), we first 
need to deprecate the various aspects. 

This issue is meant as a parent issue to keep an overview of the different 
tasks.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16113) [Python] Partitioning.dictionaries in case of a subset of fields are dictionary encoded

2022-04-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16113:
-

 Summary: [Python] Partitioning.dictionaries in case of a subset of 
fields are dictionary encoded
 Key: ARROW-16113
 URL: https://issues.apache.org/jira/browse/ARROW-16113
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-14612, see the discussion at 
https://github.com/apache/arrow/pull/12530#discussion_r841760449

ARROW-14612 changes the return value of the {{dictionaries}} attribute from 
None to a list in case some of the partitioning schema fields are not 
dictionary encoded. 

But this can result in an unclear mapping between arrays in 
{{Partitioning.dictionaries}} and fields in {{Partitioning.schema}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16107) [CI][Archery] Fix archery crossbow query to get latest prefix

2022-04-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16107:
-

 Summary: [CI][Archery] Fix archery crossbow query to get latest 
prefix
 Key: ARROW-16107
 URL: https://issues.apache.org/jira/browse/ARROW-16107
 Project: Apache Arrow
  Issue Type: Test
  Components: Continuous Integration, Developer Tools
Reporter: Joris Van den Bossche


This feature stopped working when the crossbow builds were split into 3 parts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16018) [Doc][Python] Run doctests on Python docstring examples

2022-03-24 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16018:
-

 Summary: [Doc][Python] Run doctests on Python docstring examples
 Key: ARROW-16018
 URL: https://issues.apache.org/jira/browse/ARROW-16018
 Project: Apache Arrow
  Issue Type: Test
  Components: Documentation, Python
Reporter: Joris Van den Bossche


We are starting to add more and more examples to the docstrings of Python 
methods (ARROW-15367), so we could use the doctest functionality to ensure that 
those examples are actually correct (and stay correct).

Pytest has integration for doctests 
(https://docs.pytest.org/en/6.2.x/doctest.html), and so you can do:

{code}
pytest python/pyarrow --doctest-modules
{code}

This currently fails for me because I don't have pyarrow.cuda installed, so we 
will need to find a way to automatically skip those parts if they are not 
available (see the sketch below).
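
A possible sketch for skipping optional modules during doctest collection (the file names are illustrative; this is not Arrow's actual test configuration):

{code:python}
# conftest.py
collect_ignore = []

try:
    import pyarrow.cuda  # noqa: F401
except ImportError:
    # don't collect doctests from modules whose optional dependencies are missing
    collect_ignore.append("pyarrow/cuda.py")
{code}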



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15997) [CI] Nightly turbodbc build is failing (C++ compilation error)

2022-03-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15997:
-

 Summary: [CI] Nightly turbodbc build is failing (C++ compilation 
error)
 Key: ARROW-15997
 URL: https://issues.apache.org/jira/browse/ARROW-15997
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Joris Van den Bossche


See eg 
https://github.com/ursacomputing/crossbow/runs/5637809188?check_suite_focus=true

The error seems related to boost (and not Arrow), and happens in the C++ code 
of turbodbc. But it is strange that it happens in both the latest and master 
turbodbc build (so it's not caused by a change on turbodbc's side). And I also 
didn't see a change in the boost version compared to the last successful build.

cc [~uwe]

{code}
 [102/156] Building CXX object 
cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o
FAILED: 
cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o
 
/opt/conda/envs/arrow/bin/x86_64-conda-linux-gnu-c++  
-I/turbodbc/cpp/turbodbc/Library -I/turbodbc/cpp/turbodbc/../cpp_odbc/Library 
-I/turbodbc/cpp/turbodbc/Test -fvisibility-inlines-hidden -std=c++17 
-fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC 
-fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem 
/opt/conda/envs/arrow/include -Wall -Wextra -g -O0 -pedantic   -std=c++11 -MD 
-MT 
cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o
 -MF 
cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o.d
 -o 
cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o
 -c /turbodbc/cpp/turbodbc/Test/tests/field_translator_test.cpp
In file included from 
/opt/conda/envs/arrow/include/boost/type_index/stl_type_index.hpp:32,
 from /opt/conda/envs/arrow/include/boost/type_index.hpp:29,
 from 
/opt/conda/envs/arrow/include/boost/variant/variant.hpp:21,
 from /turbodbc/cpp/turbodbc/Library/turbodbc/field.h:3,
 from 
/turbodbc/cpp/turbodbc/Library/turbodbc/field_translator.h:3,
 from 
/turbodbc/cpp/turbodbc/Test/tests/field_translator_test.cpp:1:
/opt/conda/envs/arrow/include/boost/optional/optional.hpp: In instantiation of 
'std::basic_ostream<_CharT, _Traits>& 
boost::operator<<(std::basic_ostream<_CharT, _Traits>&, const 
boost::optional_detail::optional_tag&) [with CharType = char; CharTrait = 
std::char_traits]':
/opt/conda/envs/arrow/include/gtest/gtest-printers.h:215:9:   required from 
'static void 
testing::internal::internal_stream_operator_without_lexical_name_lookup::StreamPrinter::PrintValue(const
 T&, std::ostream*) [with T = boost::optional, std::allocator 
>, bool, double, boost::gregorian::date, boost::posix_time::ptime> >; 
 = void;  = 
std::basic_ostream&; std::ostream = std::basic_ostream]'
/opt/conda/envs/arrow/include/gtest/gtest-printers.h:312:22:   required from 
'void testing::internal::PrintWithFallback(const T&, std::ostream*) [with T = 
boost::optional, std::allocator >, bool, double, 
boost::gregorian::date, boost::posix_time::ptime> >; std::ostream = 
std::basic_ostream]'
/opt/conda/envs/arrow/include/gtest/gtest-printers.h:441:30:   required from 
'void testing::internal::PrintTo(const T&, std::ostream*) [with T = 
boost::optional, std::allocator >, bool, double, 
boost::gregorian::date, boost::posix_time::ptime> >; std::ostream = 
std::basic_ostream]'
/opt/conda/envs/arrow/include/gtest/gtest-printers.h:691:12:   required from 
'static void testing::internal::UniversalPrinter::Print(const T&, 
std::ostream*) [with T = boost::optional, std::allocator 
>, bool, double, boost::gregorian::date, boost::posix_time::ptime> >; 
std::ostream = std::basic_ostream]'
/opt/conda/envs/arrow/include/gtest/gtest-printers.h:980:30:   required from 
'void testing::internal::UniversalPrint(const T&, std::ostream*) [with T = 
boost::optional, std::allocator >, bool, double, 
boost::gregorian::date, boost::posix_time::ptime> >; std::ostream = 
std::basic_ostream]'
/opt/conda/envs/arrow/include/gtest/gtest-printers.h:865:19:   [ skipping 2 
instantiation contexts, use -ftemplate-backtrace-limit=0 to disable ]
/opt/conda/envs/arrow/include/gtest/gtest-printers.h:334:36:   required from 
'static std::string testing::internal::FormatForComparison::Format(const ToPrint&) [with ToPrint = 
boost::optional, std::allocator >, bool, double, 
boost::gregorian::date, boost::posix_time::ptime> >; OtherOperand = 
boost::optional, std::allocator >, bool, double, 
boost::gregorian::date, boost::posix_time::ptime> >; std::string = 
std::__cxx11::basic_string]'
/opt/conda/envs/arrow/include/gtest/gtest-printers.h:415:45:   required from 
'std::string testing::internal::FormatForComparisonFailureMessage(const T1&, 
const T2&) [with T1 = boost::optional, std::allocator 
>, bool, 

[jira] [Created] (ARROW-15960) [Python] Segfault constructing a fixed size list array of size 0 with dictionary values

2022-03-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15960:
-

 Summary: [Python] Segfault constructing a fixed size list array of 
size 0 with dictionary values
 Key: ARROW-15960
 URL: https://issues.apache.org/jira/browse/ARROW-15960
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


The following example constructing a FixedSizeList array with list size 0 and 
dictionary values from an explicit None value (extracted from a segfaulting 
hypothesis test) crashes:

{code}
import pyarrow as pa

pa.array([None], pa.list_(pa.dictionary(pa.int32(), pa.string()), 0))
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15884) [C++][Doc] Document that the strptime kernel ignores %Z

2022-03-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15884:
-

 Summary: [C++][Doc] Document that the strptime kernel ignores %Z
 Key: ARROW-15884
 URL: https://issues.apache.org/jira/browse/ARROW-15884
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Joris Van den Bossche


After ARROW-12820, the {{strptime}} kernel still ignores the {{%Z}} specifier 
(for timezone names), and when it is used, any string in that position is 
effectively ignored.

For example:

{code:python}
>>> import pyarrow.compute as pc

# the %z specifier now works (after ARROW-12820)
>>> pc.strptime(["2022-03-05 09:00:00+01"], format="%Y-%m-%d %H:%M:%S%z", unit="us")

[
  2022-03-05 08:00:00.00
]

# in theory this should give the same result, but %Z is still ignored
>>> pc.strptime(["2022-03-05 09:00:00 CET"], format="%Y-%m-%d %H:%M:%S %Z", unit="us")

[
  2022-03-05 09:00:00.00
]

# as a result any garbage in the string is also ignored
>>> pc.strptime(["2022-03-05 09:00:00 blabla"], format="%Y-%m-%d %H:%M:%S %Z", 
>>> unit="us")

[
  2022-03-05 09:00:00.00
]
{code}

I don't think it is easy to actually fix this (at least as long as we use the 
system strptime, see also 
https://github.com/apache/arrow/pull/11358#issue-1020404727). But at least we 
should document this limitation / gotcha.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15883) [C++] Support for fractional seconds in strptime() for ISO format?

2022-03-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15883:
-

 Summary: [C++] Support for fractional seconds in strptime() for 
ISO format?
 Key: ARROW-15883
 URL: https://issues.apache.org/jira/browse/ARROW-15883
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently, we can't parse "our own" string representation of a timestamp array 
with the timestamp parser {{strptime}}:

{code:python}
import datetime
import pyarrow as pa
import pyarrow.compute as pc

>>> pa.array([datetime.datetime(2022, 3, 5, 9)])

[
  2022-03-05 09:00:00.00
]

# trying to parse the above representation as string
>>> pc.strptime(["2022-03-05 09:00:00.00"], format="%Y-%m-%d %H:%M:%S", 
>>> unit="us")
...
ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.00' as a scalar 
of type timestamp[us]
{code}

The reason for this is the fractional second part, so the following works:

{code:python}
>>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")

[
  2022-03-05 09:00:00.00
]
{code}

Now, I think the reason this fails is that {{strptime}} only supports parsing 
seconds as an integer 
(https://man7.org/linux/man-pages/man3/strptime.3.html).

But, it creates a strange situation where the timestamp parser cannot parse the 
representation we use for timestamps.

In addition, for CSV we have a custom ISO parser (used by default), so when 
parsing the strings while reading a CSV file, the same string with fractional 
seconds does work:

{code:python}
s = b"""a
2022-03-05 09:00:00.00"""

import io
from pyarrow import csv

>>> csv.read_csv(io.BytesIO(s))
pyarrow.Table
a: timestamp[ns]

a: [[2022-03-05 09:00:00.0]]
{code}

cc [~apitrou] [~rokm]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15882) [CI][Python] Nightly hypothesis build is not actually running the hypothesis tests

2022-03-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15882:
-

 Summary: [CI][Python] Nightly hypothesis build is not actually 
running the hypothesis tests
 Key: ARROW-15882
 URL: https://issues.apache.org/jira/browse/ARROW-15882
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15871) [Python] Start raising deprecation warnings for ParquetDataset keywords that won't be supported with the new API

2022-03-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15871:
-

 Summary: [Python] Start raising deprecation warnings for 
ParquetDataset keywords that won't be supported with the new API
 Key: ARROW-15871
 URL: https://issues.apache.org/jira/browse/ARROW-15871
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


Currently, the {{ParquetDataset}} API itself still defaults to the legacy 
implementation ({{parquet.read_table}} already defaults to the new one) and 
also still supports some keywords that won't be supported with the new 
implementation. 

So if we want to remove the old implementation at some point (ARROW-15868), we 
should start deprecating those options, and also start defaulting to the new 
implementation when possible.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15870) [Python] Start to raise deprecation warnings when using use_legacy_dataset=True in parquet.py

2022-03-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15870:
-

 Summary: [Python] Start to raise deprecation warnings when using 
use_legacy_dataset=True in parquet.py
 Key: ARROW-15870
 URL: https://issues.apache.org/jira/browse/ARROW-15870
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


Currently, users can still specify {{use_legacy_dataset=True}} explicitly to 
get the old implementation/behaviour. But if we want to remove that 
implementation at some point (ARROW-15868), we should start deprecating that 
option, to further nudge people towards the new implementation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15868) [Python] Remove the legacy ParquetDataset custom python-based implementation

2022-03-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15868:
-

 Summary: [Python] Remove the legacy ParquetDataset custom 
python-based implementation
 Key: ARROW-15868
 URL: https://issues.apache.org/jira/browse/ARROW-15868
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Joris Van den Bossche


We might want to keep the actual {{ParquetDataset}} class (ARROW-9720), but we 
should still remove the custom / legacy implementation (which is using the 
deprecated filesystem interface, so this is also blocking ARROW-15761)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15867) [Python] Ignored exception printed when pandas is not installed

2022-03-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15867:
-

 Summary: [Python] Ignored exception printed when pandas is not 
installed
 Key: ARROW-15867
 URL: https://issues.apache.org/jira/browse/ARROW-15867
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


When you don't have pandas installed, you can get an "error" like

{code}
Exception ignored in: 'pyarrow.lib._PandasAPIShim._have_pandas_internal'
Traceback (most recent call last):
  File "pyarrow/pandas-shim.pxi", line 110, in 
pyarrow.lib._PandasAPIShim._check_import
  File "pyarrow/pandas-shim.pxi", line 59, in 
pyarrow.lib._PandasAPIShim._import_pandas
AttributeError: module 'pandas' has no attribute '__version__'
{code}

This is not an actual error that interrupts your Python session (it's an 
ignored exception), but we should of course still make sure it is not printed.
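
A hypothetical sketch of a more defensive check (not the actual pandas-shim code):

{code:python}
def _import_pandas_safely():
    try:
        import pandas as pd
    except ImportError:
        return None
    if getattr(pd, "__version__", None) is None:
        # something importable as 'pandas' exists, but it is not the real package
        return None
    return pd
{code}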





--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15847) [Python] Building with Parquet but without Parquet encryption fails

2022-03-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15847:
-

 Summary: [Python] Building with Parquet but without Parquet 
encryption fails
 Key: ARROW-15847
 URL: https://issues.apache.org/jira/browse/ARROW-15847
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Locally (with Parquet enabled, but no Parquet encryption, both on C++ and 
Python level), I get:

{code}
CMake Error at CMakeLists.txt:643 (target_link_libraries):
  Cannot specify link libraries for target "_parquet_encryption" which is not
  built by this project.


-- Configuring incomplete, errors occurred!
{code}

(also after cleaning up old build files)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15761) [Python] Remove the deprecated pyarrow.filesystem legacy implementations

2022-02-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15761:
-

 Summary: [Python] Remove the deprecated pyarrow.filesystem legacy 
implementations
 Key: ARROW-15761
 URL: https://issues.apache.org/jira/browse/ARROW-15761
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0


The {{pyarrow.filesystem}} and {{pyarrow.hdfs}} filesystems have been 
deprecated since 2.0.0, and the warning was changed from DeprecationWarning to 
FutureWarning in 4.0.0. I think it is time to actually remove them, and I would 
propose to do so in 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15760) [C++] Avoid hard dependency on git in cmake (download tarballs from github instead)

2022-02-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15760:
-

 Summary: [C++] Avoid hard dependency on git in cmake (download 
tarballs from github instead)
 Key: ARROW-15760
 URL: https://issues.apache.org/jira/browse/ARROW-15760
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


See https://github.com/apache/arrow/pull/12322#issuecomment-1048523391



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15720) [CI] Nightly dask build is failing due to wrong usage of Array.to_pandas

2022-02-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15720:
-

 Summary: [CI] Nightly dask build is failing due to wrong usage of 
Array.to_pandas
 Key: ARROW-15720
 URL: https://issues.apache.org/jira/browse/ARROW-15720
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Joris Van den Bossche


This failure is triggered by a change in Arrow (addition of {{types_mapper}} 
keyword to {{pa.Array.to_pandas}}), but the cause is a wrong usage of that in 
dask.

I already fixed that on the dask side: https://github.com/dask/dask/pull/8733

But we should still skip the test on our side (will be needed until that PR is 
merged + released)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15711) [C++][Parquet] Extension types with nanosecond timestamp resolution don't roundtrip

2022-02-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15711:
-

 Summary: [C++][Parquet] Extension types with nanosecond timestamp 
resolution don't roundtrip
 Key: ARROW-15711
 URL: https://issues.apache.org/jira/browse/ARROW-15711
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Parquet
Reporter: Joris Van den Bossche


Example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

class MyTimestampType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.timestamp("ns"))

    def __reduce__(self):
        return MyTimestampType, ()


arr = MyTimestampType().wrap_array(pa.array([1000, 2000, 3000], pa.timestamp("ns")))
table = pa.table({"col": arr})
{code}

{code}
>>> table.schema
col: extension>

>>> pq.write_table(table, "test_parquet_extension_type_timestamp_ns.parquet")
>>> result = pq.read_table("test_parquet_extension_type_timestamp_ns.parquet")
>>> result.schema
col: timestamp[us]
{code}

The reason is that we only restore the extension type if the inferred storage 
type (inferred from Parquet, after applying any updates based on the Arrow 
schema) exactly equals the original storage type (as stored in the Arrow 
schema):

https://github.com/apache/arrow/blob/afaa92e7e4289d6e4f302cc91810368794e8092b/cpp/src/parquet/arrow/schema.cc#L973-L977

And, with the default options, a timestamp with nanosecond resolution gets 
stored as microsecond resolution in Parquet, and that is something we do not 
restore when updating the read types based on the stored Arrow schema (eg we do 
add a timezone, but we don't change the resolution).

An additional issue is that _if_ you lose the extension type, the field 
metadata about the extension type is also lost. I think that if we cannot 
restore the extension type, we should at least try to keep the ARROW:extension 
field metadata as information.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15652) [C++] GDB plugin printer gives error with extension type

2022-02-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15652:
-

 Summary: [C++] GDB plugin printer gives error with extension type
 Key: ARROW-15652
 URL: https://issues.apache.org/jira/browse/ARROW-15652
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Copying the example from ARROW-9078

{code}
import pyarrow as pa
import pyarrow.parquet as pq


class MyStructType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(
            self, pa.struct([('left', pa.int64()), ('right', pa.int64())]))

    def __reduce__(self):
        return MyStructType, ()


struct_array = pa.StructArray.from_arrays(
    [
        pa.array([0, 1], type="int64", from_pandas=True),
        pa.array([1, 2], type="int64", from_pandas=True),
    ],
    names=["left", "right"],
)

mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)
table = pa.table({'a': mystruct_array})
pq.write_table(table, "test_struct.parquet")
{code}

What I was doing is then reading the table back in, with a breakpoint at 
{{ApplyOriginalMetadata}}. But I suppose any other way to get into the debugger 
is fine as well (and maybe also with a simpler extension type, i.e. not with a 
struct type as storage type, I didn't yet try that).

This gives:

{code}
(gdb) p origin_field
$3 = (const arrow::Field &) @0x555bbb308190: Python Exception  A syntax error in expression, near `) 
(0x555bbb277020)).ToString()'.: 
arrow::field("a", )
{code}

for the field/type being extension type

cc [~apitrou]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15643) [C++] Kernel to select subset of fields of a StructArray

2022-02-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15643:
-

 Summary: [C++] Kernel to select subset of fields of a StructArray
 Key: ARROW-15643
 URL: https://issues.apache.org/jira/browse/ARROW-15643
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Triggered by 
https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure.
 I thought there was already an issue about this, but I can't directly find one.

Assume you have a struct array with some fields:

{code}
>>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
>>> arr.type
StructType(struct)
{code}

We have a kernel to select a single child field:

{code}
>>> pc.struct_field(arr, [0])

[
  1,
  2,
  3
]
{code}

But if you want to subset the StructArray to some of its fields, resulting in a 
new StructArray, that's not possible with {{struct_field}}, and doing this 
manually is a bit cumbersome:

{code}
>>> fields = ['a', 'c']
>>> arrays = [arr.field(n) for n in fields]
>>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
>>> arr_subset.type
StructType(struct)
{code}

(this is still OK, but if you had a ChunkedArray, it certainly gets annoying; see the sketch below)
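
For example, a hypothetical helper (not an Arrow API) to do this per chunk could look like:

{code:python}
import pyarrow as pa

def select_struct_fields(chunked, names):
    # subset the struct fields of each chunk and stitch the result back together
    new_chunks = []
    for chunk in chunked.chunks:
        arrays = [chunk.field(name) for name in names]
        new_chunks.append(pa.StructArray.from_arrays(arrays, names=names))
    return pa.chunked_array(new_chunks)

chunked = pa.chunked_array([pa.StructArray.from_arrays([[1, 2, 3]] * 3, names=['a', 'b', 'c'])])
subset = select_struct_fields(chunked, ['a', 'c'])
{code}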





--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15601) [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs

2022-02-07 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15601:
-

 Summary: [Docs][Release] Update post release script to move stable 
docs to versioned + keep dev docs
 Key: ARROW-15601
 URL: https://issues.apache.org/jira/browse/ARROW-15601
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Documentation
Reporter: Joris Van den Bossche
 Fix For: 8.0.0, 7.0.1


xref https://github.com/apache/arrow-site/pull/187

We need to update the {{post-09-docs.sh}} script to keep the dev docs and to 
move the current stable docs to a versioned sub-directory



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15564) [C++] Expose MergeOptions in Concatenate to unify types

2022-02-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15564:
-

 Summary: [C++] Expose MergeOptions in Concatenate to unify types
 Key: ARROW-15564
 URL: https://issues.apache.org/jira/browse/ARROW-15564
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


The {{arrow::ConcatenateTables}} function exposes the {{Field::MergeOptions}} 
as a way to indicate how fields with different types should be merged 
("unified" / "common type").

The version to concatenate arrays ({{arrow::Concatenate}}) currently requires 
all arrays to have the same type. We could add a MergeOptions option here as well?

(this depends on ARROW-14705 to make this option more useful, currently it only 
handles null -> any upcasts, I think)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15552) [Docs][Format] Unclear wording about base64 encoding requirement of metadata values

2022-02-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15552:
-

 Summary: [Docs][Format] Unclear wording about base64 encoding 
requirement of metadata values
 Key: ARROW-15552
 URL: https://issues.apache.org/jira/browse/ARROW-15552
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Format
Reporter: Joris Van den Bossche


The C Data Interface docs indicate that the values in key-value metadata should 
be base64 encoded, which is mentioned in the section about which key-value 
metadata to use for extension types 
(https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays):

bq. The base64 encoding of metadata values ensures that any possible 
serialization is representable.

This might not be fully correct, though (or at least the encoding is not 
required, while the current wording implies it is). A binary blob (like a 
serialized schema) can be base64-encoded, as we do when putting the Arrow 
schema in the Parquet metadata, but is that actually required?
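
For concreteness, this is the kind of encoding being discussed (a sketch; base64 makes the binary blob safe to embed as a metadata value, but whether it is actually required is the question raised above):

{code:python}
import base64
import pyarrow as pa

schema = pa.schema([("a", pa.int64())])

# the serialized schema is an arbitrary binary blob ...
blob = schema.serialize().to_pybytes()
# ... which we base64-encode when storing it under the ARROW:schema key in Parquet
encoded = base64.b64encode(blob)
{code}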

cc [~apitrou]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15548) [C++][Parquet] Field-level metadata are not supported? (ColumnMetadata.key_value_metadata)

2022-02-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15548:
-

 Summary: [C++][Parquet] Field-level metadata are not supported? 
(ColumnMetadata.key_value_metadata)
 Key: ARROW-15548
 URL: https://issues.apache.org/jira/browse/ARROW-15548
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Parquet
Reporter: Joris Van den Bossche


For an application where we are considering using field-level metadata (so 
not schema-level metadata) and also want to be able to save this data to 
Parquet, I was looking into "field-level metadata" for Parquet, which I assumed 
we supported. 

We can roundtrip Arrow's field-level metadata to/from Parquet, as shown with 
this example:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field("column_name", pa.int64(), metadata={"key": "value"})])
table = pa.table({'column_name': [0, 1, 2]}, schema=schema)
pq.write_table(table, "test_field_metadata.parquet")

>>> pq.read_table("test_field_metadata.parquet").schema
column_name: int64
  -- field metadata --
  key: 'value'
{code}

However, the reason this is restored is that the field metadata is part of the 
Arrow schema that we (by default) store under the {{ARROW:schema}} key in the 
Parquet FileMetaData.key_value_metadata.

With a small patched version to be able to turn this off (currently this is 
hardcoded to be turned on in the python bindings), it is clear this field-level 
metadata is not restored on roundtrip without this stored arrow schema:

{code:python}
pq.write_table(table, "test_field_metadata_without_schema.parquet", 
store_arrow_schema=False)

>>> pq.read_table("test_field_metadata_without_schema.parquet").schema
column_name: int64
{code}

So there is currently no mapping from Arrow's field level metadata to Parquet's 
column-level metadata ({{ColumnMetaData.key_value_metadata}} in Parquet's 
thrift structures). 

(which also means that roundtripping field-level metadata through Parquet only 
works as long as you use Arrow for both writing and reading, not if you want 
to exchange such data with non-Arrow Parquet implementations)

In addition, it also seems we don't even expose this field in our C++ or Python 
bindings, so you can't even access that data if you have a Parquet file 
(written by another implementation) that has key_value_metadata in the 
ColumnMetaData.

cc [~emkornfield] 






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15545) [C++] Cast dictionary of extension type to extension type

2022-02-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15545:
-

 Summary: [C++] Cast dictionary of extension type to extension type
 Key: ARROW-15545
 URL: https://issues.apache.org/jira/browse/ARROW-15545
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


We support casting a DictionaryArray to its dictionary values' type. For 
example:

{code}
>>> arr = pa.array([1, 2, 1]).dictionary_encode()
>>> arr


-- dictionary:
  [
1,
2
  ]
-- indices:
  [
0,
1,
0
  ]

>>> arr.type
DictionaryType(dictionary)
>>> arr.cast(arr.type.value_type)

[
  1,
  2,
  1
]
{code}

However, if the type of the dictionary values is an ExtensionType, this cast is 
not supported:

{code}
>>> from pyarrow.tests.test_extension_type import UuidType
>>> storage = pa.array([b"0123456789abcdef"], type=pa.binary(16))
>>> arr = pa.ExtensionArray.from_storage(UuidType(), storage)
>>> arr

[
  30313233343536373839616263646566
]
>>> dict_arr = pa.DictionaryArray.from_arrays(pa.array([0, 0], pa.int32()), arr)
>>> dict_arr.type
DictionaryType(dictionary>, 
indices=int32, ordered=0>)
>>> dict_arr.cast(UuidType())
...
ArrowNotImplementedError: Unsupported cast from 
dictionary>, indices=int32, 
ordered=0> to extension> (no available cast 
function for target type)
../src/arrow/compute/cast.cc:119  
GetCastFunctionInternal(cast_options->to_type, args[0].type().get())

{code}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15479) [C++] Cast to fixed size list with different field name

2022-01-27 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15479:
-

 Summary: [C++] Cast to fixed size list with different field name
 Key: ARROW-15479
 URL: https://issues.apache.org/jira/browse/ARROW-15479
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Casting a FixedSizeListArray to a compatible type with only a different field 
name isn't implemented:

{code:python}
>>> my_type = pa.list_(pa.field("element", pa.int64()), 2)
>>> arr = pa.FixedSizeListArray.from_arrays(pa.array([1, 2, 3, 4, 5, 6]), 2)
>>> arr.type
FixedSizeListType(fixed_size_list[2])
>>> my_type
FixedSizeListType(fixed_size_list[2])

>>> arr.cast(my_type)
...
ArrowNotImplementedError: Unsupported cast from fixed_size_list[2] 
to fixed_size_list using function cast_fixed_size_list
{code}

While the similar operation with a variable sized list actually works:

{code:python}
>>> my_type = pa.list_(pa.field("element", pa.int64()))
>>> arr = pa.array([[1, 2], [3, 4]], pa.list_(pa.int64()))
>>> arr.type
ListType(list)
>>> arr.cast(my_type).type
ListType(list)
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15478) [C++] Creating (or casting to) list array with non-nullable field doesn't check nulls

2022-01-27 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15478:
-

 Summary: [C++] Creating (or casting to) list array with 
non-nullable field doesn't check nulls
 Key: ARROW-15478
 URL: https://issues.apache.org/jira/browse/ARROW-15478
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


When creating a ListArray where you indicate that the values field is not 
nullable, you can actually create the array with nulls without this being 
validated:

{code:python}
>>> typ = pa.list_(pa.field("element", pa.int64(), nullable=False))
>>> arr = pa.array([[1, 2], [3, 4, None]], typ)
>>> arr

[
  [
1,
2
  ],
  [
3,
4,
null
  ]
]

>>> arr.type
ListType(list)
{code}

Also explicitly validating it doesn't raise:

{code:python}
>>> arr.validate(full=True)
{code}

Is this something we should check?   
What guarantees do we attach to the nullability of a field of a nested type?
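
A manual check one could do today (a sketch; {{flatten()}} returns only the values referenced by the list offsets):

{code:python}
import pyarrow as pa

typ = pa.list_(pa.field("element", pa.int64(), nullable=False))
arr = pa.array([[1, 2], [3, 4, None]], typ)

value_field = arr.type.value_field
if not value_field.nullable and arr.flatten().null_count > 0:
    print("non-nullable list values contain nulls")  # this triggers for the array above
{code}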





--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15477) [C++][Python] Enable ListArray::FromArrays with custom list type (field names/nullability)

2022-01-27 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15477:
-

 Summary: [C++][Python] Enable ListArray::FromArrays with custom 
list type (field names/nullability)
 Key: ARROW-15477
 URL: https://issues.apache.org/jira/browse/ARROW-15477
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Joris Van den Bossche


Currently, when creating a ListArray from the values and offsets, you get a 
"default" list type:

{code:python}
>>> arr = pa.ListArray.from_arrays(pa.array([0, 2, 5], pa.int32()), pa.array([1, 2, 3, 4, 5]))
>>> arr

[
  [
1,
2
  ],
  [
3,
4,
5
  ]
]

>>> arr.type
ListType(list)
{code}

So a type with the default field name ("item") and nullability (true). 
We should allow specifying a type (which needs to be compatible with the passed 
values' type) so you can create a ListArray with specific field names.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15455) [C++] Cast between fixed size list type and variable size list

2022-01-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15455:
-

 Summary: [C++] Cast between fixed size list type and variable size 
list 
 Key: ARROW-15455
 URL: https://issues.apache.org/jira/browse/ARROW-15455
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Casting from fixed size list to variable size list could be possible, I think, 
but currently doesn't work:

{code:python}
>>> fixed_size = pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int64(), 2))
>>> fixed_size.cast(pa.list_(pa.int64()))
...
ArrowNotImplementedError: Unsupported cast from fixed_size_list[2] 
to list using function cast_list
{code}

And in principle, a cast the other way around could also be possible if it is 
checked that each list has the correct length.
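
In the meantime, the first direction can be emulated manually (a sketch, assuming there are no null lists):

{code:python}
import pyarrow as pa

fixed_size = pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int64(), 2))

# rebuild as a variable-size list from the flat values plus regenerated offsets
width = fixed_size.type.list_size
offsets = pa.array(range(0, (len(fixed_size) + 1) * width, width), type=pa.int32())
variable = pa.ListArray.from_arrays(offsets, fixed_size.values)
{code}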



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15394) [CI][Docs] Doxygen not ran in the docs nightly build

2022-01-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15394:
-

 Summary: [CI][Docs] Doxygen not ran in the docs nightly build
 Key: ARROW-15394
 URL: https://issues.apache.org/jira/browse/ARROW-15394
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Documentation
Reporter: Joris Van den Bossche


It was discovered on the nightly dev docs that the C++ API pages are not 
working, because doxygen is not run in the nightly doc build.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15370) [Python] Regression in empty table to_pandas conversion

2022-01-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15370:
-

 Summary: [Python] Regression in empty table to_pandas conversion
 Key: ARROW-15370
 URL: https://issues.apache.org/jira/browse/ARROW-15370
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 7.0.0


Nightly integration tests with kartothek are failing, see eg 
https://github.com/ursacomputing/crossbow/runs/4863725914?check_suite_focus=true

This seems to be something on our side, and a recent regression (the builds 
only started failing today, and I don't see other differences from the last 
working build yesterday).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15365) [Python] Expose full cast options in the pyarrow.compute.cast function

2022-01-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15365:
-

 Summary: [Python] Expose full cast options in the 
pyarrow.compute.cast function
 Key: ARROW-15365
 URL: https://issues.apache.org/jira/browse/ARROW-15365
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently, the {{pc.cast}} function has a {{safe=True/False}} option, which 
provides a short-cut to setting the cast options. 

But the actual kernel has more detailed options that can be tuned, and this is 
already exposed in the CastOptions class in python (allow_int_overflow, 
allow_time_truncate, ...). So we should ensure that we can pass such a 
CastOptions object to the {{cast}} kernel directly as well.
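
As a sketch of what that could look like (the {{options=}} spelling is the proposal, not the current signature):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([300, 1, 2])

# today: only the boolean shortcut is accepted
pc.cast(arr, pa.int8(), safe=False)

# proposed: pass a fully configured CastOptions directly to the kernel
options = pc.CastOptions(target_type=pa.int8(), allow_int_overflow=True)
# pc.cast(arr, options=options)
{code}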



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15364) [Python][Doc] Update filesystem entry in read docstrings

2022-01-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15364:
-

 Summary: [Python][Doc] Update filesystem entry in read docstrings
 Key: ARROW-15364
 URL: https://issues.apache.org/jira/browse/ARROW-15364
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Reporter: Joris Van den Bossche


In several docstrings (of orc.read_table, 
parquet.read_table/ParquetDataset/write_to_dataset), we have something like:

{code}
filesystem : FileSystem, default None
If nothing passed, paths assumed to be found in the local on-disk
filesystem.
{code}

but this is actually no longer up to date. If filesystem is not specified, it 
will be inferred from the path, which can be either a path on local disk or a 
URI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15326) [CI][Gandiva] Ubuntu release build is failing with failing Gandiva tests

2022-01-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15326:
-

 Summary: [CI][Gandiva] Ubuntu release build is failing with 
failing Gandiva tests
 Key: ARROW-15326
 URL: https://issues.apache.org/jira/browse/ARROW-15326
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva, Continuous Integration
Reporter: Joris Van den Bossche
 Fix For: 7.0.0


See eg 
https://github.com/ursacomputing/crossbow/runs/4799525079?check_suite_focus=true

{code}
The following tests FAILED:
 66 - gandiva-internals-test (Failed)
 67 - gandiva-precompiled-test (SEGFAULT)
{code}

cc [~vitor004] [~projjal] [~anthonylouis] (just tagging some people who 
recently contributed to gandiva, I am not familiar with this area myself)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15324) [C++][CI] HDFS test build is failing with segfault (TestLibHdfs::test_mv_rename)

2022-01-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15324:
-

 Summary: [C++][CI] HDFS test build is failing with segfault 
(TestLibHdfs::test_mv_rename)
 Key: ARROW-15324
 URL: https://issues.apache.org/jira/browse/ARROW-15324
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Joris Van den Bossche
 Fix For: 7.0.0


See eg 
https://github.com/ursacomputing/crossbow/runs/4799476838?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15323) [CI] Nightly spark integration builds are failing

2022-01-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15323:
-

 Summary: [CI] Nightly spark integration builds are failing
 Key: ARROW-15323
 URL: https://issues.apache.org/jira/browse/ARROW-15323
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Joris Van den Bossche


See eg 

- test-conda-python-3.7-spark-v3.1.2:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2022-01-13-0-github-test-conda-python-3.7-spark-v3.1.2
- test-conda-python-3.8-spark-v3.2.0:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2022-01-13-0-github-test-conda-python-3.8-spark-v3.2.0
- test-conda-python-3.9-spark-master:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2022-01-13-0-github-test-conda-python-3.9-spark-master

The error message:

{code}
 Error:  Failed to execute goal 
pl.project13.maven:git-commit-id-plugin:2.2.2:revision (for-jars) on project 
arrow-java-root: Could not complete Mojo execution... Unable to find commits 
until some tag: Walk failure. Missing commit 
2ec4e999bfa1e54ea6933cb3857ea5edb4235919 -> [Help 1]
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e 
switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:  
Error:  For more information about the errors and possible solutions, please 
read the following articles:
Error:  [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
{code}


cc [~bryanc]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15321) [Dev][Archery] numpydoc validation doesn't check all class methods

2022-01-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15321:
-

 Summary: [Dev][Archery] numpydoc validation doesn't check all 
class methods
 Key: ARROW-15321
 URL: https://issues.apache.org/jira/browse/ARROW-15321
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Joris Van den Bossche


From discussion at https://github.com/apache/arrow/pull/12076#discussion_r783810077

It seems that by default, it doesn't loop over all _methods_ of classes, but 
only module-level objects?

For example, I notice that explicitly asking for {{pyarrow.Table.to_pandas}} 
catches some issues:

{code}
$ archery numpydoc pyarrow.Table.to_pandas --allow-rule PR10
INFO:archery:Running Python docstring linters
PR10: Parameter "categories" requires a space before the colon separating the 
parameter name and type
PR10: Parameter "use_threads" requires a space before the colon separating the 
parameter name and type
{code}

But with the default (check all of pyarrow) with {{archery numpydoc 
--allow-rule PR10}} it doesn't list those errors.

cc [~kszucs] [~amol-]




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15310) [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path?

2022-01-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15310:
-

 Summary: [C++][Python][Dataset] Detect (and warn?) when 
DirectoryPartitioning is parsing an actually hive-style file path?
 Key: ARROW-15310
 URL: https://issues.apache.org/jira/browse/ARROW-15310
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Joris Van den Bossche


When you have a hive-style partitioned dataset, with our current 
{{dataset(..)}} API, it's relatively easy to mess up the inferred partitioning 
and get confusing results. 

For example, if you specify the partitioning field names with 
{{partitioning=[...]}} (which is not needed for hive style since those are 
inferred), we actually assume you want directory partitioning. This 
DirectoryPartitioning will then parse the hive-style file paths and take the 
full "key=value" as the data values for the field.  
And then, doing a filter can result in a confusing empty result (because 
"value" doesn't match "key=value").

I am wondering if we can't relatively cheaply detect this case, and eg give an 
informative warning about this to the user. 

Basically what happens is this:

{code:python}
>>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
>>> part.parse("part=a")
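# (output elided) -- the result is an expression equating "part" to the full string "part=a"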

{code}

If the parsed value is a string that contains an "=" (and, as in this case, also 
contains the field name), that is, I think, a clear sign that in the large 
majority of cases the user is doing something wrong.

I am not fully sure where and at what stage the check could be done though. 
Doing it for every path in the dataset might be too costly.
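
As an illustration of what such a check could look like (a minimal sketch with a hypothetical helper, not a proposal for where it should live in the C++ code):

{code:python}
import warnings


def warn_if_hive_like(field_name, parsed_value):
    # Hypothetical check: a DirectoryPartitioning value that itself looks like
    # a hive-style "field=value" segment is almost certainly a misconfiguration.
    if isinstance(parsed_value, str) and parsed_value.startswith(field_name + "="):
        warnings.warn(
            f"Partition field '{field_name}' got the value '{parsed_value}', which "
            "looks like a hive-style path segment; did you mean hive partitioning?"
        )


warn_if_hive_like("part", "part=a")  # triggers the warning
{code}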




Illustrative code example:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import pathlib

## constructing a small dataset with 1 hive-style partitioning level

basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
basedir.mkdir(exist_ok=True)

(basedir / "part=a").mkdir(exist_ok=True)
(basedir / "part=b").mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "part=a" / "data.parquet")

table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "part=b" / "data.parquet")
{code}

Reading it as is (not specifying a partitioning, so defaulting to no partitioning) 
will at least give an error about a missing field:

{code:python}
>>> dataset = ds.dataset(basedir)
>>> dataset.to_table(filter=ds.field("part") == "a")
...
ArrowInvalid: No match for FieldRef.Name(part) in a: int64
{code}

But specifying the partitioning field name (which currently gets silently 
interpreted as directory partitioning) gives a confusing empty result:
{code:python}
>>> dataset = ds.dataset(basedir, partitioning=["part"])
>>> dataset.to_table(filter=ds.field("part") == "a")
pyarrow.Table
a: int64
b: int64
part: string

a: []
b: []
part: []
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15307) [C++][Dataset] Provide more context in error message if cast fails during scanning

2022-01-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15307:
-

 Summary: [C++][Dataset] Provide more context in error message if 
cast fails during scanning
 Key: ARROW-15307
 URL: https://issues.apache.org/jira/browse/ARROW-15307
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


If you have a partitioned dataset and one of the files has a column whose type 
mismatches and cannot be safely cast to the dataset schema's type for that 
column, you (as expected) get an error about this cast.

Small illustrative example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import pathlib

## constructing a small dataset with two files

basedir = pathlib.Path(".") / "dataset_test_mismatched_schema"
basedir.mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "data1.parquet")

table2 = pa.table({'a': [1.5, 2.0, 3.0], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "data2.parquet")

## reading the dataset

dataset = ds.dataset(basedir)
# by default infer dataset schema from first file
dataset.schema
# actually reading gives expected error
dataset.to_table()
{code}

gives

{code:python}
>>> dataset.schema
a: int64
b: int64
>>> dataset.to_table()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-...> in <module>
     22 dataset.schema
     23 # actually reading gives expected error
---> 24 dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
pyarrow._dataset.Dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
pyarrow._dataset.Scanner.to_table()

~/scipy/repos/arrow/python/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Float value 1.5 was truncated converting to int64

../src/arrow/compute/kernels/scalar_cast_numeric.cc:177  
CheckFloatToIntTruncation(batch[0], *out)
../src/arrow/compute/exec.cc:700  kernel_->exec(kernel_ctx_, batch, )
../src/arrow/compute/exec.cc:641  ExecuteBatch(batch, listener)
../src/arrow/compute/function.cc:248  executor->Execute(implicitly_cast_args, 
)
../src/arrow/compute/exec/expression.cc:444  compute::Cast(column, 
field->type(), compute::CastOptions::Safe())
../src/arrow/dataset/scanner.cc:816  
compute::MakeExecBatch(*scan_options->dataset_schema, 
partial.record_batch.value)
{code}

So the actual error message (without the extra C++ context) is only 
*"ArrowInvalid: Float value 1.5 was truncated converting to int64"*.

So this error message only says something about the two types and the first 
value that cannot be cast, but if you have a large dataset with many fragments 
and/or many columns, it can be hard to know 1) for which column this is failing 
and 2) for which fragment it is failing.

So it would be nice to add some extra context to the error message.  
The cast itself of course doesn't know this context, but when doing the cast in 
the scanner code we at least know eg the physical schema and the dataset schema, 
so we could append or prepend something like 
"Casting from schema1 to schema2 failed with ..." to the error message.

cc [~alenkaf]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15137) [Dev] Update archery crossbow latest-prefix to work with nightly dates

2021-12-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15137:
-

 Summary: [Dev] Update archery crossbow latest-prefix to work with 
nightly dates
 Key: ARROW-15137
 URL: https://issues.apache.org/jira/browse/ARROW-15137
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15131) [Python] Coerce value_set argument to array in "is_in" kernel

2021-12-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15131:
-

 Summary: [Python] Coerce value_set argument to array in "is_in" 
kernel
 Key: ARROW-15131
 URL: https://issues.apache.org/jira/browse/ARROW-15131
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Small example I ran into:

{code:python}
>>> arr = pa.array(['a', 'b', 'c', 'd'])
>>> pc.is_in(arr, ['a', 'c'])
...
TypeError: "['a', 'c']" is not a valid value set
{code}

That's not a super friendly error message (it was not directly clear what is 
not "valid" about this). Passing {{pa.array(['a', 'c'])}} explicitly works, but I 
expected that the kernel would try this conversion automatically (as we also 
convert the first array argument to an array).
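
For reference, passing an Arrow array works today, and the requested behaviour amounts to something like the small (hypothetical) wrapper below:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(['a', 'b', 'c', 'd'])

# works today: the value_set is already an Arrow array
pc.is_in(arr, value_set=pa.array(['a', 'c']))


# sketch of the requested coercion (hypothetical helper, not pyarrow API)
def is_in_coercing(values, value_set):
    if not isinstance(value_set, (pa.Array, pa.ChunkedArray)):
        value_set = pa.array(value_set)
    return pc.is_in(values, value_set=value_set)


is_in_coercing(arr, ['a', 'c'])  # -> [true, false, true, false]
{code}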



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15117) [Docs] Splitting the sphinx-based Arrow docs into separate sphinx projects

2021-12-15 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15117:
-

 Summary: [Docs] Splitting the sphinx-based Arrow docs into 
separate sphinx projects
 Key: ARROW-15117
 URL: https://issues.apache.org/jira/browse/ARROW-15117
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Joris Van den Bossche
 Fix For: 7.0.0


See the mailing list 
(https://mail-archives.apache.org/mod_mbox/arrow-dev/202112.mbox/%3CCALQtMBbiasQtXYc46kpw-TyQ-TQSPjNQ5%2BkoREuKvJ3hJSdWjw%40mail.gmail.com%3E)
 and this google doc 
(https://docs.google.com/document/d/1AXDNwU5CSnZ1cSeUISwy_xgvTzoYWeuqWApC8UEv97Q/edit?usp=sharing)
 for more context.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15077) [Python] Move Expression class from _dataset to _compute cython module

2021-12-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15077:
-

 Summary: [Python] Move Expression class from _dataset to _compute 
cython module
 Key: ARROW-15077
 URL: https://issues.apache.org/jira/browse/ARROW-15077
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 7.0.0
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche


To follow the move in the C++ code base, and to make it easier to implement 
ARROW-12060



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15043) [Python][Docs] Update type conversion table for pandas <-> arrow

2021-12-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15043:
-

 Summary: [Python][Docs] Update type conversion table for pandas 
<-> arrow
 Key: ARROW-15043
 URL: https://issues.apache.org/jira/browse/ARROW-15043
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 7.0.0


From the mailing list: the table at 
https://arrow.apache.org/docs/python/pandas.html#pandas-arrow-conversion is 
not fully up to date. For example, it doesn't include {{datetime.time}} 
conversion to the {{time64}} type.
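
For reference, a quick check of the conversion in question (the exact inferred type may depend on the pyarrow version):

{code:python}
import datetime

import pyarrow as pa

# datetime.time values are converted to a time64 type (time64[us] with a recent pyarrow)
pa.array([datetime.time(12, 30, 15)]).type
{code}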



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15042) [Python] Consolidate shared methods of RecordBatch and Table

2021-12-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15042:
-

 Summary: [Python] Consolidate shared methods of RecordBatch and 
Table
 Key: ARROW-15042
 URL: https://issues.apache.org/jira/browse/ARROW-15042
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


RecordBatch and Table have a bunch of similar methods that don't directly 
interact with the C++ pointer and that could therefore be shared in a common 
base class.

In addition, some methods currently only on Table (eg {{cast}}, {{group_by}}, 
{{drop}}, {{select}}, {{sort_by}}, {{rename_columns}}) would be useful for 
RecordBatch as well, and could likewise be shared through a common mixin.
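
A minimal sketch of that idea (class and method choice purely illustrative; the real classes are implemented in Cython):

{code:python}
import pyarrow.compute as pc


class _TabularMixin:
    # Shared base class sketch: the method only uses the public API
    # (sort_indices + take), so the same code works for Table and RecordBatch.
    def sort_by(self, sorting):
        keys = [(sorting, "ascending")] if isinstance(sorting, str) else sorting
        return self.take(pc.sort_indices(self, sort_keys=keys))
{code}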



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14990) [CI] Nightly integration for dask is failing because of missing pandas dependency

2021-12-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-14990:
-

 Summary: [CI] Nightly integration for dask is failing because of 
missing pandas dependency
 Key: ARROW-14990
 URL: https://issues.apache.org/jira/browse/ARROW-14990
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Joris Van den Bossche


See https://github.com/apache/arrow/pull/11816#discussion_r762961951



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14967) [CI][Python] Ability to include pip packages in the conda environments

2021-12-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-14967:
-

 Summary: [CI][Python] Ability to include pip packages in the conda 
environments
 Key: ARROW-14967
 URL: https://issues.apache.org/jira/browse/ARROW-14967
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Python
Reporter: Joris Van den Bossche


For creating various conda environments, we currently have files like 
{{conda_env_cpp.txt}}, {{conda_env_sphinx.txt}}, {{conda_env_python.txt}}, etc 

Those can then be combined to create a specific conda environment with the 
subset of features you want, eg from the python docs:

{code}
conda create -y -n pyarrow-dev -c conda-forge \
--file arrow/ci/conda_env_unix.txt \
--file arrow/ci/conda_env_cpp.txt \
--file arrow/ci/conda_env_python.txt \
--file arrow/ci/conda_env_gandiva.txt \
compilers \
python=3.9 \
pandas
{code}

or installed as additional packages into an existing one (eg {{conda install 
--file arrow/ci/conda_env_python.txt}} in conda-python.dockerfile).

One disadvantage of this approach is that (as far as I am aware) you cannot 
list pip packages in those .txt files.   
You can do that with environment.yml files, but those don't really compose 
together the way the txt files do, I think.
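
For comparison, an environment.yml does allow a pip section (illustrative content only; the package name below is a placeholder), but such files can't be stacked with repeated {{--file}} arguments the way the .txt files can:

{code}
name: pyarrow-dev
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas
  - pip
  - pip:
      - some-pip-only-package
{code}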

cc [~kszucs]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

