[jira] [Created] (ARROW-18428) [Website] Enable github issues on arrow-site repo
Joris Van den Bossche created ARROW-18428: - Summary: [Website] Enable github issues on arrow-site repo Key: ARROW-18428 URL: https://issues.apache.org/jira/browse/ARROW-18428 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Joris Van den Bossche Now that we are moving to GitHub issues, it probably makes sense to track issues about the website in its own arrow-site repo, instead of keeping them in the main arrow repo. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18363) [Docs] Include warning when viewing old contributing docs (redirecting to dev docs)
Joris Van den Bossche created ARROW-18363: - Summary: [Docs] Include warning when viewing old contributing docs (redirecting to dev docs) Key: ARROW-18363 URL: https://issues.apache.org/jira/browse/ARROW-18363 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Joris Van den Bossche Now that we have versioned docs, we also host the old versions of the developers docs (eg https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those might be outdated (eg regarding communication channels, build instructions, etc), and typically when contributing / developing with the latest arrow, one should _always_ check the latest dev version of the contributing docs. We could add a warning box pointing this out and linking to the dev docs. For example, similar to how some projects warn about viewing old docs in general and point to the stable docs (eg https://mne.tools/1.1/index.html or https://scikit-learn.org/1.0/user_guide.html). In this case we could have a custom box on pages under /developers that points to the dev docs instead of the stable docs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18340) [Python] PyArrow C++ header files no longer always included in installed pyarrow
Joris Van den Bossche created ARROW-18340: - Summary: [Python] PyArrow C++ header files no longer always included in installed pyarrow Key: ARROW-18340 URL: https://issues.apache.org/jira/browse/ARROW-18340 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche Assignee: Alenka Frim Fix For: 10.0.1 We have a python build env var to control whether the Arrow C++ header files are included in the python package or not ({{PYARROW_BUNDLE_ARROW_CPP_HEADERS}}). This is set to True by default, and only set to False in the conda recipe. After the cmake refactor, the Python C++ header files no longer live in the Arrow C++ package, and so should _always_ be included in the python package, regardless of how arrow-cpp is installed. Initially this was done, but it seems that https://github.com/apache/arrow/pull/13892 removed this unconditional copy of the PyArrow header files to {{pyarrow/include}}. Now they are only copied if {{PYARROW_BUNDLE_ARROW_CPP_HEADERS}} is enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)
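A minimal sketch of the kind of change needed in setup.py, assuming the PyArrow C++ headers live in the Python source tree after the cmake refactor; the helper name and paths are illustrative, not the actual build code:

{code:python}
import os
import shutil

def copy_headers(source_include_dir, target_include_dir, bundle_arrow_cpp_headers):
    # The PyArrow C++ headers no longer ship with the Arrow C++ package,
    # so they should be copied into pyarrow/include unconditionally ...
    pyarrow_cpp = os.path.join(source_include_dir, "arrow", "python")
    shutil.copytree(pyarrow_cpp,
                    os.path.join(target_include_dir, "arrow", "python"),
                    dirs_exist_ok=True)
    # ... while the Arrow C++ headers are only bundled when requested
    # (PYARROW_BUNDLE_ARROW_CPP_HEADERS, which the conda recipe disables).
    if bundle_arrow_cpp_headers:
        shutil.copytree(source_include_dir, target_include_dir, dirs_exist_ok=True)
{code}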
[jira] [Created] (ARROW-18329) [Python][CI] Support ORC in Windows wheels
Joris Van den Bossche created ARROW-18329: - Summary: [Python][CI] Support ORC in Windows wheels Key: ARROW-18329 URL: https://issues.apache.org/jira/browse/ARROW-18329 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Now that we support building with ORC enabled on Windows (ARROW-17817), we could also enable it in the Python wheel packages for Windows (vcpkg seems to have an ORC port for Windows as well). -- This message was sent by Atlassian Jira (v8.20.10#820010)
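A quick smoke test that could be run against a Windows wheel once ORC support is bundled (a sketch; the file name is illustrative):

{code:python}
import pyarrow as pa
from pyarrow import orc

table = pa.table({'a': [1, 2, 3]})
# raises an "Arrow was not built with support for ORC" error on wheels without ORC
orc.write_table(table, "test.orc")
print(orc.read_table("test.orc"))
{code}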
[jira] [Created] (ARROW-18293) [C++] Proxy memory pool crashes with Dataset scanning
Joris Van den Bossche created ARROW-18293: - Summary: [C++] Proxy memory pool crashes with Dataset scanning Key: ARROW-18293 URL: https://issues.apache.org/jira/browse/ARROW-18293 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Discovered while trying to use the proxy memory pool for testing ARROW-18164. See https://github.com/apache/arrow/pull/14516#discussion_r1005433867 This test segfaults (using the {{dataset}} fixture in {{test_dataset.py}}): {code:python} @pytest.mark.parquet def test_scanner_proxy_memory_pool(dataset): proxy_pool = pa.proxy_memory_pool(pa.default_memory_pool()) _ = dataset.to_table(memory_pool=proxy_pool) {code} Response of [~westonpace]: {quote}My guess is that the problem is that the scanner erroneously returns before all work is completely finished. Changing the thread pool or the memory pool too quickly after a scan can lead to this kind of error. The new scanner was created specifically to avoid this problem but it isn't the default yet (still working through some follow-up PRs to make sure we have the same functionality).{quote} So once the new scanner becomes the default, we can check whether this is fixed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18164) [C++][Python] Dataset scanner does not follow default memory pool setting
Joris Van den Bossche created ARROW-18164: - Summary: [C++][Python] Dataset scanner does not follow default memory pool setting Key: ARROW-18164 URL: https://issues.apache.org/jira/browse/ARROW-18164 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche Even if I set the system memory pool as default, it still uses the jemalloc one (running this on Ubuntu where jemalloc is the default if not set by the user): {code} import pyarrow as pa import pyarrow.dataset as ds import pyarrow.parquet as pq pq.write_table(pa.table({'a': [1, 2, 3]}), "test.parquet") In [2]: pa.set_memory_pool(pa.system_memory_pool()) In [3]: pa.total_allocated_bytes() Out[3]: 0 In [4]: table = ds.dataset("test.parquet").to_table() In [5]: pa.total_allocated_bytes() Out[5]: 0 In [6]: pa.set_memory_pool(pa.jemalloc_memory_pool()) In [7]: pa.total_allocated_bytes() Out[7]: 128 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18127) [CI][Python] Have a way to reproduce hypothesis failures from CI
Joris Van den Bossche created ARROW-18127: - Summary: [CI][Python] Have a way to reproduce hypothesis failures from CI Key: ARROW-18127 URL: https://issues.apache.org/jira/browse/ARROW-18127 Project: Apache Arrow Issue Type: Test Components: Continuous Integration, Python Reporter: Joris Van den Bossche We have a nightly test build with hypothesis enabled, and those tests fail / crash from time to time, eg https://github.com/ursacomputing/crossbow/actions/runs/3286024804/jobs/5413689973 Ideally, if there is such a failure, we should actually fix that test case. But that requires us to be able to reproduce the failure locally. If it's an actual test failure, hypothesis should print some information to re-run it locally with the same input (https://hypothesis.readthedocs.io/en/latest/reproducing.html#reproducing-an-example-with-reproduce-failure). But if it is segfaulting, this information is not printed by default. Another idea might be to save the {{.hypothesis/examples}} directory as an artifact on the CI build, to use it locally, but that might have the same issue of not containing the information we need in case of a crash. -- This message was sent by Atlassian Jira (v8.20.10#820010)
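For the non-crashing case, hypothesis can already replay a failure from the blob it prints; a sketch of the two pieces involved (the version and blob below are placeholders):

{code:python}
from hypothesis import given, reproduce_failure, settings, strategies as st

# Asking hypothesis to always print the reproduction blob makes ordinary failures
# easy to replay locally; it does not help when the process segfaults first.
@settings(print_blob=True)
@given(st.binary())
def test_roundtrip(data):
    assert isinstance(data, bytes)

# To replay a CI failure, paste the decorator that hypothesis printed, e.g.:
# @reproduce_failure('6.56.4', b'AXic...')  # version and blob are placeholders
{code}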
[jira] [Created] (ARROW-18126) [Python] ARROW_BUILD_DIR might be ignored for building pyarrow?
Joris Van den Bossche created ARROW-18126: - Summary: [Python] ARROW_BUILD_DIR might be ignored for building pyarrow? Key: ARROW-18126 URL: https://issues.apache.org/jira/browse/ARROW-18126 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche When building pyarrow, I see the following warning: {code} CMake Warning: Manually-specified variables were not used by the project: ARROW_BUILD_DIR {code} While we have a note in our docs (https://arrow.apache.org/docs/dev/developers/python.html#build-and-test) that says: bq. If you used a different directory name for building Arrow C++ (by default it is named “build”), then you should also set the environment variable {{ARROW_BUILD_DIR='name_of_build_dir'}}. This way PyArrow can find the Arrow C++ built files. I see in the setup.py code that we check for this env variable and pass it to CMake, but it's not actually used in any of the CMakeLists.txt files for pyarrow. This might have been accidentally changed in one of the recent cmake refactors? (cc [~kou] [~alenka]) -- This message was sent by Atlassian Jira (v8.20.10#820010)
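For reference, a sketch of the wiring the warning suggests is incomplete: setup.py reads the environment variable and forwards it to CMake, but nothing on the CMake side consumes it (the variable name comes from the docs; the rest is illustrative):

{code:python}
import os

# setup.py side: read the env var and pass it on to CMake
arrow_build_dir = os.environ.get("ARROW_BUILD_DIR", "build")
cmake_args = ["-DARROW_BUILD_DIR={}".format(arrow_build_dir)]
print(cmake_args)
# If no CMakeLists.txt references ARROW_BUILD_DIR, CMake reports it as a
# "manually-specified variable that was not used by the project".
{code}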
[jira] [Created] (ARROW-18125) [Python] Handle pytest 8 deprecations about pytest.warns(None)
Joris Van den Bossche created ARROW-18125: - Summary: [Python] Handle pytest 8 deprecations about pytest.warns(None) Key: ARROW-18125 URL: https://issues.apache.org/jira/browse/ARROW-18125 Project: Apache Arrow Issue Type: Test Reporter: Joris Van den Bossche Fix For: 11.0.0 We have a few warnings about that when running the tests, for example: {code} pyarrow/tests/test_pandas.py::TestConvertMetadata::test_rangeindex_doesnt_warn pyarrow/tests/test_pandas.py::TestConvertMetadata::test_multiindex_doesnt_warn /home/joris/miniconda3/envs/arrow-dev/lib/python3.10/site-packages/_pytest/python.py:192: PytestRemovedIn8Warning: Passing None has been deprecated. See https://docs.pytest.org/en/latest/how-to/capture-warnings.html#additional-use-cases-of-warnings-in-tests for alternatives in common use cases. result = testfunction(**testargs) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
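A common replacement pattern for the deprecated {{pytest.warns(None)}} (a sketch, not the actual test code; the function under test is a placeholder):

{code:python}
import warnings

def convert_without_warning():
    # placeholder for the pyarrow <-> pandas conversion being tested
    return 1 + 1

def test_doesnt_warn():
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # any warning now raises and fails the test
        convert_without_warning()
{code}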
[jira] [Created] (ARROW-18124) [Python] Support converting to non-nano datetime64 for pandas >= 2.0
Joris Van den Bossche created ARROW-18124: - Summary: [Python] Support converting to non-nano datetime64 for pandas >= 2.0 Key: ARROW-18124 URL: https://issues.apache.org/jira/browse/ARROW-18124 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 11.0.0 Pandas is adding capabilities to store non-nanosecond datetime64 data. At the moment, however, we always convert to nanoseconds, regardless of the timestamp resolution of the arrow table (and regardless of the pandas metadata). Using the development version of pandas: {code} In [1]: df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")}) In [2]: df.dtypes Out[2]: col    datetime64[s] dtype: object In [3]: table = pa.table(df) In [4]: table.schema Out[4]: col: timestamp[s] -- schema metadata -- pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 423 In [6]: table.to_pandas().dtypes Out[6]: col    datetime64[ns] dtype: object {code} This is because we have a {{coerce_temporal_nanoseconds}} conversion option which we hardcode to True (for top-level columns, we hardcode it to False for nested data). When users have pandas >= 2, we should support converting while preserving the resolution. We should certainly do so if the pandas metadata indicates which resolution was originally used (to ensure correct roundtrip). We _could_ (and at some point also _should_) also do that by default if there is no pandas metadata (but maybe only later depending on how stable this new feature is in pandas, as it is potentially a breaking change for our users if you use eg pyarrow to read a parquet file). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18107) [C++] Provide more informative error when (CSV/JSON) parsing fails
Joris Van den Bossche created ARROW-18107: - Summary: [C++] Provide more informative error when (CSV/JSON) parsing fails Key: ARROW-18107 URL: https://issues.apache.org/jira/browse/ARROW-18107 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Related to ARROW-18106 (and derived from https://stackoverflow.com/questions/74138746/why-i-cant-parse-timestamp-in-pyarrow). Assume you have the following code to read a JSON file with timestamps. The timestamps have a sub-second part in their string, which fails parsing if you specify it as second resolution timestamp: {code:python} import io import pyarrow as pa from pyarrow import json s_json = """{"column":"2022-09-05T08:08:46.000"}""" opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]), unexpected_field_behavior="ignore") json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) {code} gives: {code} ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000 {code} This error is expected, but I think it could be more informative about the reason why it failed parsing (because at first sight it looks like a proper timestamp string, so you might be left wondering why this is failing). (this might not be that straightforward, though, since there can be many reasons why the parsing is failing) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"
Joris Van den Bossche created ARROW-18106: - Summary: [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer" Key: ARROW-18106 URL: https://issues.apache.org/jira/browse/ARROW-18106 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Not 100% sure this is a "bug", but at least I find it an unexpected interplay between two options. By default, when reading json, we _infer_ the data type of columns, and when specifying an explicit schema, we _also_ by default infer the type of columns that are not specified in the explicit schema. The docs for {{unexpected_field_behavior}}: > How JSON fields outside of explicit_schema (if given) are treated But it seems that if you specify a schema, and the parsing of one of the columns fails according to that schema, we still fall back to this default of inferring the data type (while I would have expected an error, since we should only infer for columns _not_ in the schema). Example code using pyarrow: {code:python} import io import pyarrow as pa from pyarrow import json s_json = """{"column":"2022-09-05T08:08:46.000"}""" opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))])) json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) {code} The parsing fails here because there are milliseconds and the type is "s", but the explicit schema is ignored, and we get a result with a string column instead: {code} pyarrow.Table column: string column: [["2022-09-05T08:08:46.000"]] {code} But when adding {{unexpected_field_behavior="ignore"}}, we actually get the expected parse error: {code:python} opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]), unexpected_field_behavior="ignore") json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) {code} gives {code} ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000 {code} It might be that this is specific to timestamps; I don't directly see a similar issue with eg {{"column": "A"}} and setting the schema to "column" being int64. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18098) [C++] Vector kernel for "intersecting" two arrays (all common elements)
Joris Van den Bossche created ARROW-18098: - Summary: [C++] Vector kernel for "intersecting" two arrays (all common elements) Key: ARROW-18098 URL: https://issues.apache.org/jira/browse/ARROW-18098 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Joris Van den Bossche This would be similar to numpy's {{intersect1d}} (https://numpy.org/doc/stable/reference/generated/numpy.intersect1d.html) -- This message was sent by Atlassian Jira (v8.20.10#820010)
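For reference, a workaround that is possible with existing kernels, assuming only the set of common values is needed (not numpy's sorted, duplicate-free output semantics):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, 2, 3, 4])
b = pa.array([3, 4, 5])

# keep the unique values of `a` that also occur in `b`
unique_a = pc.unique(a)
common = pc.filter(unique_a, pc.is_in(unique_a, value_set=b))
print(common)  # -> [3, 4]
{code}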
[jira] [Created] (ARROW-18097) [C++] Add a "list_contains" kernel
Joris Van den Bossche created ARROW-18097: - Summary: [C++] Add a "list_contains" kernel Key: ARROW-18097 URL: https://issues.apache.org/jira/browse/ARROW-18097 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Joris Van den Bossche Assume you have a list array: {code} arr = pa.array([["a", "b"], ["a", "c"], ["b", "c", "d"]]) {code} And you want to know for each list if it contains a certain value (of the same type as the list's values). A "list_contains" function (or other name) would be useful for that: {code} pc.list_contains(arr, "a") # -> True, True, False {code} The current workaround that I found was flattening, checking equality, and then reducing again with groupby, but this is quite tedious: {code} >>> temp = pa.table({'index': pc.list_parent_indices(arr), 'contains_value': pc.equal(pc.list_flatten(arr), "a")}) >>> temp.group_by('index').aggregate([('contains_value', 'any')])['contains_value_any'].chunk(0) [ true, true, false ] {code} But this also only works if there are no empty or missing list values. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18096) [Dev] Remove github user names from merge commit message
Joris Van den Bossche created ARROW-18096: - Summary: [Dev] Remove github user names from merge commit message Key: ARROW-18096 URL: https://issues.apache.org/jira/browse/ARROW-18096 Project: Apache Arrow Issue Type: Task Components: Developer Tools Reporter: Joris Van den Bossche We currently use the body of the opening comment of a github PR as the body of the commit message. It is not uncommon to tag someone when opening a PR, but retaining those github usernames in the commit message is annoying as that can generate additional notifications for the people that were tagged. It should be straightforward to remove the github user names from the message body (for example, just remove the @, so it no longer works as a username link). -- This message was sent by Atlassian Jira (v8.20.10#820010)
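A minimal sketch of the de-mentioning step (the merge script's actual structure may differ):

{code:python}
import re

def strip_user_mentions(body):
    # Drop the leading "@" so GitHub no longer treats the name as a mention.
    return re.sub(r"@([A-Za-z0-9][-A-Za-z0-9]*)", r"\1", body)

print(strip_user_mentions("Thanks @someuser for reviewing"))
# -> "Thanks someuser for reviewing"
{code}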
[jira] [Created] (ARROW-18088) [Python][CI] Build with pandas master/nightly failure related to timedelta64 resolution
Joris Van den Bossche created ARROW-18088: - Summary: [Python][CI] Build with pandas master/nightly failure related to timedelta64 resolution Key: ARROW-18088 URL: https://issues.apache.org/jira/browse/ARROW-18088 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche The nightly python builds using the pandas development version are failing: https://github.com/ursacomputing/crossbow/actions/runs/3269767207/jobs/5377649455 Example failure: {code} test_parquet_2_0_roundtrip[None-True] _ tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_parquet_2_0_roundtrip_Non0') chunk_size = None, use_legacy_dataset = True @pytest.mark.pandas @parametrize_legacy_dataset @pytest.mark.parametrize('chunk_size', [None, 1000]) def test_parquet_2_0_roundtrip(tempdir, chunk_size, use_legacy_dataset): df = alltypes_sample(size=1, categorical=True) filename = tempdir / 'pandas_roundtrip.parquet' arrow_table = pa.Table.from_pandas(df) assert arrow_table.schema.pandas_metadata is not None _write_table(arrow_table, filename, version='2.6', coerce_timestamps='ms', chunk_size=chunk_size) table_read = pq.read_pandas( filename, use_legacy_dataset=use_legacy_dataset) assert table_read.schema.pandas_metadata is not None read_metadata = table_read.schema.metadata assert arrow_table.schema.metadata == read_metadata df_read = table_read.to_pandas() > tm.assert_frame_equal(df, df_read) E AssertionError: Attributes of DataFrame.iloc[:, 12] (column name="timedelta") are different E E Attribute "dtype" are different E [left]: timedelta64[s] E [right]: timedelta64[ns] opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/tests/parquet/test_data_types.py:76: AssertionError {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18087) [C++] RecordBatch::Equals ignores field names
Joris Van den Bossche created ARROW-18087: - Summary: [C++] RecordBatch::Equals ignores field names Key: ARROW-18087 URL: https://issues.apache.org/jira/browse/ARROW-18087 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche The {{RecordBatch::Equals}} method only checks the equality of the schema of both batches if {{check_metadata=True}}, with the result that it doesn't actually check the schema (eg field names) by default. Python illustration: {code} In [3]: batch1 = pa.record_batch(pd.DataFrame({'a': [1, 2, 3]})) In [4]: batch2 = pa.record_batch(pd.DataFrame({'b': [1, 2, 3]})) In [5]: batch1.equals(batch2) Out[5]: True In [6]: batch1.equals(batch2, check_metadata=True) Out[6]: False {code} My expectation is that RecordBatch equality always requires equal field names (as Table::Equals does). And the {{check_metadata}} keyword should only control whether the metadata of the schema is considered (as the documentation also says), not whether the schema is checked at all. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17989) [C++] Enable struct_field kernel to accept string field names
Joris Van den Bossche created ARROW-17989: - Summary: [C++] Enable struct_field kernel to accept string field names Key: ARROW-17989 URL: https://issues.apache.org/jira/browse/ARROW-17989 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Currently the "struct_field" kernel only works for integer indices for the child fields. From the StructFieldOptions class (https://github.com/apache/arrow/blob/3d7f2f22a0fc441a41b8fa971e11c0f4290ebb24/cpp/src/arrow/compute/api_scalar.h#L283-L285): {code} /// The child indices to extract. For instance, to get the 2nd child /// of the 1st child of a struct or union, this would be {0, 1}. std::vector<int> indices; {code} It would be nice if you could also refer to fields by name in addition to by position. -- This message was sent by Atlassian Jira (v8.20.10#820010)
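An illustration in Python of the current integer-based usage and the proposed name-based one (the name-based call is the proposal, not an existing API at the time of writing):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([{"a": {"b": 1}}, {"a": {"b": 2}}])

# works today: child fields addressed by position
pc.struct_field(arr, indices=[0, 0])   # -> [1, 2]

# proposed: child fields addressed by name
# pc.struct_field(arr, ["a", "b"])     # -> [1, 2]
{code}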
[jira] [Created] (ARROW-17960) [C++] Add kernel for slicing list values
Joris Van den Bossche created ARROW-17960: - Summary: [C++] Add kernel for slicing list values Key: ARROW-17960 URL: https://issues.apache.org/jira/browse/ARROW-17960 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche This would be a scalar kernel "List -> List" (or to fixed size list?), where you can subset the values in each list element. So for example, giving the list array: {code} arr = pa.array([[1, 2, 3], [4, 5, 6, 7], [8, 9]]) {code} we could do something like the following to get the first two elements of each list: {code} pc.list_slice(arr, start=0, stop=2) -> pa.array([[1, 2], [4, 5], [8, 9]]) {code} This would supplement the existing {{list_element}} kernel. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17959) [C++][Dataset]
Joris Van den Bossche created ARROW-17959: - Summary: [C++][Dataset] Key: ARROW-17959 URL: https://issues.apache.org/jira/browse/ARROW-17959 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Currently, when reading a subfield of a nested column of a Parquet file using the Dataset API, we read the full parent column instead of only the requested field. This should be optimized to only read the field itself. This was left as a TODO in ARROW-14658 (https://github.com/apache/arrow/pull/11704) which added the initial support for nested field refs in dataset scanning (https://github.com/apache/arrow/blob/c29ca51f44eaf41c3a2f6f72e3e23a7b428211c2/cpp/src/arrow/dataset/file_parquet.cc#L240-L246): {code} if (field) { // TODO(ARROW-1888): support fine-grained column projection. We should be // able to materialize only the child fields requested, and not the entire // top-level field. // Right now, if enabled, projection/filtering will fail when they cast the // physical schema to the dataset schema. AddColumnIndices(*toplevel, columns_selection); {code} Some relevant comments at https://github.com/apache/arrow/pull/11704#discussion_r749733765. ARROW-1888 was mentioned as a blocker back then, but this is resolved in the meantime. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17925) [Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas?
Joris Van den Bossche created ARROW-17925: - Summary: [Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas? Key: ARROW-17925 URL: https://issues.apache.org/jira/browse/ARROW-17925 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche This was raised in ARROW-17813 by [~changhiskhan]: {quote}*ExtensionArray => pandas* Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism. It's a lot simpler to define an ExtensionScalar with `as_py` than a pandas extension dtype. So if an ExtensionArray doesn't have an equivalent pandas dtype, would it make sense to convert it to just an object series whose elements are the result of `as_py`? {quote} and I also mentioned this in ARROW-17535: {quote}That actually brings up a question: if an ExtensionType defines an ExtensionScalar (but not an associated pandas dtype, or custom to_numpy conversion), should we use this scalar's {{as_py()}} for the to_numpy/to_pandas conversion as well for plain extension arrays? (not the nested case) Because currently, if you have an ExtensionArray like that (for example using the example from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion), we still use the storage type conversion for to_numpy/to_pandas, and only use the scalar's conversion in {{to_pylist}}.{quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
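For context, a sketch along the lines of the custom-scalar-conversion example in the docs, showing where the two conversion paths currently diverge (the type and names are illustrative):

{code:python}
import pyarrow as pa

class Point2DScalar(pa.ExtensionScalar):
    def as_py(self):
        return (self.value["x"].as_py(), self.value["y"].as_py())

class Point2DType(pa.PyExtensionType):
    def __init__(self):
        super().__init__(pa.struct([("x", pa.float64()), ("y", pa.float64())]))

    def __reduce__(self):
        return Point2DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point2DScalar

storage = pa.array([{"x": 1.0, "y": 2.0}], type=Point2DType().storage_type)
arr = pa.ExtensionArray.from_storage(Point2DType(), storage)

print(arr.to_pylist())   # [(1.0, 2.0)]  -- uses ExtensionScalar.as_py()
print(arr.to_pandas())   # currently converts via the storage type instead
{code}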
[jira] [Created] (ARROW-17924) [Docs] Clarify immutability assumption in the C Data Interface documentation
Joris Van den Bossche created ARROW-17924: - Summary: [Docs] Clarify immutability assumption in the C Data Interface documentation Key: ARROW-17924 URL: https://issues.apache.org/jira/browse/ARROW-17924 Project: Apache Arrow Issue Type: Task Components: Documentation, Format Reporter: Joris Van den Bossche The current documentation (https://arrow.apache.org/docs/dev/format/CDataInterface.html) is not explicit about whether there are any guarantees about (im)mutability. My assumption is that the _consumer_ of C Data Interface structs should _assume_ the data to be immutable by default (unless they would know that the producer is fine with mutating the data). But it would be good to document this. (as a reference, the DLPack Python docs mention this: https://dmlc.github.io/dlpack/latest/python_spec.html#semantics) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17890) [C++][Python] Allow an ExtensionType to register or implement custom casts
Joris Van den Bossche created ARROW-17890: - Summary: [C++][Python] Allow an ExtensionType to register or implement custom casts Key: ARROW-17890 URL: https://issues.apache.org/jira/browse/ARROW-17890 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Joris Van den Bossche With ARROW-14500 and ARROW-15545 (https://github.com/apache/arrow/pull/14106), we now allow casting "storage_type" -> "extension" (the cast in the other direction already worked). Initially, that PR allowed any cast from "any" -> "extension", as long as the input type could be cast to the storage type (so deferring to the "any" -> "storage_type" cast). However, because whether a certain cast makes sense or not depends on the semantics of the extension type, it was restricted to exactly matching storage_type. One idea could be to still allow the other casts behind a cast option flag, like {{allow_non_storage_extension_casts}} (or a better name), so the user can explicitly allow casting to/from any type (as long as the cast from/to the storage type works). That could help the user, but for certain casts, the ExtensionType might also want to control _how_ such a cast is done. For example, for casting to/from string type (which would be useful for reading/writing CSV files, or for repr), you typically will want to do something different than casting your storage array to string. A more general solution could thus be to have a mechanism for the ExtensionType to implement a certain cast kernel itself, and register it with the C++ cast dispatching. -- This message was sent by Atlassian Jira (v8.20.10#820010)
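For reference, what already works after the linked PR, using the {{IntegerType}} example type from pyarrow's own test suite (a cast is only accepted when the target's storage type matches exactly):

{code:python}
import pyarrow as pa
from pyarrow.tests.test_extension_type import IntegerType

storage = pa.array([1, 2, 3], pa.int64())
ext_arr = pa.ExtensionArray.from_storage(IntegerType(), storage)

ext_arr.cast(pa.int64())     # extension -> storage: works
storage.cast(IntegerType())  # storage -> extension: works after ARROW-15545
# casting from any other type than the exact storage type is still rejected
{code}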
[jira] [Created] (ARROW-17834) [Python] Allow creating ExtensionArray through pa.array(..) constructor
Joris Van den Bossche created ARROW-17834: - Summary: [Python] Allow creating ExtensionArray through pa.array(..) constructor Key: ARROW-17834 URL: https://issues.apache.org/jira/browse/ARROW-17834 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Currently, creating an ExtensionArray from a python sequence (or numpy array, ..) requires the following: {code:python} from pyarrow.tests.test_extension_type import IntegerType storage_array = pa.array([1, 2, 3]) ext_arr = pa.ExtensionArray.from_storage(IntegerType(), storage_array) {code} While doing this directly in {{pa.array(..)}} doesn't work: {code:python} >>> pa.array([1, 2, 3], type=IntegerType()) ArrowNotImplementedError: extension {code} I think it should be possible to basically do the ExtensionArray.from_storage under the hood in {{pa.array(..)}} when the specified type is an extension type? I think this should also enable converting from a pandas DataFrame (with a column with matching storage values) to a Table with a specified schema that includes an extension type. Like: {code} df = pd.DataFrame({'a': [1, 2, 3]}) pa.table(df, schema=pa.schema([('a', IntegerType())])) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17832) [Python] Construct MapArray from sequence of dicts (instead of list of tuples)
Joris Van den Bossche created ARROW-17832: - Summary: [Python] Construct MapArray from sequence of dicts (instead of list of tuples) Key: ARROW-17832 URL: https://issues.apache.org/jira/browse/ARROW-17832 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche From https://github.com/apache/arrow/issues/14116 Creating a MapArray from a python sequence currently requires lists of tuples as values: {code} arr = pa.array([[('a', 1), ('b', 2)], [('c', 3)]], pa.map_(pa.string(), pa.int64())) {code} While I think it makes sense that the following could also work (using dicts instead): {code} arr = pa.array([{'a': 1, 'b': 2}, {'c': 3}], pa.map_(pa.string(), pa.int64())) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17831) [Python][Docs] PyArrow Architecture page outdated after moving pyarrow C++ code
Joris Van den Bossche created ARROW-17831: - Summary: [Python][Docs] PyArrow Architecture page outdated after moving pyarrow C++ code Key: ARROW-17831 URL: https://issues.apache.org/jira/browse/ARROW-17831 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche This section is no longer up to date: https://arrow.apache.org/docs/dev/python/getting_involved.html#pyarrow-architecture (it still mentions cpp/src/arrow/python) cc [~alenka] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17829) [Python] Avoid pandas groupby deprecation warning write_to_dataset
Joris Van den Bossche created ARROW-17829: - Summary: [Python] Avoid pandas groupby deprecation warning write_to_dataset Key: ARROW-17829 URL: https://issues.apache.org/jira/browse/ARROW-17829 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche I noticed the following warnings in our test builds: {code} opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py::test_make_fragment /opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py:197: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning. for part, chunk in df_d.groupby(["color"]): opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py::test_legacy_write_to_dataset_drops_null opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/parquet/test_pandas.py::test_write_to_dataset_pandas_preserve_extensiondtypes[True] opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/parquet/test_pandas.py::test_write_to_dataset_pandas_preserve_index[True] /opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/parquet/core.py:3326: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning. for keys, subgroup in data_df.groupby(partition_keys): {code} I suppose those are coming from pandas 1.5.0. We should investigate whether this is something to fix in our code (or just in the tests) -- This message was sent by Atlassian Jira (v8.20.10#820010)
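A sketch of the kind of change that would silence the warning in the dataset writing code: group by a single column name rather than a length-1 list (illustrative, not the actual {{parquet/core.py}} code):

{code:python}
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "value": [1, 2, 3]})
partition_keys = ["color"]

# pandas 1.5 warns when iterating a groupby over a length-1 list of keys
keys = partition_keys[0] if len(partition_keys) == 1 else partition_keys
for key, subgroup in df.groupby(keys):
    print(key, len(subgroup))
{code}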
[jira] [Created] (ARROW-17827) [Python] Allow calling UDF kernels with field/scalar expressions
Joris Van den Bossche created ARROW-17827: - Summary: [Python] Allow calling UDF kernels with field/scalar expressions Key: ARROW-17827 URL: https://issues.apache.org/jira/browse/ARROW-17827 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche From https://github.com/apache/arrow/pull/13687#issuecomment-1240399112, where it came up while adding documentation on how to use UDFs in Python. When just wanting to invoke a UDF with arrays, you can do {{pc.call_function("my_udf", [pc.field("a")])}}. But if you want to use your UDF in a context that needs an expression (eg a dataset projection), you need to be able to call the UDF with expressions as argument. And currently, the {{pc.call_function}} doesn't work that way (it expects actual, materialized arrays/scalars as arguments). As a workaround, you can use the private {{Expression._call}}: {code:python} # doesn't work with expressions >>> pc.call_function("my_udf", [pc.field("col")]) ... TypeError: Got unexpected argument type for compute function # workaround >>> pc.Expression._call("my_udf", [pc.field("col")]) {code} So we should try to improve the usability here. Some options: * See if we can change {{pc.call_function}} to also accept Expressions as arguments * Make the {{_call}} public, so one can do {{pc.Expression.call("my_udf", [..])}} cc [~westonpace] [~vibhatha] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17826) [Python] Allow scalars when creating expression from compute kernels
Joris Van den Bossche created ARROW-17826: - Summary: [Python] Allow scalars when creating expression from compute kernels Key: ARROW-17826 URL: https://issues.apache.org/jira/browse/ARROW-17826 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche We can create an expression (eg for a projection) using the compute kernels and passing expressions as arguments. But currently, all other arguments need to be expressions: {code:python} >>> pc.add(pc.field("a"), pc.field("b"))  # this works >>> pc.add(pc.field("a"), 1)  # this fails when passing a scalar (same for pa.scalar(1)) ... TypeError: only other expressions allowed as arguments {code} You can still pass a scalar expression ({{pc.scalar(1)}}, note {{pc.}} not {{pa.}}), but I think for scalars it would be a nice usability improvement if it's not needed to manually convert your python or pyarrow scalar to a scalar expression. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17651) [Python] ResourceWarnings raised by s3 related tests
Joris Van den Bossche created ARROW-17651: - Summary: [Python] ResourceWarnings raised by s3 related tests Key: ARROW-17651 URL: https://issues.apache.org/jira/browse/ARROW-17651 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Running the python tests give a lot of the following warnings: {code} opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/tests/test_fs.py::test_s3fs_limited_permissions_create_bucket /opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/tests/util.py:439: ResourceWarning: unclosed file <_io.TextIOWrapper name=29 encoding='utf-8'> _run_mc_command(mcdir, 'admin', 'policy', 'add', Enable tracemalloc to get traceback where the object was allocated. See https://docs.pytest.org/en/stable/how-to/capture-warnings.html#resource-warnings for more info. {code} Ideally we should ensure the tests don't give such warnings (it also makes other warning we should notice less visible) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17650) [Dev][CI] Add overview of all tasks (including passing) on crossbow dashboard
Joris Van den Bossche created ARROW-17650: - Summary: [Dev][CI] Add overview of all tasks (including passing) on crossbow dashboard Key: ARROW-17650 URL: https://issues.apache.org/jira/browse/ARROW-17650 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Joris Van den Bossche https://crossbow.voltrondata.com/ currently shows the failing tasks, but it would still be useful to have an overview of all tasks, including the passing builds (+ their logs), as well. cc [~raulcd] [~assignUser] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17649) [Python] Remove remaining deprecated APIs from <= 1.0.0
Joris Van den Bossche created ARROW-17649: - Summary: [Python] Remove remaining deprecated APIs from <= 1.0.0 Key: ARROW-17649 URL: https://issues.apache.org/jira/browse/ARROW-17649 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Joris Van den Bossche Not all deprecations from <= 1.0.0 were already handled in ARROW-17010; the remaining ones are: - Ignoring mismatch between {{ordered}} flag of values and type in {{array(..)}} - RecordBatchReader {{get_next_batch}} method - {{DictionaryScalar.index/dictionary_value}} attributes (deprecated since 1.0.0) - {{num_children}} field of DataType - {{add_metadata}} method of Field -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17139) [Python] Add field() method to get field from StructType
Joris Van den Bossche created ARROW-17139: - Summary: [Python] Add field() method to get field from StructType Key: ARROW-17139 URL: https://issues.apache.org/jira/browse/ARROW-17139 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche From ARROW-17047: We could also add a {{field()}} method to {{StructType}} that returns a field? (that is more discoverable than [], and would be consistent with a Schema and with StructArray (to get the child array for that field)) -- This message was sent by Atlassian Jira (v8.20.10#820010)
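An illustration of the proposed API next to what exists today (the {{field()}} call is the proposal, not an existing method at the time of writing):

{code:python}
import pyarrow as pa

struct_type = pa.struct([("a", pa.int64()), ("b", pa.string())])

struct_type[0]            # works today: index-based access returns a Field

# proposed, mirroring Schema.field and StructArray.field:
# struct_type.field("a")  # -> pyarrow.Field<a: int64>
# struct_type.field(0)    # -> pyarrow.Field<a: int64>
{code}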
[jira] [Created] (ARROW-17010) [Python] Remove deprecated APIs from <= 1.0.0
Joris Van den Bossche created ARROW-17010: - Summary: [Python] Remove deprecated APIs from <= 1.0.0 Key: ARROW-17010 URL: https://issues.apache.org/jira/browse/ARROW-17010 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Joris Van den Bossche Fix For: 9.0.0 Some of the APIs listed in ARROW-13555 were deprecated in 1.0.0 or before, and are relatively easy to remove: -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16728) [Python] Switch default and deprecate use_legacy_dataset=True in ParquetDataset
Joris Van den Bossche created ARROW-16728: - Summary: [Python] Switch default and deprecate use_legacy_dataset=True in ParquetDataset Key: ARROW-16728 URL: https://issues.apache.org/jira/browse/ARROW-16728 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 9.0.0 The ParquetDataset() constructor itself still defaults to {{use_legacy_dataset=True}} (although using specific attributes or keywords related to that will raise a warning). So a next step will be to actually deprecate passing that and switch the default, and only afterwards can we remove the code. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16719) [Python] Add path/URI + filesystem handling to parquet.read_metadata
Joris Van den Bossche created ARROW-16719: - Summary: [Python] Add path/URI + filesystem handling to parquet.read_metadata Key: ARROW-16719 URL: https://issues.apache.org/jira/browse/ARROW-16719 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Currently you can pass a local file path or file-like object, or a URI (eg "s3://...") or path+filesystem combo to {{parquet.read_table}}. But the {{parquet.read_metadata}} and {{parquet.read_schema}} methods (being small wrappers around {{ParquetFile}}) only accept the local file path or file-like object. I would propose to add the same path+filesystem handling to those functions as happens in {{read_table}}, to make the capabilities of those consistent. (I ran into this in geopandas, where we use {{read_table}} to read the actual data, but also need {{read_metadata}} to inspect the actual Parquet FileMetaData for metadata) -- This message was sent by Atlassian Jira (v8.20.7#820007)
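An illustration of the proposed consistency, assuming the same handling as {{read_table}} (the S3 URI and filesystem below are placeholders):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"a": [1, 2, 3]}), "test.parquet")

# works today: local path or file-like object
print(pq.read_metadata("test.parquet"))

# proposed: also accept a URI or a path + filesystem, as read_table already does
# pq.read_metadata("s3://my-bucket/test.parquet")
# pq.read_schema("test.parquet", filesystem=some_filesystem)
{code}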
[jira] [Created] (ARROW-16652) [Python][C++] Cast compute kernel segfaults when called with a Table
Joris Van den Bossche created ARROW-16652: - Summary: [Python][C++] Cast compute kernel segfaults when called with a Table Key: ARROW-16652 URL: https://issues.apache.org/jira/browse/ARROW-16652 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche Passing a Table to {{{pyarrow.compute.cast}} with a scalar type gives a segfault: {code} In [1]: table = pa.table({'a': [1, 2]}) In [2]: import pyarrow.compute as pc In [3]: pc.cast(table, pa.int64()) Segmentation fault (core dumped) {code} Backtrace with gdb gives: {code} Thread 1 "python" received signal SIGSEGV, Segmentation fault. 0x7fba01685ada in arrow::DataType::id (this=0x0) at ../src/arrow/type.h:172 172 Type::type id() const { return id_; } (gdb) bt #0 0x7fba01685ada in arrow::DataType::id (this=0x0) at ../src/arrow/type.h:172 #1 0x7fba019e150e in arrow::TypeEquals (left=..., right=..., check_metadata=false) at ../src/arrow/compare.cc:1304 #2 0x7fba01b3484a in arrow::DataType::Equals (this=0x0, other=..., check_metadata=false) at ../src/arrow/type.cc:374 #3 0x7fba01f31678 in arrow::compute::internal::(anonymous namespace)::CastMetaFunction::ExecuteImpl (this=0x55b6ebe63860, args=..., options=0x55b6ec377080, ctx=0x7ffcd8cd43a0) at ../src/arrow/compute/cast.cc:116 #4 0x7fba020d9f39 in arrow::compute::MetaFunction::Execute (this=0x55b6ebe63860, args=..., options=0x55b6ec377080, ctx=0x7ffcd8cd43a0) at ../src/arrow/compute/function.cc:388 #5 0x7fb9ba95c8d9 in __pyx_pf_7pyarrow_8_compute_8Function_6call (__pyx_v_self=0x7fb9b7c19af0, __pyx_v_args=[], __pyx_v_options=0x7fb9b7c1c310, __pyx_v_memory_pool=0x55b6ea466d60 <_Py_NoneStruct>) at /home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.8/_compute.cpp:11292 #6 0x7fb9ba95c3d5 in __pyx_pw_7pyarrow_8_compute_8Function_7call (__pyx_v_self=, __pyx_args=([],), __pyx_kwds={'options': , 'memory_pool': None}) at /home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.8/_compute.cpp:11165 #7 0x55b6ea1fb814 in cfunction_call_varargs (kwargs=, args=, func=) at /home/conda/feedstock_root/build_artifacts/python-split_1606502903469/work/Objects/call.c:772 #8 PyCFunction_Call (func=, args=, kwargs=) at /home/conda/feedstock_root/build_artifacts/python-split_1606502903469/work/Objects/call.c:772 #9 0x7fb9ba9e84e2 in __Pyx_PyObject_Call (func=, arg=([],), kw={'options': , 'memory_pool': None}) at /home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.8/_compute.cpp:57961 #10 0x7fb9ba961add in __pyx_pf_7pyarrow_8_compute_6call_function (__pyx_self=0x0, __pyx_v_name='cast', __pyx_v_args=[], __pyx_v_options=, __pyx_v_memory_pool=None) at /home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.8/_compute.cpp:13408 #11 0x7fb9ba961676 in __pyx_pw_7pyarrow_8_compute_7call_function (__pyx_self=0x0, __pyx_args=('cast', [], ), __pyx_kwds=0x0) ... {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16651) [Python] Casting Table to new schema ignores nullability of fields
Joris Van den Bossche created ARROW-16651: - Summary: [Python] Casting Table to new schema ignores nullability of fields Key: ARROW-16651 URL: https://issues.apache.org/jira/browse/ARROW-16651 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Similar to ARROW-15478, but for casting a full Table rather than nested fields (in theory that could be the same code, but currently the Table.cast logic is implemented in cython). So currently when casting a Table to a new schema, the nullability of the fields in the schema is ignored (and as a result you get an "invalid" schema indicating a field is non-nullable that actually can have nulls): {code} >>> table = pa.table({'a': [None, 1]}) >>> table pyarrow.Table a: int64 a: [[null,1]] >>> new_schema = pa.schema([pa.field("a", "int64", nullable=False)]) >>> table.cast(new_schema) pyarrow.Table a: int64 not null a: [[null,1]] {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16460) [Python] Some dataset tests using PyFileSystem are failing on Windows
Joris Van den Bossche created ARROW-16460: - Summary: [Python] Some dataset tests using PyFileSystem are failing on Windows Key: ARROW-16460 URL: https://issues.apache.org/jira/browse/ARROW-16460 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche We have some dataset tests that are skipped on Windows, because they are failing with FileNotFound errors. * https://github.com/apache/arrow/blob/3c3e68c194ca6ac07086ddc1bb44fe153970213e/python/pyarrow/tests/test_dataset.py#L3261-L3264 * https://github.com/apache/arrow/blob/893faa741f34ee450070503566dafb7291e24d9f/python/pyarrow/tests/test_dataset.py#L3124-L3145 (and see https://github.com/apache/arrow/pull/13033#issuecomment-1116180259 for some analysis) In the second case, it seems that for some reason, the file paths of the fragments are relative paths to the root of the dataset (while locally for me this gives absolute paths). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16458) [Python] Run S3 tests in the nightly dask integration build
Joris Van den Bossche created ARROW-16458: - Summary: [Python] Run S3 tests in the nightly dask integration build Key: ARROW-16458 URL: https://issues.apache.org/jira/browse/ARROW-16458 Project: Apache Arrow Issue Type: Test Components: Continuous Integration, Python Reporter: Joris Van den Bossche As a follow-up on https://github.com/apache/arrow/pull/13033 (ARROW-16413), we should update the {{integration_dask.sh}} script to also run the S3 tests from the dask test suite. See https://github.com/apache/arrow/pull/13033/commits/1bca56e932434d6b0dc947dd51915d83f9dd3a43 (in that commit I removed that again, because it was still failing due to some moto timeout) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16442) [Python] The fragments for ORC dataset return base Fragment instead of FileFragment
Joris Van den Bossche created ARROW-16442: - Summary: [Python] The fragments for ORC dataset return base Fragment instead of FileFragment Key: ARROW-16442 URL: https://issues.apache.org/jira/browse/ARROW-16442 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 9.0.0 From https://github.com/dask/dask/pull/8944#issuecomment-1112620037 For the ORC file format, we return base {{Fragment}} objects instead of the {{FileFragment}} subclass (which has more functionality): {code:python} import pyarrow as pa import pyarrow.dataset as ds from pyarrow import orc table = pa.table({'a': [1, 2, 3]}) orc.write_table(table, "test.orc") dataset = ds.dataset("test.orc", format="orc") fragment = list(dataset.get_fragments())[0] {code} {code} In [9]: fragment Out[9]: <pyarrow._dataset.Fragment object at ...> In [10]: fragment.path --- AttributeError  Traceback (most recent call last) in > 1 fragment.path AttributeError: 'pyarrow._dataset.Fragment' object has no attribute 'path' {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16413) [C++][Python] FileFormat::GetReaderAsync hangs with an fsspec filesystem
Joris Van den Bossche created ARROW-16413: - Summary: [C++][Python] FileFormat::GetReaderAsync hangs with an fsspec filesystem Key: ARROW-16413 URL: https://issues.apache.org/jira/browse/ARROW-16413 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Joris Van den Bossche Fix For: 8.0.0 See https://github.com/dask/dask/pull/8993 for details. When using an fsspec filesystem (or maybe more generally a PyFileSystem), inspecting a file through the FileFormat.inspect is hanging (this eg happens in ParquetDatasetFactory) -- This message was sent by Atlassian Jira (v8.20.7#820007)
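A rough reproducer sketch based on the linked dask PR, assuming fsspec is installed (the dataset path is a placeholder):

{code:python}
import fsspec
import pyarrow.dataset as ds
from pyarrow.fs import FSSpecHandler, PyFileSystem

# wrap an fsspec filesystem so Arrow sees a Python-backed filesystem
py_fs = PyFileSystem(FSSpecHandler(fsspec.filesystem("file")))

# ParquetDatasetFactory inspects the _metadata file through FileFormat::GetReaderAsync,
# which is where the hang was observed with a Python-backed filesystem
dataset = ds.parquet_dataset("/path/to/dataset/_metadata", filesystem=py_fs)
{code}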
[jira] [Created] (ARROW-16339) [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata
Joris Van den Bossche created ARROW-16339: - Summary: [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata Key: ARROW-16339 URL: https://issues.apache.org/jira/browse/ARROW-16339 Project: Apache Arrow Issue Type: Improvement Components: C++, Parquet, Python Reporter: Joris Van den Bossche Context: I ran into this issue when reading Parquet files created by GDAL (using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which writes files that have custom key_value_metadata, but without storing ARROW:schema in those metadata (cc [~paleolimbot]). Both in reading and writing files, I expected that we would map Arrow {{Schema::metadata}} with Parquet {{FileMetaData::key_value_metadata}}. But apparently this doesn't (always) happen out of the box, and only happens through the "ARROW:schema" field (which stores the original Arrow schema, and thus the metadata stored in this schema). For example, when writing a Table with schema metadata, this is not stored directly in the Parquet FileMetaData (code below is using branch from ARROW-16337 to have the {{store_schema}} keyword): {code:python} import pyarrow as pa import pyarrow.parquet as pq table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"}) pq.write_table(table, "test_metadata_with_arrow_schema.parquet") pq.write_table(table, "test_metadata_without_arrow_schema.parquet", store_schema=False) # original schema has metadata >>> table.schema a: int64 -- schema metadata -- key: 'value' # reading back only has the metadata in case we stored ARROW:schema >>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema a: int64 -- schema metadata -- key: 'value' # and not if ARROW:schema is absent >>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema a: int64 {code} It seems that if we store the ARROW:schema, we _also_ store the schema metadata separately. But if {{store_schema}} is False, we also stop writing those metadata (not fully sure if this is the intended behaviour, and that's the reason for the above output): {code:python} # when storing the ARROW:schema, we ALSO store key:value metadata >>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata {b'ARROW:schema': b'/7AQAAAKAA4ABgAFAA...', b'key': b'value'} # when not storing the schema, we also don't store the key:value >>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is None True {code} On the reading side, it seems that we generally do read custom key/value metadata into schema metadata. We don't have the pyarrow APIs at the moment to create such a file (given the above), but with a small patch I could create such a file: {code:python} # a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key >>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata {b'key': b'value'} # this metadata is now correctly mapped to the Arrow schema metadata >>> pq.read_schema("test_metadata_without_arrow_schema2.parquet") a: int64 -- schema metadata -- key: 'value' {code} But if you have a file that has both custom key/value metadata and an "ARROW:schema" key, we actually ignore the custom keys, and only look at the "ARROW:schema" one. 
This was the case that I ran into with GDAL, where I have a file with both keys, but where the custom "geo" key is not also included in the serialized arrow schema in the "ARROW:schema" key: {code:python} # includes both keys in the Parquet file >>> pq.read_metadata("test_gdal.parquet").metadata {b'geo': b'{"version":"0.1.0","...', b'ARROW:schema': b'/3gBAAAQ...'} # the "geo" key is lost in the Arrow schema >>> pq.read_table("test_gdal.parquet").schema.metadata is None True {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16337) [Python] Expose parameter that determines to store Arrow schema in Parquet metadata in Python
Joris Van den Bossche created ARROW-16337: - Summary: [Python] Expose parameter that determines to store Arrow schema in Parquet metadata in Python Key: ARROW-16337 URL: https://issues.apache.org/jira/browse/ARROW-16337 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 9.0.0 There is a {{store_schema}} flag that determines whether we store the Arrow schema in the Parquet metadata (under the {{ARROW:schema}} key) or not. This is exposed in the C++, but not in the Python interface. It would be good to also expose this in the Python layer, to more easily experiment with this (eg to check the impact of having the schema available or not when reading a file) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16336) [Python] Hide internal (common_)metadata related warnings from the user (ParquetDataset)
Joris Van den Bossche created ARROW-16336: - Summary: [Python] Hide internal (common_)metadata related warnings from the user (ParquetDataset) Key: ARROW-16336 URL: https://issues.apache.org/jira/browse/ARROW-16336 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 8.0.0 Small follow-up on ARROW-16121, we missed a few cases where we are internally using those attributes (in the {{equals}} method) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16262) [CI] Kartothek nightly integration build is failing because of Parquet statistics date change
Joris Van den Bossche created ARROW-16262: - Summary: [CI] Kartothek nightly integration build is failing because of Parquet statistics date change Key: ARROW-16262 URL: https://issues.apache.org/jira/browse/ARROW-16262 Project: Apache Arrow Issue Type: Test Components: Continuous Integration, Python Reporter: Joris Van den Bossche Caused by ARROW-7350, see discussion at https://github.com/apache/arrow/pull/12902#issuecomment-1102750381 Upstream issue at https://github.com/JDASoftwareGroup/kartothek/issues/515 In the short term, we should also fix our nightly builds (either by temporarily disabling them altogether, or ideally by skipping those failing tests) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16231) [C++][Python] IPC failure for dictionary with extension type with struct storage type
Joris Van den Bossche created ARROW-16231: - Summary: [C++][Python] IPC failure for dictionary with extension type with struct storage type Key: ARROW-16231 URL: https://issues.apache.org/jira/browse/ARROW-16231 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Joris Van den Bossche Report from [https://github.com/apache/arrow/issues/12899] Roundtripping through IPC/Feather using a dictionary type where the dictionary is an extension type with a nested storage type fails. Writing seems to work (but no idea if the written file is "correct", as trying to read the schema gives an error), but reading it back fails with _"ArrowInvalid: Ran out of field metadata, likely malformed"_. The original use case was from a pandas extension type (the pandas interval dtype is mapped to an arrow extension type with a struct type as storage, and in this case this interval type was further wrapped in a categorical (dictionary) type). A pandas-based test that reproduces this (can be added like this in {{test_feather.py}}): {code:python} @pytest.mark.pandas def test_dictionary_interval(): df = pd.DataFrame({'a': pd.cut(range(1, 10, 3), [-1, 5, 10])}) _check_pandas_roundtrip(df, version=2) {code} this gives: {code:java} $ pytest python/pyarrow/tests/test_feather.py::test_dictionary_interval = FAILURES = test_dictionary_interval ___ pyarrow/_feather.pyx:88: in pyarrow._feather.FeatherReader.read E pyarrow.lib.ArrowInvalid: Ran out of field metadata, likely malformed E ../src/arrow/ipc/reader.cc:266 GetFieldMetadata(field_index_++, out_) E ../src/arrow/ipc/reader.cc:283 LoadCommon(type_id) E ../src/arrow/ipc/reader.cc:324 Load(child_fields[i].get(), parent->child_data[i].get()) E ../src/arrow/ipc/reader.cc:529 loader.Load(, column.get()) E ../src/arrow/ipc/reader.cc:1188 ReadRecordBatchInternal( *message->metadata(), schema_, field_inclusion_mask_, context, reader.get()) E ../src/arrow/ipc/feather.cc:730 reader->ReadRecordBatch(i) pyarrow/error.pxi:100: ArrowInvalid {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16204) [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores "part-{i}.ext" files
Joris Van den Bossche created ARROW-16204: - Summary: [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores "part-{i}.ext" files Key: ARROW-16204 URL: https://issues.apache.org/jira/browse/ARROW-16204 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 8.0.0 While trying to understand a failing test in https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed that the {{write_dataset}} function does not actually always raise an error by default if there is already existing data in the target location. The documentation says it will raise "if any data exists in the destination" (which is also what I would expect), but in practice it seems that it does ignore certain file names:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})

# write a first time to new directory: OK
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write a second time to the same directory: passes, but should raise?
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write another time to the same directory with a different name: still passes
>>> ds.write_dataset(table, "test_overwrite", format="parquet",
...                  basename_template="data-{i}.parquet")
>>> !ls test_overwrite
data-0.parquet  part-0.parquet

# now writing again finally raises an error
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
...
ArrowInvalid: Could not write to test_overwrite as the directory is not empty and existing_data_behavior is to error
{code}

So when checking whether existing data exists, it seems to ignore any files that match the basename template pattern. cc [~westonpace] do you know if this was intentional? (I would find that a strange corner case, and in any case it is also not documented) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16140) [Python] zoneinfo timezones failing during type inference
Joris Van den Bossche created ARROW-16140: - Summary: [Python] zoneinfo timezones failing during type inference Key: ARROW-16140 URL: https://issues.apache.org/jira/browse/ARROW-16140 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche The conversion itself works fine (eg when specifying {{type=pa.timestamp("us", tz="America/New_York")}} in the below example), but inferring the type and timezone from the first value fails if it has a zoneinfo timezone: {code} In [53]: tz = zoneinfo.ZoneInfo(key='America/New_York') In [54]: dt = datetime.datetime(2013, 11, 3, 10, 3, 14, tzinfo = tz) In [55]: pa.array([dt]) ArrowInvalid: Object returned by tzinfo.utcoffset(None) is not an instance of datetime.timedelta {code} cc [~alenkaf] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16123) [Python] Do not include __init__ in the API documentation
Joris Van den Bossche created ARROW-16123: - Summary: [Python] Do not include __init__ in the API documentation Key: ARROW-16123 URL: https://issues.apache.org/jira/browse/ARROW-16123 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Reporter: Joris Van den Bossche From https://github.com/apache/arrow/pull/12698#discussion_r836484176 We should try to instruct sphinx/autodoc/numpydoc to not include {{\_\_init\_\_}} functions in the reference docs, as I don't think we have any case where this adds value (compared to the class docstring). See eg https://arrow.apache.org/docs/dev/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset.__init__ cc [~alenkaf] -- This message was sent by Atlassian Jira (v8.20.1#820001)
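One possible way to do this in the Sphinx configuration, sketched with the generic {{autodoc-skip-member}} event (numpydoc or autosummary settings may offer a more direct switch):

{code:python}
# docs/source/conf.py (sketch)

def skip_init(app, what, name, obj, skip, options):
    # Never generate a separate entry for __init__; the class docstring
    # already documents construction.
    if name == "__init__":
        return True
    return skip  # keep autodoc's default decision for everything else


def setup(app):
    app.connect("autodoc-skip-member", skip_init)
{code}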
[jira] [Created] (ARROW-16122) [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset
Joris Van den Bossche created ARROW-16122: - Summary: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset Key: ARROW-16122 URL: https://issues.apache.org/jira/browse/ARROW-16122 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Joris Van den Bossche Fix For: 8.0.0 Currently, the {{pq.write_to_dataset}} function also has a {{use_legacy_dataset}} keyword, but we should: 1) in case of {{use_legacy_dataset=True}}, ensure we raise deprecation warnings for all keywords that won't be supported in the new implementation (eg {{partition_filename_cb}}) 2) raise a deprecation warning for {{use_legacy_dataset=True}}, and/or already switch the default? -- This message was sent by Atlassian Jira (v8.20.1#820001)
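An illustrative sketch of point 1), assuming the warning is raised inside {{write_to_dataset}} when a legacy-only keyword is passed (the signature shown here is heavily simplified and not the actual one):

{code:python}
import warnings

def write_to_dataset(table, root_path, partition_filename_cb=None,
                     use_legacy_dataset=True, **kwargs):
    if partition_filename_cb is not None:
        warnings.warn(
            "The 'partition_filename_cb' keyword is deprecated and will not "
            "be supported by the new dataset-based implementation.",
            FutureWarning, stacklevel=2)
    ...  # rest of the implementation elided
{code}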
[jira] [Created] (ARROW-16121) [Python] Deprecate the (common_)metadata(_path) attributes of ParquetDataset
Joris Van den Bossche created ARROW-16121: - Summary: [Python] Deprecate the (common_)metadata(_path) attributes of ParquetDataset Key: ARROW-16121 URL: https://issues.apache.org/jira/browse/ARROW-16121 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Joris Van den Bossche Fix For: 8.0.0 The custom python ParquetDataset implementation exposes the {{metadata}}, {{metadata_path}}, {{common_metadata}} and {{common_metadata_path}} attributes, something for which we didn't add an equivalent to the new dataset API. Unless we still want to add something for this, we should deprecate those attributes in the legacy ParquetDataset. In addition, we should also deprecate passing the {{metadata}} keyword in the ParquetDataset constructor. -- This message was sent by Atlassian Jira (v8.20.1#820001)
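For illustration, a sketch of what deprecating such an attribute could look like (heavily simplified, not the actual ParquetDataset code):

{code:python}
import warnings

class ParquetDataset:
    def __init__(self, paths, common_metadata=None):
        self._common_metadata = common_metadata

    @property
    def common_metadata(self):
        # Warn on user access; internal code can keep using _common_metadata.
        warnings.warn(
            "'ParquetDataset.common_metadata' is deprecated and will be "
            "removed in a future version.",
            DeprecationWarning, stacklevel=2)
        return self._common_metadata
{code}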
[jira] [Created] (ARROW-16120) [Python] ParquetDataset deprecation: change Deprecation to FutureWarnings
Joris Van den Bossche created ARROW-16120: - Summary: [Python] ParquetDataset deprecation: change Deprecation to FutureWarnings Key: ARROW-16120 URL: https://issues.apache.org/jira/browse/ARROW-16120 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Joris Van den Bossche Fix For: 8.0.0 We are currently using DeprecationWarnings for the deprecations, but now that they have been in place for some time, we can change them to the more user-visible FutureWarning. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16119) [Python] Deprecate the legacy ParquetDataset custom python-based implementation
Joris Van den Bossche created ARROW-16119: - Summary: [Python] Deprecate the legacy ParquetDataset custom python-based implementation Key: ARROW-16119 URL: https://issues.apache.org/jira/browse/ARROW-16119 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Joris Van den Bossche To be able to remove the custom python implementation (ARROW-15868), we first need to deprecate the various aspects. This issue is meant as a parent issue to keep an overview of the different tasks. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16113) [Python] Partitioning.dictionaries in case of a subset of fields are dictionary encoded
Joris Van den Bossche created ARROW-16113: - Summary: [Python] Partitioning.dictionaries in case of a subset of fields are dictionary encoded Key: ARROW-16113 URL: https://issues.apache.org/jira/browse/ARROW-16113 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Follow-up on ARROW-14612, see the discussion at https://github.com/apache/arrow/pull/12530#discussion_r841760449 ARROW-14612 changes the return value of the {{dictionaries}} attribute from None to a list in case some of the partitioning schema fields are not dictionary encoded. But this can result in an unclear mapping between arrays in {{Partitioning.dictionaries}} and fields in {{Partitioning.schema}} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16107) [CI][Archery] Fix archery crossbow query to get latest prefix
Joris Van den Bossche created ARROW-16107: - Summary: [CI][Archery] Fix archery crossbow query to get latest prefix Key: ARROW-16107 URL: https://issues.apache.org/jira/browse/ARROW-16107 Project: Apache Arrow Issue Type: Test Components: Continuous Integration, Developer Tools Reporter: Joris Van den Bossche This feature stopped working when the crossbow builds were split into 3 parts -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16018) [Doc][Python] Run doctests on Python docstring examples
Joris Van den Bossche created ARROW-16018: - Summary: [Doc][Python] Run doctests on Python docstring examples Key: ARROW-16018 URL: https://issues.apache.org/jira/browse/ARROW-16018 Project: Apache Arrow Issue Type: Test Components: Documentation, Python Reporter: Joris Van den Bossche We are starting to add more and more examples to the docstrings of Python methods (ARROW-15367), and so we could use the doctest functionality to ensure that those examples are actually correct (and stay correct). Pytest has integration for doctests (https://docs.pytest.org/en/6.2.x/doctest.html), and so you can do: {code} pytest python/pyarrow --doctest-modules {code} This currently fails for me because pyarrow.cuda is not available, so we will need to find a way to automatically skip those parts when they are not available. -- This message was sent by Atlassian Jira (v8.20.1#820001)
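One possible way to implement the skipping, sketched with pytest's {{collect_ignore}} hook in a {{conftest.py}} (the exact paths and modules here are assumptions):

{code:python}
# conftest.py (sketch)
collect_ignore = []

try:
    import pyarrow.cuda  # noqa: F401
except ImportError:
    # Don't collect doctests from modules whose optional dependency is not
    # available in this environment.
    collect_ignore.append("pyarrow/cuda.py")
{code}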
[jira] [Created] (ARROW-15997) [CI] Nightly turbodbc build is failing (C++ compilation error)
Joris Van den Bossche created ARROW-15997: - Summary: [CI] Nightly turbodbc build is failing (C++ compilation error) Key: ARROW-15997 URL: https://issues.apache.org/jira/browse/ARROW-15997 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Joris Van den Bossche See eg https://github.com/ursacomputing/crossbow/runs/5637809188?check_suite_focus=true The error seems related to boost (and not Arrow), and happens in the C++ code of turbodbc. But it is strange that it happens in both the latest and master turbodbc build (so it's not caused by a change on turbodbc's side). And I also didn't see a change in the boost version compared to the last successful build. cc [~uwe] {code} [102/156] Building CXX object cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o FAILED: cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o /opt/conda/envs/arrow/bin/x86_64-conda-linux-gnu-c++ -I/turbodbc/cpp/turbodbc/Library -I/turbodbc/cpp/turbodbc/../cpp_odbc/Library -I/turbodbc/cpp/turbodbc/Test -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/envs/arrow/include -Wall -Wextra -g -O0 -pedantic -std=c++11 -MD -MT cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o -MF cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o.d -o cpp/turbodbc/Test/CMakeFiles/turbodbc_test.dir/tests/field_translator_test.cpp.o -c /turbodbc/cpp/turbodbc/Test/tests/field_translator_test.cpp In file included from /opt/conda/envs/arrow/include/boost/type_index/stl_type_index.hpp:32, from /opt/conda/envs/arrow/include/boost/type_index.hpp:29, from /opt/conda/envs/arrow/include/boost/variant/variant.hpp:21, from /turbodbc/cpp/turbodbc/Library/turbodbc/field.h:3, from /turbodbc/cpp/turbodbc/Library/turbodbc/field_translator.h:3, from /turbodbc/cpp/turbodbc/Test/tests/field_translator_test.cpp:1: /opt/conda/envs/arrow/include/boost/optional/optional.hpp: In instantiation of 'std::basic_ostream<_CharT, _Traits>& boost::operator<<(std::basic_ostream<_CharT, _Traits>&, const boost::optional_detail::optional_tag&) [with CharType = char; CharTrait = std::char_traits]': /opt/conda/envs/arrow/include/gtest/gtest-printers.h:215:9: required from 'static void testing::internal::internal_stream_operator_without_lexical_name_lookup::StreamPrinter::PrintValue(const T&, std::ostream*) [with T = boost::optional, std::allocator >, bool, double, boost::gregorian::date, boost::posix_time::ptime> >; = void; = std::basic_ostream&; std::ostream = std::basic_ostream]' /opt/conda/envs/arrow/include/gtest/gtest-printers.h:312:22: required from 'void testing::internal::PrintWithFallback(const T&, std::ostream*) [with T = boost::optional, std::allocator >, bool, double, boost::gregorian::date, boost::posix_time::ptime> >; std::ostream = std::basic_ostream]' /opt/conda/envs/arrow/include/gtest/gtest-printers.h:441:30: required from 'void testing::internal::PrintTo(const T&, std::ostream*) [with T = boost::optional, std::allocator >, bool, double, boost::gregorian::date, boost::posix_time::ptime> >; std::ostream = std::basic_ostream]' /opt/conda/envs/arrow/include/gtest/gtest-printers.h:691:12: required from 'static void testing::internal::UniversalPrinter::Print(const T&, std::ostream*) [with T = boost::optional, std::allocator >, bool, double, boost::gregorian::date, 
boost::posix_time::ptime> >; std::ostream = std::basic_ostream]' /opt/conda/envs/arrow/include/gtest/gtest-printers.h:980:30: required from 'void testing::internal::UniversalPrint(const T&, std::ostream*) [with T = boost::optional, std::allocator >, bool, double, boost::gregorian::date, boost::posix_time::ptime> >; std::ostream = std::basic_ostream]' /opt/conda/envs/arrow/include/gtest/gtest-printers.h:865:19: [ skipping 2 instantiation contexts, use -ftemplate-backtrace-limit=0 to disable ] /opt/conda/envs/arrow/include/gtest/gtest-printers.h:334:36: required from 'static std::string testing::internal::FormatForComparison::Format(const ToPrint&) [with ToPrint = boost::optional, std::allocator >, bool, double, boost::gregorian::date, boost::posix_time::ptime> >; OtherOperand = boost::optional, std::allocator >, bool, double, boost::gregorian::date, boost::posix_time::ptime> >; std::string = std::__cxx11::basic_string]' /opt/conda/envs/arrow/include/gtest/gtest-printers.h:415:45: required from 'std::string testing::internal::FormatForComparisonFailureMessage(const T1&, const T2&) [with T1 = boost::optional, std::allocator >, bool,
[jira] [Created] (ARROW-15960) [Python] Segfault constructing a fixed size list array of size 0 with dictionary values
Joris Van den Bossche created ARROW-15960: - Summary: [Python] Segfault constructing a fixed size list array of size 0 with dictionary values Key: ARROW-15960 URL: https://issues.apache.org/jira/browse/ARROW-15960 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche The following example constructing a FixedSizeList array with list size 0 and dictionary values from an explicit None value (extracted from a segfaulting hypothesis test) crashes: {code} pa.array([None], pa.list_(pa.dictionary(pa.int32(), pa.string()), 0)) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15884) [C++][Doc] Document that the strptime kernel ignores %Z
Joris Van den Bossche created ARROW-15884: - Summary: [C++][Doc] Document that the strptime kernel ignores %Z Key: ARROW-15884 URL: https://issues.apache.org/jira/browse/ARROW-15884 Project: Apache Arrow Issue Type: Improvement Components: C++, Documentation Reporter: Joris Van den Bossche After ARROW-12820, the {{strptime}} kernel still ignores the {{%Z}} specifier (for timezone names), and when it is used, it effectively accepts any string in that position. For example:

{code:python}
>>> import pyarrow.compute as pc

# the %z specifier now works (after ARROW-12820)
>>> pc.strptime(["2022-03-05 09:00:00+01"], format="%Y-%m-%d %H:%M:%S%z",
...             unit="us")
[ 2022-03-05 08:00:00.00 ]

# in theory this should give the same result, but %Z is still ignored
>>> pc.strptime(["2022-03-05 09:00:00 CET"], format="%Y-%m-%d %H:%M:%S %Z",
...             unit="us")
[ 2022-03-05 09:00:00.00 ]

# as a result any garbage in the string is also ignored
>>> pc.strptime(["2022-03-05 09:00:00 blabla"], format="%Y-%m-%d %H:%M:%S %Z",
...             unit="us")
[ 2022-03-05 09:00:00.00 ]
{code}

I don't think it is easy to actually fix this (at least as long as we use the system strptime, see also https://github.com/apache/arrow/pull/11358#issue-1020404727). But at least we should document this limitation / gotcha. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15883) [C++] Support for fractional seconds in strptime() for ISO format?
Joris Van den Bossche created ARROW-15883: - Summary: [C++] Support for fractional seconds in strptime() for ISO format? Key: ARROW-15883 URL: https://issues.apache.org/jira/browse/ARROW-15883 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Currently, we can't parse "our own" string representation of a timestamp array with the timestamp parser {{strptime}}:

{code:python}
import datetime
import pyarrow as pa
import pyarrow.compute as pc

>>> pa.array([datetime.datetime(2022, 3, 5, 9)])
[ 2022-03-05 09:00:00.00 ]

# trying to parse the above representation as string
>>> pc.strptime(["2022-03-05 09:00:00.00"], format="%Y-%m-%d %H:%M:%S",
...             unit="us")
...
ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.00' as a scalar of type timestamp[us]
{code}

The reason for this is the fractional second part, so the following works:

{code:python}
>>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")
[ 2022-03-05 09:00:00.00 ]
{code}

I think the reason this fails is that {{strptime}} only supports parsing seconds as an integer (https://man7.org/linux/man-pages/man3/strptime.3.html). But it creates a strange situation where the timestamp parser cannot parse the representation we use for timestamps. In addition, for CSV we have a custom ISO parser (used by default), so when parsing the strings while reading a CSV file, the same string with fractional seconds does work:

{code:python}
import io
from pyarrow import csv

s = b"""a
2022-03-05 09:00:00.00"""

>>> csv.read_csv(io.BytesIO(s))
pyarrow.Table
a: timestamp[ns]
a: [[2022-03-05 09:00:00.0]]
{code}

cc [~apitrou] [~rokm] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15882) [Ci][Python] Nightly hypothesis build is not actually running the hypothesis tests
Joris Van den Bossche created ARROW-15882: - Summary: [Ci][Python] Nightly hypothesis build is not actually running the hypothesis tests Key: ARROW-15882 URL: https://issues.apache.org/jira/browse/ARROW-15882 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Python Reporter: Joris Van den Bossche Fix For: 8.0.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15871) [Python] Start raising deprecation warnings for ParquetDataset keywords that won't be supported with the new API
Joris Van den Bossche created ARROW-15871: - Summary: [Python] Start raising deprecation warnings for ParquetDataset keywords that won't be supported with the new API Key: ARROW-15871 URL: https://issues.apache.org/jira/browse/ARROW-15871 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 8.0.0 Currently, the {{ParquetDataset}} API itself still defaults to the legacy implementation ({{parquet.read_table}} already defaults to the new) and also still supports some keywords that won't be supported with the new implementation. So if we want to remove the old implementation at some point (ARROW-15868), we should start deprecating those options, and also start defaulting to the new implementation when possible. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15870) [Python] Start to raise deprecation warnings when using use_legacy_dataset=True in parquet.py
Joris Van den Bossche created ARROW-15870: - Summary: [Python] Start to raise deprecation warnings when using use_legacy_dataset=True in parquet.py Key: ARROW-15870 URL: https://issues.apache.org/jira/browse/ARROW-15870 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 8.0.0 Currently, users can still specify {{use_legacy_dataset=True}} explicitly to get the old implementation/behaviour. But if we want to remove that implementation at some point (ARROW-15868), we should start deprecating that option, to further nudge people to the new implementation. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15868) [Python] Remove the legacy ParquetDataset custom python-based implementation
Joris Van den Bossche created ARROW-15868: - Summary: [Python] Remove the legacy ParquetDataset custom python-based implementation Key: ARROW-15868 URL: https://issues.apache.org/jira/browse/ARROW-15868 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Joris Van den Bossche We might want to keep the actual {{ParquetDataset}} class (ARROW-9720), but we should still remove the custom / legacy implementation (which is using the deprecated filesystem interface, so this is also blocking ARROW-15761) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15867) [Python] Ignored exception printed when pandas is not installed
Joris Van den Bossche created ARROW-15867: - Summary: [Python] Ignored exception printed when pandas is not installed Key: ARROW-15867 URL: https://issues.apache.org/jira/browse/ARROW-15867 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 8.0.0 When you don't have pandas installed, you can get an "error" like {code} Exception ignored in: 'pyarrow.lib._PandasAPIShim._have_pandas_internal' Traceback (most recent call last): File "pyarrow/pandas-shim.pxi", line 110, in pyarrow.lib._PandasAPIShim._check_import File "pyarrow/pandas-shim.pxi", line 59, in pyarrow.lib._PandasAPIShim._import_pandas AttributeError: module 'pandas' has no attribute '__version__' {code} This is not an actual error that interrupts your Python session (it's an ignored exception), but we should of course still ensure to not print it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
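A sketch of the kind of defensive check that would avoid the printed exception (the helper name is illustrative, not the actual pyarrow internal):

{code:python}
def _have_usable_pandas():
    try:
        import pandas
    except ImportError:
        return False
    # A stub or partially-installed module may lack __version__; treat that
    # the same as "pandas not installed" instead of letting the
    # AttributeError escape (and get printed as an ignored exception).
    return getattr(pandas, "__version__", None) is not None
{code}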
[jira] [Created] (ARROW-15847) [Python] Building with Parquet but without Parquet encryption fails
Joris Van den Bossche created ARROW-15847: - Summary: [Python] Building with Parquet but without Parquet encryption fails Key: ARROW-15847 URL: https://issues.apache.org/jira/browse/ARROW-15847 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Locally (with Parquet enabled, but no Parquet encryption, both on C++ and Python level), I get: {code} CMake Error at CMakeLists.txt:643 (target_link_libraries): Cannot specify link libraries for target "_parquet_encryption" which is not built by this project. -- Configuring incomplete, errors occurred! {code} (also after cleaning up old build files) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15761) [Python] Remove the deprecated pyarrow.filesystem legacy implementations
Joris Van den Bossche created ARROW-15761: - Summary: [Python] Remove the deprecated pyarrow.filesystem legacy implementations Key: ARROW-15761 URL: https://issues.apache.org/jira/browse/ARROW-15761 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Joris Van den Bossche Fix For: 8.0.0 The {{pyarrow.filesystem}} and {{pyarrow.hdfs}} modules have been deprecated since 2.0.0, and the warnings were changed from DeprecationWarning to FutureWarning in 4.0.0. I think it is time to actually remove them, and I would propose to do so in 8.0.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15760) [C++] Avoid hard dependency on git in cmake (download tarballs from github instead)
Joris Van den Bossche created ARROW-15760: - Summary: [C++] Avoid hard dependency on git in cmake (download tarballs from github instead) Key: ARROW-15760 URL: https://issues.apache.org/jira/browse/ARROW-15760 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche See https://github.com/apache/arrow/pull/12322#issuecomment-1048523391 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15720) [CI] Nightly dask build is failing due to wrong usage of Array.to_pandas
Joris Van den Bossche created ARROW-15720: - Summary: [CI] Nightly dask build is failing due to wrong usage of Array.to_pandas Key: ARROW-15720 URL: https://issues.apache.org/jira/browse/ARROW-15720 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Joris Van den Bossche This failure is triggered by a change in Arrow (addition of {{types_mapper}} keyword to {{pa.Array.to_pandas}}), but the cause is a wrong usage of that in dask. I already fixed that on the dask side: https://github.com/dask/dask/pull/8733 But we should still skip the test on our side (will be needed until that PR is merged + released) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15711) [C++][Parquet] Extension types with nanosecond timestamp resolution don't roundtrip
Joris Van den Bossche created ARROW-15711: - Summary: [C++][Parquet] Extension types with nanosecond timestamp resolution don't roundtrip Key: ARROW-15711 URL: https://issues.apache.org/jira/browse/ARROW-15711 Project: Apache Arrow Issue Type: Bug Components: C++, Parquet Reporter: Joris Van den Bossche Example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

class MyTimestampType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.timestamp("ns"))

    def __reduce__(self):
        return MyTimestampType, ()

arr = MyTimestampType().wrap_array(pa.array([1000, 2000, 3000], pa.timestamp("ns")))
table = pa.table({"col": arr})
{code}

{code}
>>> table.schema
col: extension>
>>> pq.write_table(table, "test_parquet_extension_type_timestamp_ns.parquet")
>>> result = pq.read_table("test_parquet_extension_type_timestamp_ns.parquet")
>>> result.schema
col: timestamp[us]
{code}

The reason for this is that we only restore the extension type if the inferred storage type (inferred from parquet + after applying any updates based on the Arrow schema) exactly equals the original storage type (as stored in the Arrow schema): https://github.com/apache/arrow/blob/afaa92e7e4289d6e4f302cc91810368794e8092b/cpp/src/parquet/arrow/schema.cc#L973-L977 And, with the default options, a timestamp with nanosecond resolution gets stored as microsecond resolution in Parquet, and that is something we do not restore when updating the read types based on the stored Arrow schema (eg we do add a timezone, but we don't change the resolution). An additional issue is that _if_ you lose the extension type, the field metadata about the extension type are also lost. I think that if we cannot restore the extension type, we should at least try to keep the ARROW:extension field metadata as information. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15652) [C++] GDB plugin printer gives error with extension type
Joris Van den Bossche created ARROW-15652: - Summary: [C++] GDB plugin printer gives error with extension type Key: ARROW-15652 URL: https://issues.apache.org/jira/browse/ARROW-15652 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Copying the example from ARROW-9078 {code} import pyarrow as pa import pyarrow.parquet as pq class MyStructType(pa.PyExtensionType): def __init__(self): pa.PyExtensionType.__init__(self, pa.struct([('left', pa.int64()), ('right', pa.int64())])) def __reduce__(self): return MyStructType, () struct_array = pa.StructArray.from_arrays( [ pa.array([0, 1], type="int64", from_pandas=True), pa.array([1, 2], type="int64", from_pandas=True), ], names=["left", "right"], ) mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array) table = pa.table({'a': mystruct_array}) pq.write_table(table, "test_struct.parquet") {code} What I was doing is then reading the table back in, with a breakpoint at {{ApplyOriginalMetadata}}. But I suppose any other way to get into the debugger is fine as well (and maybe also with a simpler extension type, i.e. not with a struct type as storage type, I didn't yet try that). This gives: {code} (gdb) p origin_field $3 = (const arrow::Field &) @0x555bbb308190: Python Exception A syntax error in expression, near `) (0x555bbb277020)).ToString()'.: arrow::field("a", ) {code} for the field/type being extension type cc [~apitrou] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15643) [C++] Kernel to select subset of fields of a StructArray
Joris Van den Bossche created ARROW-15643: - Summary: [C++] Kernel to select subset of fields of a StructArray Key: ARROW-15643 URL: https://issues.apache.org/jira/browse/ARROW-15643 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Triggered by https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure. I thought there was already an issue about this, but I can't directly find one. Assume you have a struct array with some fields:

{code}
>>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
>>> arr.type
StructType(struct)
{code}

We have a kernel to select a single child field:

{code}
>>> pc.struct_field(arr, [0])
[ 1, 2, 3 ]
{code}

But if you want to subset the StructArray to some of its fields, resulting in a new StructArray, that's not possible with {{struct_field}}, and doing this manually is a bit cumbersome:

{code}
>>> fields = ['a', 'c']
>>> arrays = [arr.field(n) for n in fields]
>>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
>>> arr_subset.type
StructType(struct)
{code}

(this is still OK, but if you had a ChunkedArray, it certainly gets annoying) -- This message was sent by Atlassian Jira (v8.20.1#820001)
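For the ChunkedArray case mentioned above, the manual workaround could be wrapped in a small helper (a sketch only; {{select_struct_fields}} is not an existing pyarrow function, and top-level nulls of the struct are not preserved here):

{code:python}
import pyarrow as pa

def select_struct_fields(chunked_array, names):
    # Rebuild each chunk as a StructArray containing only the requested
    # child fields, then reassemble the chunks.
    chunks = [
        pa.StructArray.from_arrays([chunk.field(n) for n in names], names=names)
        for chunk in chunked_array.chunks
    ]
    return pa.chunked_array(chunks)
{code}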
[jira] [Created] (ARROW-15601) [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs
Joris Van den Bossche created ARROW-15601: - Summary: [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs Key: ARROW-15601 URL: https://issues.apache.org/jira/browse/ARROW-15601 Project: Apache Arrow Issue Type: Sub-task Components: Documentation Reporter: Joris Van den Bossche Fix For: 8.0.0, 7.0.1 xref https://github.com/apache/arrow-site/pull/187 We need to update the {{post-09-docs.sh}} script to keep the dev docs and to move the current stable docs to a versioned sub-directory -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15564) [C++] Expose MergeOptions in Concatenate to unify types
Joris Van den Bossche created ARROW-15564: - Summary: [C++] Expose MergeOptions in Concatenate to unify types Key: ARROW-15564 URL: https://issues.apache.org/jira/browse/ARROW-15564 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche The {{arrow::ConcatenateTables}} function exposes the {{Field::MergeOptions}} as a way to indicate how fields with different types should be merged ("unified" / "common type"). The version to concatenate arrays ({{arrow::Concatenate}}) currently requires all same-typed arrays. We could add a MergeOptions option here as well? (this depends on ARROW-14705 to make this option more useful, currently it only handles null -> any upcasts, I think) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15552) [Docs][Format] Unclear wording about base64 encoding requirement of metadata values
Joris Van den Bossche created ARROW-15552: - Summary: [Docs][Format] Unclear wording about base64 encoding requirement of metadata values Key: ARROW-15552 URL: https://issues.apache.org/jira/browse/ARROW-15552 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Format Reporter: Joris Van den Bossche The C Data Interface docs indicate that the values in key-value metadata should be base64 encoded, which is mentioned in the section about which key-value metadata to use for extension types (https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays): bq. The base64 encoding of metadata values ensures that any possible serialization is representable. This might not be fully correct, though (or at least not required, which is implied with the current wording). While a binary blob (like a serialized schema) can be base64 encoded, as we do when putting the Arrow schema in the Parquet metadata, this is not required? cc [~apitrou] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15548) [C++][Parquet] Field-level metadata are not supported? (ColumnMetadata.key_value_metadata)
Joris Van den Bossche created ARROW-15548: - Summary: [C++][Parquet] Field-level metadata are not supported? (ColumnMetadata.key_value_metadata) Key: ARROW-15548 URL: https://issues.apache.org/jira/browse/ARROW-15548 Project: Apache Arrow Issue Type: Improvement Components: C++, Parquet Reporter: Joris Van den Bossche For an application where we are considering using field-level metadata (so not schema-level metadata), but also want to be able to save this data to Parquet, I was looking into "field-level metadata" for Parquet, which I assumed we supported. We can roundtrip Arrow's field-level metadata to/from Parquet, as shown with this example:

{code:python}
schema = pa.schema([pa.field("column_name", pa.int64(), metadata={"key": "value"})])
table = pa.table({'column_name': [0, 1, 2]}, schema=schema)
pq.write_table(table, "test_field_metadata.parquet")

>>> pq.read_table("test_field_metadata.parquet").schema
column_name: int64
  -- field metadata --
  key: 'value'
{code}

However, the reason this is restored is actually that it is stored in the Arrow schema that we (by default) store in the {{ARROW:schema}} metadata in the Parquet FileMetaData.key_value_metadata. With a small patched version to be able to turn this off (currently this is hardcoded to be turned on in the python bindings), it is clear this field-level metadata is not restored on roundtrip without this stored arrow schema:

{code:python}
pq.write_table(table, "test_field_metadata_without_schema.parquet", store_arrow_schema=False)

>>> pq.read_table("test_field_metadata_without_schema.parquet").schema
column_name: int64
{code}

So there is currently no mapping from Arrow's field level metadata to Parquet's column-level metadata ({{ColumnMetaData.key_value_metadata}} in Parquet's thrift structures). (This also means that roundtripping field-level metadata to Parquet only works as long as you are using Arrow for both writing and reading, but not if you want to exchange such data with non-Arrow Parquet implementations.) In addition, it also seems we don't even expose this field in our C++ or Python bindings, so you can't even access that data if you have a Parquet file (written by another implementation) that has key_value_metadata in the ColumnMetaData. cc [~emkornfield] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15545) [C++] Cast dictionary of extension type to extension type
Joris Van den Bossche created ARROW-15545: - Summary: [C++] Cast dictionary of extension type to extension type Key: ARROW-15545 URL: https://issues.apache.org/jira/browse/ARROW-15545 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche We support casting a DictionaryArray to its dictionary values' type. For example: {code} >>> arr = pa.array([1, 2, 1]).dictionary_encode() >>> arr -- dictionary: [ 1, 2 ] -- indices: [ 0, 1, 0 ] >>> arr.type DictionaryType(dictionary) >>> arr.cast(arr.type.value_type) [ 1, 2, 1 ] {code} However, if the type of the dictionary values is an ExtensionType, this cast is not supported: {code} >>> from pyarrow.tests.test_extension_type import UuidType >>> storage = pa.array([b"0123456789abcdef"], type=pa.binary(16)) >>> arr = pa.ExtensionArray.from_storage(UuidType(), storage) >>> arr [ 30313233343536373839616263646566 ] >>> dict_arr = pa.DictionaryArray.from_arrays(pa.array([0, 0], pa.int32()), arr) >>> dict_arr.type DictionaryType(dictionary>, indices=int32, ordered=0>) >>> dict_arr.cast(UuidType()) ... ArrowNotImplementedError: Unsupported cast from dictionary>, indices=int32, ordered=0> to extension> (no available cast function for target type) ../src/arrow/compute/cast.cc:119 GetCastFunctionInternal(cast_options->to_type, args[0].type().get()) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
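A possible Python-side workaround while this cast is not implemented (a sketch only; whether {{take}} fully supports a given extension type may vary):

{code:python}
import pyarrow as pa

def decode_dictionary(arr: pa.DictionaryArray) -> pa.Array:
    # Gather the (extension-typed) dictionary values by the indices instead
    # of going through the cast kernel.
    return arr.dictionary.take(arr.indices)
{code}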
[jira] [Created] (ARROW-15479) [C++] Cast to fixed size list with different field name
Joris Van den Bossche created ARROW-15479: - Summary: [C++] Cast to fixed size list with different field name Key: ARROW-15479 URL: https://issues.apache.org/jira/browse/ARROW-15479 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Casting a FixedSizeListArray to a compatible type but only a different field name isn't implemented: {code:python} >>> my_type = pa.list_(pa.field("element", pa.int64()), 2) >>> arr = pa.FixedSizeListArray.from_arrays(pa.array([1, 2, 3, 4, 5, 6]), 2) >>> arr.type FixedSizeListType(fixed_size_list[2]) >>> my_type FixedSizeListType(fixed_size_list[2]) >>> arr.cast(my_type) ... ArrowNotImplementedError: Unsupported cast from fixed_size_list[2] to fixed_size_list using function cast_fixed_size_list {code} While the similar operation with a variable sized list actually works: {code:python} >>> my_type = pa.list_(pa.field("element", pa.int64())) >>> arr = pa.array([[1, 2], [3, 4]], pa.list_(pa.int64())) >>> arr.type ListType(list) >>> arr.cast(my_type).type ListType(list) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15478) [C++] Creating (or casting to) list array with non-nullable field doesn't check nulls
Joris Van den Bossche created ARROW-15478: - Summary: [C++] Creating (or casting to) list array with non-nullable field doesn't check nulls Key: ARROW-15478 URL: https://issues.apache.org/jira/browse/ARROW-15478 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche When creating a ListArray where you indicate that the values field is not nullable, you can actually create the array with nulls without this being validated:

{code:python}
>>> typ = pa.list_(pa.field("element", pa.int64(), nullable=False))
>>> arr = pa.array([[1, 2], [3, 4, None]], typ)
>>> arr
[ [ 1, 2 ], [ 3, 4, null ] ]
>>> arr.type
ListType(list)
{code}

Also explicitly validating it doesn't raise:

{code:python}
>>> arr.validate(full=True)
{code}

Is this something we should check? What guarantees do we attach to the nullability of a field of a nested type? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15477) [C++][Python] Enable ListArray::FromArrays with custom list type (field names/nullability)
Joris Van den Bossche created ARROW-15477: - Summary: [C++][Python] Enable ListArray::FromArrays with custom list type (field names/nullability) Key: ARROW-15477 URL: https://issues.apache.org/jira/browse/ARROW-15477 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Joris Van den Bossche Currently, when creating a ListArray from the values and offsets, you get a "default" list type:

{code:python}
>>> arr = pa.ListArray.from_arrays(pa.array([0, 2, 5], pa.int32()),
...                                pa.array([1, 2, 3, 4, 5]))
>>> arr
[ [ 1, 2 ], [ 3, 4, 5 ] ]
>>> arr.type
ListType(list)
{code}

So a type with default field name ("item") and nullability (true). We should allow specifying a type (that needs to be compatible with the passed values' type) so you can create a ListArray with specific field names. -- This message was sent by Atlassian Jira (v8.20.1#820001)
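Until such an argument exists, a workaround sketch is to construct the default-typed array and cast it afterwards (relying on the variable-size list cast with a different field name, which works per ARROW-15479):

{code:python}
import pyarrow as pa

offsets = pa.array([0, 2, 5], pa.int32())
values = pa.array([1, 2, 3, 4, 5])
arr = pa.ListArray.from_arrays(offsets, values)

# Cast to a list type with a custom field name.
target_type = pa.list_(pa.field("element", pa.int64()))
arr_with_custom_name = arr.cast(target_type)
{code}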
[jira] [Created] (ARROW-15455) [C++] Cast between fixed size list type and variable size list
Joris Van den Bossche created ARROW-15455: - Summary: [C++] Cast between fixed size list type and variable size list Key: ARROW-15455 URL: https://issues.apache.org/jira/browse/ARROW-15455 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Casting from fixed size list to variable size list could be possible, I think, but currently doesn't work: {code:python} >>> fixed_size = pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int64(), 2)) >>> fixed_size.cast(pa.list_(pa.int64())) ... ArrowNotImplementedError: Unsupported cast from fixed_size_list[2] to list using function cast_list {code} And in principle, a cast the other way around could also be possible if it is checked that each list has the correct length. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15394) [CI][Docs] Doxygen not run in the docs nightly build
Joris Van den Bossche created ARROW-15394: - Summary: [CI][Docs] Doxygen not run in the docs nightly build Key: ARROW-15394 URL: https://issues.apache.org/jira/browse/ARROW-15394 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Documentation Reporter: Joris Van den Bossche Discovered on the nightly dev docs that the C++ API pages are not working, because doxygen is not run in the nightly doc build -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15370) [Python] Regression in empty table to_pandas conversion
Joris Van den Bossche created ARROW-15370: - Summary: [Python] Regression in empty table to_pandas conversion Key: ARROW-15370 URL: https://issues.apache.org/jira/browse/ARROW-15370 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 7.0.0 Nightly integration tests with kartothek are failing, see eg https://github.com/ursacomputing/crossbow/runs/4863725914?check_suite_focus=true This seems to be something on our side, and a recent regression (the builds only started failing today, and I don't see other differences from the last working build yesterday) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15365) [Python] Expose full cast options in the pyarrow.compute.cast function
Joris Van den Bossche created ARROW-15365: - Summary: [Python] Expose full cast options in the pyarrow.compute.cast function Key: ARROW-15365 URL: https://issues.apache.org/jira/browse/ARROW-15365 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Currently, the {{pc.cast}} function has a {{safe=True/False}} option, which provides a shortcut to setting the cast options. But the actual kernel has more detailed options that can be tuned, and these are already exposed in the CastOptions class in python (allow_int_overflow, allow_time_truncate, ...). So we should ensure that we can pass such a CastOptions object to the {{cast}} kernel directly as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
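The kind of call this would enable, as a sketch (the {{options}} keyword on {{pc.cast}} is the requested addition, so its exact spelling is an assumption):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# CastOptions already exposes the fine-grained flags; the missing piece is
# being able to hand the whole object to pc.cast directly.
options = pc.CastOptions(target_type=pa.int32(), allow_int_overflow=True)
result = pc.cast(pa.array([2**40]), options=options)
{code}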
[jira] [Created] (ARROW-15364) [Python][Doc] Update filesystem entry in read docstrings
Joris Van den Bossche created ARROW-15364: - Summary: [Python][Doc] Update filesystem entry in read docstrings Key: ARROW-15364 URL: https://issues.apache.org/jira/browse/ARROW-15364 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Reporter: Joris Van den Bossche In several docstrings (of orc.read_table, parquet.read_table/ParquetDataset/write_to_dataset), we have something like:

{code}
filesystem : FileSystem, default None
    If nothing passed, paths assumed to be found in the local on-disk filesystem.
{code}

but this is actually no longer up to date. If filesystem is not specified, it will be inferred from the path, which can be either a path on local disk or a URI. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15326) [CI][Gandiva] Ubuntu release build is failing with failing Gandiva tests
Joris Van den Bossche created ARROW-15326: - Summary: [CI][Gandiva] Ubuntu release build is failing with failing Gandiva tests Key: ARROW-15326 URL: https://issues.apache.org/jira/browse/ARROW-15326 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva, Continuous Integration Reporter: Joris Van den Bossche Fix For: 7.0.0 See eg https://github.com/ursacomputing/crossbow/runs/4799525079?check_suite_focus=true {code} The following tests FAILED: 66 - gandiva-internals-test (Failed) 67 - gandiva-precompiled-test (SEGFAULT) {code} cc [~vitor004] [~projjal] [~anthonylouis] (just tagging some people who recently contributed to gandiva, I am not familiar with this area myself) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15324) [C++][CI] HDFS test build is failing with segfault (TestLibHdfs::test_mv_rename)
Joris Van den Bossche created ARROW-15324: - Summary: [C++][CI] HDFS test build is failing with segfault (TestLibHdfs::test_mv_rename) Key: ARROW-15324 URL: https://issues.apache.org/jira/browse/ARROW-15324 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Joris Van den Bossche Fix For: 7.0.0 See eg https://github.com/ursacomputing/crossbow/runs/4799476838?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15323) [CI] Nightly spark integration builds are failing
Joris Van den Bossche created ARROW-15323: - Summary: [CI] Nightly spark integration builds are failing Key: ARROW-15323 URL: https://issues.apache.org/jira/browse/ARROW-15323 Project: Apache Arrow Issue Type: Bug Reporter: Joris Van den Bossche See eg - test-conda-python-3.7-spark-v3.1.2: URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2022-01-13-0-github-test-conda-python-3.7-spark-v3.1.2 - test-conda-python-3.8-spark-v3.2.0: URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2022-01-13-0-github-test-conda-python-3.8-spark-v3.2.0 - test-conda-python-3.9-spark-master: URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2022-01-13-0-github-test-conda-python-3.9-spark-master The error message: {code} Error: Failed to execute goal pl.project13.maven:git-commit-id-plugin:2.2.2:revision (for-jars) on project arrow-java-root: Could not complete Mojo execution... Unable to find commits until some tag: Walk failure. Missing commit 2ec4e999bfa1e54ea6933cb3857ea5edb4235919 -> [Help 1] Error: Error: To see the full stack trace of the errors, re-run Maven with the -e switch. Error: Re-run Maven using the -X switch to enable full debug logging. Error: Error: For more information about the errors and possible solutions, please read the following articles: Error: [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException {code} cc [~bryanc] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15321) [Dev][Archery] numpydoc validation doesn't check all class methods
Joris Van den Bossche created ARROW-15321: - Summary: [Dev][Archery] numpydoc validation doesn't check all class methods Key: ARROW-15321 URL: https://issues.apache.org/jira/browse/ARROW-15321 Project: Apache Arrow Issue Type: Bug Components: Developer Tools Reporter: Joris Van den Bossche From discussion at https://github.com/apache/arrow/pull/12076#discussion_r783810077 It seems that by default, it doesn't loop over all _methods_ of classes, but only module-level objects? For example, I notice that explicitly asking for {{pyarrow.Table.to_pandas}} catches some issues:

{code}
$ archery numpydoc pyarrow.Table.to_pandas --allow-rule PR10
INFO:archery:Running Python docstring linters
PR10: Parameter "categories" requires a space before the colon separating the parameter name and type
PR10: Parameter "use_threads" requires a space before the colon separating the parameter name and type
{code}

But with the default (check all of pyarrow) with {{archery numpydoc --allow-rule PR10}} it doesn't list those errors. cc [~kszucs] [~amol-] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15310) [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path?
Joris Van den Bossche created ARROW-15310: - Summary: [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path? Key: ARROW-15310 URL: https://issues.apache.org/jira/browse/ARROW-15310 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Joris Van den Bossche When you have a hive-style partitioned dataset, with our current {{dataset(..)}} API, it's relatively easy to mess up the inferred partitioning and get confusing results. For example, if you specify the partitioning field names with {{partitioning=[...]}} (which is not needed for hive style since those are inferred), we actually assume you want directory partitioning. This DirectoryPartitioning will then parse the hive-style file paths and take the full "key=value" as the data values for the field. And then, doing a filter can result in a confusing empty result (because "value" doesn't match "key=value"). I am wondering if we can't relatively cheaply detect this case, and eg give an informative warning about this to the user. Basically what happens is this: {code:python} >>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")])) >>> part.parse("part=a") {code} If the parsed value is a string that contains a "=" (and in this case also contains the field name), that is I think a clear sign that (in the large majority of cases) the user is doing something wrong. I am not fully sure where and at what stage the check could be done though. Doing it for every path in the dataset might be too costly. Illustrative code example: {code:python} import pyarrow as pa import pyarrow.parquet as pq import pyarrow.dataset as ds import pathlib ## constructing a small dataset with 1 hive-style partitioning level basedir = pathlib.Path(".") / "dataset_wrong_partitioning" basedir.mkdir(exist_ok=True) (basedir / "part=a").mkdir(exist_ok=True) (basedir / "part=b").mkdir(exist_ok=True) table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]}) pq.write_table(table1, basedir / "part=a" / "data.parquet") table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]}) pq.write_table(table2, basedir / "part=b" / "data.parquet") {code} Reading as is (not specifying a partitioning, so default to no partitioning) will at least give an error about a missing field: {code: python} >>> dataset = ds.dataset(basedir) >>> dataset.to_table(filter=ds.field("part") == "a") ... ArrowInvalid: No match for FieldRef.Name(part) in a: int64 {code} But specifying the partitioning field name (which currently gets (silently) interpreted as directory partitioning) gives a confusing empty result: {code:python} >>> dataset = ds.dataset(basedir, partitioning=["part"]) >>> dataset.to_table(filter=ds.field("part") == "a") pyarrow.Table a: int64 b: int64 part: string a: [] b: [] part: [] {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15307) [C++][Dataset] Provide more context in error message if cast fails during scanning
Joris Van den Bossche created ARROW-15307: - Summary: [C++][Dataset] Provide more context in error message if cast fails during scanning Key: ARROW-15307 URL: https://issues.apache.org/jira/browse/ARROW-15307 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche If you have a partitioned dataset, and in one of the files there is a column with a mismatching type that cannot be safely cast to the dataset schema's type for that column, you get (as expected) an error about this cast. Small illustrative example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pathlib

## constructing a small dataset with two files
basedir = pathlib.Path(".") / "dataset_test_mismatched_schema"
basedir.mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "data1.parquet")

table2 = pa.table({'a': [1.5, 2.0, 3.0], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "data2.parquet")

## reading the dataset
dataset = ds.dataset(basedir)
# by default infer dataset schema from first file
dataset.schema
# actually reading gives expected error
dataset.to_table()
{code}

gives

{code:python}
>>> dataset.schema
a: int64
b: int64

>>> dataset.to_table()
---
ArrowInvalid   Traceback (most recent call last)
in
     22 dataset.schema
     23 # actually reading gives expected error
---> 24 dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Float value 1.5 was truncated converting to int64
../src/arrow/compute/kernels/scalar_cast_numeric.cc:177  CheckFloatToIntTruncation(batch[0], *out)
../src/arrow/compute/exec.cc:700  kernel_->exec(kernel_ctx_, batch, )
../src/arrow/compute/exec.cc:641  ExecuteBatch(batch, listener)
../src/arrow/compute/function.cc:248  executor->Execute(implicitly_cast_args, )
../src/arrow/compute/exec/expression.cc:444  compute::Cast(column, field->type(), compute::CastOptions::Safe())
../src/arrow/dataset/scanner.cc:816  compute::MakeExecBatch(*scan_options->dataset_schema, partial.record_batch.value)
{code}

So the actual error message (without the extra C++ context) is only *"ArrowInvalid: Float value 1.5 was truncated converting to int64"*. So this error message only says something about the two types and the first value that cannot be cast, but if you have a large dataset with many fragments and/or many columns, it can be hard to know 1) for which column this is failing and 2) for which fragment it is failing. So it would be nice to add some extra context to the error message. The cast itself of course doesn't know it, but I suppose when doing the cast in the scanner code, there at least we know eg the physical schema and dataset schema, so we could append or prepend the error message with something like "Casting from schema1 to schema2 failed with ...". cc [~alenkaf] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15137) [Dev] Update archery crossbow latest-prefix to work with nightly dates
Joris Van den Bossche created ARROW-15137: - Summary: [Dev] Update archery crossbow latest-prefix to work with nightly dates Key: ARROW-15137 URL: https://issues.apache.org/jira/browse/ARROW-15137 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15131) [Python] Coerce value_set argument to array in "is_in" kernel
Joris Van den Bossche created ARROW-15131: - Summary: [Python] Coerce value_set argument to array in "is_in" kernel Key: ARROW-15131 URL: https://issues.apache.org/jira/browse/ARROW-15131 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Small example I ran into:

{code:python}
>>> arr = pa.array(['a', 'b', 'c', 'd'])
>>> pc.is_in(arr, ['a', 'c'])
...
TypeError: "['a', 'c']" is not a valid value set
{code}

That's not a super friendly error message (it was not directly clear what is not "valid" about this). Passing {{pa.array(['a', 'c'])}} explicitly works, but I expected that the kernel would try this automatically (as we also convert the first array argument to an array). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15117) [Docs] Splitting the sphinx-based Arrow docs into separate sphinx projects
Joris Van den Bossche created ARROW-15117: - Summary: [Docs] Splitting the sphinx-based Arrow docs into separate sphinx projects Key: ARROW-15117 URL: https://issues.apache.org/jira/browse/ARROW-15117 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Joris Van den Bossche Fix For: 7.0.0 See the mailing list (https://mail-archives.apache.org/mod_mbox/arrow-dev/202112.mbox/%3CCALQtMBbiasQtXYc46kpw-TyQ-TQSPjNQ5%2BkoREuKvJ3hJSdWjw%40mail.gmail.com%3E) and this google doc (https://docs.google.com/document/d/1AXDNwU5CSnZ1cSeUISwy_xgvTzoYWeuqWApC8UEv97Q/edit?usp=sharing) for more context. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15077) [Python] Move Expression class from _dataset to _compute cython module
Joris Van den Bossche created ARROW-15077: - Summary: [Python] Move Expression class from _dataset to _compute cython module Key: ARROW-15077 URL: https://issues.apache.org/jira/browse/ARROW-15077 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 7.0.0 Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche To follow the move in the C++ code base, and to make it easier to implement ARROW-12060 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15043) [Python][Docs] Update type conversion table for pandas <-> arrow
Joris Van den Bossche created ARROW-15043: - Summary: [Python][Docs] Update type conversion table for pandas <-> arrow Key: ARROW-15043 URL: https://issues.apache.org/jira/browse/ARROW-15043 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 7.0.0 From the mailing list: the table at https://arrow.apache.org/docs/python/pandas.html#pandas-arrow-conversion is not fully up to date. For example, it doesn't include {{datetime.time}} conversion to {{time64}} type. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15042) [Python] Consolidate shared methods of RecordBatch and Table
Joris Van den Bossche created ARROW-15042: - Summary: [Python] Consolidate shared methods of RecordBatch and Table Key: ARROW-15042 URL: https://issues.apache.org/jira/browse/ARROW-15042 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche RecordBatch and Table have a bunch of similar methods that don't directly interact with the C++ pointer, and thus that could be shared in a common base class. In addition, we also have some methods on Table that would also be useful for RecordBatch (eg {{cast}}, {{group_by}}, {{drop}}, {{select}}, {{sort_by}}, {{rename_columns}}), which could also be shared with a common mixin. -- This message was sent by Atlassian Jira (v8.20.1#820001)
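A rough sketch of the shared-base-class idea (the class name and method body are illustrative only, not the actual pyarrow layout):

{code:python}
import pyarrow.compute as pc

class _Tabular:
    # Methods written purely against the shared public API (schema, take,
    # column access) can live here once and serve both RecordBatch and Table.
    def sort_by(self, sorting):
        if isinstance(sorting, str):
            sorting = [(sorting, "ascending")]
        indices = pc.sort_indices(self, sort_keys=sorting)
        return self.take(indices)
{code}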
[jira] [Created] (ARROW-14990) [CI] Nightly integration for dask is failing because of missing pandas dependency
Joris Van den Bossche created ARROW-14990: - Summary: [CI] Nightly integration for dask is failing because of missing pandas dependency Key: ARROW-14990 URL: https://issues.apache.org/jira/browse/ARROW-14990 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Joris Van den Bossche See https://github.com/apache/arrow/pull/11816#discussion_r762961951 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14967) [CI][Python] Ability to include pip packages in the conda environments
Joris Van den Bossche created ARROW-14967: - Summary: [CI][Python] Ability to include pip packages in the conda environments Key: ARROW-14967 URL: https://issues.apache.org/jira/browse/ARROW-14967 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Python Reporter: Joris Van den Bossche For creating various conda environments, we currently have files like {{conda_env_cpp.txt}}, {{conda_env_sphinx.txt}}, {{conda_env_python.txt}}, etc. Those can then be combined to create a specific conda environment with the subset of features you want, eg from the python docs:

{code}
conda create -y -n pyarrow-dev -c conda-forge \
    --file arrow/ci/conda_env_unix.txt \
    --file arrow/ci/conda_env_cpp.txt \
    --file arrow/ci/conda_env_python.txt \
    --file arrow/ci/conda_env_gandiva.txt \
    compilers \
    python=3.9 \
    pandas
{code}

or installed as additional packages into an existing one (eg {{conda install --file arrow/ci/conda_env_python.txt}} in conda-python.dockerfile). One disadvantage of this approach is that you cannot (as far as I am aware) list pip packages in those .txt files. You can do that with environment.yml files, but those then don't really compose together as we do with the txt files, I think. cc [~kszucs] -- This message was sent by Atlassian Jira (v8.20.1#820001)