[jira] [Created] (ARROW-12827) [C++] [Dataset] Review error pass-through in the datasets API
Weston Pace created ARROW-12827:
------------------------------------

             Summary: [C++] [Dataset] Review error pass-through in the datasets API
                 Key: ARROW-12827
                 URL: https://issues.apache.org/jira/browse/ARROW-12827
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace

There is at least one place (and I think there are actually several) in the datasets API where we bubble up errors without attaching the necessary context. For example, in the discussion here [https://github.com/apache/arrow/pull/10326#pullrequestreview-662095548] a call to "DatasetFactory::Create" (where the user incorrectly assigned a default file format of parquet) returns "Parquet magic bytes not found in footer" instead of something like "Dataset creation failed. The fragment '/2019/July/myfile.csv' did not match the expected 'parquet' format: Parquet magic bytes not found in footer".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
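As a rough sketch of the proposed behaviour (plain Python with a hypothetical helper name; the real fix would live in the C++ datasets layer), the context-enriched message could be assembled like:

```python
# Hypothetical helper: wrap a low-level error with dataset-level context
# before bubbling it up. Names and message shape are illustrative only.
def with_fragment_context(error, fragment_path, expected_format):
    """Prefix a low-level error with the fragment that produced it."""
    return (
        f"Dataset creation failed. The fragment '{fragment_path}' did not "
        f"match the expected '{expected_format}' format: {error}"
    )

msg = with_fragment_context(
    "Parquet magic bytes not found in footer",
    "/2019/July/myfile.csv",
    "parquet",
)
```

The point is simply that the low-level error text survives at the end of the message while the dataset-level context is prepended, so neither layer's information is lost.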
[jira] [Created] (ARROW-12826) [R] [CI] Add caching to revdepchecks
Jonathan Keane created ARROW-12826:
------------------------------------

             Summary: [R] [CI] Add caching to revdepchecks
                 Key: ARROW-12826
                 URL: https://issues.apache.org/jira/browse/ARROW-12826
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration, R
            Reporter: Jonathan Keane

With ARROW-12569 we added a (manual) reverse dependency check job. This runs fine (if slow) for a one-off run. It should be possible to cache between runs. There are a few issues with this currently:

* {revdepcheck} does not (yet) [support only running new runs|https://github.com/r-lib/revdepcheck/issues/94]
* The cache doesn't cover some of the longest-running tasks (installing the reverse dependencies)
* If we cache the revdeps directory, we will need to re-add packages that should be re-checked.

We should investigate contributing to revdepcheck to resolve the run-only-new issue and possibly also add features for caching the installations (and only change when the crancache is invalidated / finds a new package?).

https://github.com/HenrikBengtsson/revdepcheck.extras might also be helpful.

For posterity, the following is ~what we would need to add to dev/tasks/r/github.linux.revdepcheck.yml

```
  - name: Cache crancache and revdeps directory
    uses: actions/cache@v2
    with:
      key: {{ "r-revdep-cache-${{ some-way-to-get-arrow-version }}" }}
      path: |
        arrow/r/revdep
        arrow/.crancache
```
[jira] [Created] (ARROW-12825) [Python] PyArrow doesn't compile on upcoming Cython version
Alessandro Molina created ARROW-12825:
------------------------------------

             Summary: [Python] PyArrow doesn't compile on upcoming Cython version
                 Key: ARROW-12825
                 URL: https://issues.apache.org/jira/browse/ARROW-12825
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 4.0.0
            Reporter: Alessandro Molina
             Fix For: 5.0.0

Trying to build PyArrow against the current master checkout of Cython results in some compile errors on {{for}} loops.

{code}
Error compiling Cython file:
...
    def column_types(self):
        """
        Explicitly map column names to column types.
        """
        d = {frombytes(item.first): pyarrow_wrap_data_type(item.second)
             for item in self.options.column_types}
                        ^
pyarrow/_csv.pyx:491:25: Cannot assign type 'pair[string,shared_ptr[CDataType]]' to 'shared_ptr[CDataType]'
{code}

It seems that Cython is going to be less permissive about autodetecting the type of iterated items; this can probably be fixed by explicitly declaring the types.
[jira] [Created] (ARROW-12824) [R][CI] Upgrade builds for R 4.1 release
Neal Richardson created ARROW-12824:
------------------------------------

             Summary: [R][CI] Upgrade builds for R 4.1 release
                 Key: ARROW-12824
                 URL: https://issues.apache.org/jira/browse/ARROW-12824
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Continuous Integration, R
            Reporter: Neal Richardson
             Fix For: 5.0.0

Also add 3.6 to the test-r-versions matrix, and possibly move the Rtools35 build to crossbow.
[jira] [Created] (ARROW-12823) [Parquet][Python] Read and write file/column metadata using pandas attrs
Alan Snow created ARROW-12823:
------------------------------------

             Summary: [Parquet][Python] Read and write file/column metadata using pandas attrs
                 Key: ARROW-12823
                 URL: https://issues.apache.org/jira/browse/ARROW-12823
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Parquet, Python
            Reporter: Alan Snow

Related: https://github.com/pandas-dev/pandas/issues/20521

What are the general thoughts on using [DataFrame.attrs|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.attrs.html#pandas-dataframe-attrs] and [Series.attrs|https://pandas.pydata.org/pandas-docs/stable//reference/api/pandas.Series.attrs.html#pandas-series-attrs] for reading and writing metadata to/from parquet?

For example, here is how the metadata would be written:

{code:python}
pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
pdf.to_parquet("file.parquet")
{code}

Then, when loading in the data:

{code:python}
pdf = pandas.read_parquet("file.parquet")
pdf.attrs
{code}

{"name": "my custom dataset"}

{code:python}
pdf.a.attrs
{code}

{"long_name": "Description about data", "nodata": -1, "units": "metre"}
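For what it's worth, a minimal standard-library sketch of the mechanics: attrs would be serialized to a JSON blob under some key in the Parquet file/column key-value metadata (the key name here is made up; the real integration would go through pyarrow's Parquet metadata handling).

```python
import json

# Illustrative only: attrs round-tripped as JSON bytes under a custom
# metadata key, mimicking how pandas already stores its own metadata
# under the b"pandas" key in Parquet files.
def encode_attrs(attrs):
    return json.dumps(attrs).encode("utf-8")

def decode_attrs(raw):
    return json.loads(raw.decode("utf-8"))

file_metadata = {b"pandas_attrs": encode_attrs({"name": "my custom dataset"})}
column_metadata = {b"pandas_attrs": encode_attrs(
    {"long_name": "Description about data", "nodata": -1, "units": "metre"})}

restored = decode_attrs(file_metadata[b"pandas_attrs"])
```

One open question with any such scheme is that attrs values must be JSON-serializable (or need a richer encoding) for the round trip to be lossless.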
[jira] [Created] (ARROW-12822) [CI] Consider sending the nightly build report in HTML format
Krisztian Szucs created ARROW-12822:
------------------------------------

             Summary: [CI] Consider sending the nightly build report in HTML format
                 Key: ARROW-12822
                 URL: https://issues.apache.org/jira/browse/ARROW-12822
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration
            Reporter: Krisztian Szucs

We have an increasing number of nightly builds, which makes the nightly report harder to read, not to mention the long URLs, which are hard to express in a plaintext email.

The Apache mailing lists prefer plaintext format [1], though we could make an exception for the nightly build report (assuming it wouldn't bounce).

[1]: https://infra.apache.org/contrib-email-tips#nohtml
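One way to hedge the plaintext-preference concern is a multipart/alternative message, which carries a plaintext fallback alongside the HTML part. A minimal sketch with Python's standard library (subject and URL are placeholders):

```python
from email.message import EmailMessage

# Build a report with a plaintext body and an HTML alternative; clients
# and archives that prefer plaintext still get a readable fallback.
msg = EmailMessage()
msg["Subject"] = "[NIGHTLY] Crossbow build report"
msg.set_content(
    "FAILED: test-conda-python-3.7\nhttps://example.com/build/123\n"
)
msg.add_alternative(
    "<html><body><p>FAILED: "
    "<a href='https://example.com/build/123'>test-conda-python-3.7</a>"
    "</p></body></html>",
    subtype="html",
)
```

After `add_alternative`, the message's content type becomes multipart/alternative, with the plaintext part listed first (lowest preference) as the MIME convention expects.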
[jira] [Created] (ARROW-12821) [CI] Include the first occurrence of a task failure in the nightly report
Krisztian Szucs created ARROW-12821:
------------------------------------

             Summary: [CI] Include the first occurrence of a task failure in the nightly report
                 Key: ARROW-12821
                 URL: https://issues.apache.org/jira/browse/ARROW-12821
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration
            Reporter: Krisztian Szucs
             Fix For: 5.0.0

It would be useful if the nightly report helped identify when a regression was introduced, i.e. when a nightly build first started to fail. We can automate this during report generation by filtering the crossbow branches for the task's name, like on the GitHub UI:

https://github.com/ursacomputing/crossbow/branches/all?query=test-conda-python-3.7-turbodbc-latest

To avoid hitting the GitHub rate limit, we could restrict this search to the last one or two weeks and mark the older issues as persistent.
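The filtering step might look roughly like this (a sketch only: the branch-name layout assumed below is illustrative, not the actual crossbow naming scheme, and `failed` stands in for a real status lookup):

```python
from datetime import date, timedelta

def first_failure(branches, task_name, failed, today, max_age_days=14):
    """Return the date of the earliest failing run of `task_name` within
    the look-back window, or None if the failure predates the window
    (i.e. the task should be marked as a persistent failure)."""
    cutoff = today - timedelta(days=max_age_days)
    candidates = []
    for name in branches:
        if not name.endswith(task_name):
            continue
        # assumed layout: nightly-YYYY-MM-DD-0-github-<task-name>
        run_date = date.fromisoformat(name.split("-0-")[0][len("nightly-"):])
        if run_date >= cutoff and failed(name):
            candidates.append(run_date)
    return min(candidates) if candidates else None

branches = [
    "nightly-2021-05-15-0-github-test-conda-python-3.7-turbodbc-latest",
    "nightly-2021-05-16-0-github-test-conda-python-3.7-turbodbc-latest",
]
first = first_failure(
    branches,
    "test-conda-python-3.7-turbodbc-latest",
    failed=lambda name: "2021-05-16" in name,  # stand-in status check
    today=date(2021, 5, 17),
)
```

Capping the window before querying statuses is what keeps the per-report API call count bounded regardless of how long a task has been failing.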
[jira] [Created] (ARROW-12820) [C++] Strptime ignores timezone information
Rok Mihevc created ARROW-12820:
------------------------------------

             Summary: [C++] Strptime ignores timezone information
                 Key: ARROW-12820
                 URL: https://issues.apache.org/jira/browse/ARROW-12820
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Rok Mihevc

ParseTimestampStrptime currently ignores timezone information, so timestamps are read as if they were all in UTC. This can be unexpected. See [discussion|https://github.com/apache/arrow/pull/10334#discussion_r634269138] for details.

It would be useful to either capture the timezone information or convert the timestamp to UTC when parsing it.
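For reference, Python's own strptime illustrates the two behaviours proposed (capture the offset, or normalize to UTC), as opposed to dropping the offset entirely:

```python
from datetime import datetime, timezone

# "%z" captures the "+0200" offset into an aware datetime...
parsed = datetime.strptime("2021-05-17 12:00:00+0200", "%Y-%m-%d %H:%M:%S%z")

# ...after which the timestamp can be normalized to UTC explicitly.
as_utc = parsed.astimezone(timezone.utc)
```

Ignoring the offset instead (as the current Arrow parser does) would read the wall-clock value 12:00 as if it were already UTC, two hours off from the instant the string actually denotes.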
[jira] [Created] (ARROW-12819) [CI] Include build log's url in the nightly crossbow report
Krisztian Szucs created ARROW-12819:
------------------------------------

             Summary: [CI] Include build log's url in the nightly crossbow report
                 Key: ARROW-12819
                 URL: https://issues.apache.org/jira/browse/ARROW-12819
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration
            Reporter: Krisztian Szucs
             Fix For: 5.0.0

The github statuses/checks API contains additional context about the builds, including the build log's URL (though this may depend on the actual CI service). We should extend the nightly report to include the build log's URL only for the failing tasks.
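In the commit statuses API this is presumably the `target_url` field; a small sketch of filtering it down to failing tasks (payload abbreviated, URLs are placeholders):

```python
# Abbreviated commit-status payload, as returned per context by the
# GitHub statuses API (real payloads carry more fields).
statuses = [
    {"context": "test-conda-python-3.7", "state": "failure",
     "target_url": "https://example.com/runs/1"},
    {"context": "test-r-minimal", "state": "success",
     "target_url": "https://example.com/runs/2"},
]

# Keep the log link only for tasks that did not succeed.
failure_links = {
    s["context"]: s["target_url"] for s in statuses if s["state"] != "success"
}
```

Whether `target_url` actually points at a usable log page varies by CI service, which matches the caveat in the issue.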
[jira] [Created] (ARROW-12818) Int64Array can not be casted to DoubleArray?
lf-shaw created ARROW-12818:
------------------------------------

             Summary: Int64Array can not be casted to DoubleArray?
                 Key: ARROW-12818
                 URL: https://issues.apache.org/jira/browse/ARROW-12818
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 4.0.0
            Reporter: lf-shaw

In numpy, we can cast int64 to float64, but in pyarrow we can't.

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# timestamp
dt = pd.date_range('2021-01-01', periods=10)

# int64
arr = dt.asi8

# cast to float64
arr_double = arr.astype(np.float64)

# to arrow array
ts = pa.array(dt.asi8, type=pa.timestamp('ns'))

# to int64 array
ts_int64 = ts.cast(pa.int64())

# cast to float64
ts_double = ts_int64.cast(pa.float64())
```

The last line raises an exception:

```python
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
in
----> 1 pa.array(dt.asi8, type=pa.timestamp('ns')).cast(pa.int64()).cast(pa.float64())

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.Array.cast()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/compute.py in cast(arr, target_type, safe)
    287     else:
    288         options = CastOptions.unsafe(target_type)
--> 289     return call_function("cast", [arr], options)
    290
    291

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/_compute.pyx in pyarrow._compute.call_function()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Integer value 16094592000 not in range: -9007199254740992 to 9007199254740992
```
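The range in the error is ±2**53: the span over which float64 (with its 53-bit significand) represents every integer exactly. A safe cast refuses values outside it because the conversion could silently round; if lossy rounding is acceptable, an unsafe cast (pyarrow's `safe=False` option) is the usual workaround. A standard-library illustration of the bound:

```python
# float64 stores a 53-bit significand, so every integer with magnitude
# up to 2**53 converts exactly; beyond that, conversion may round.
limit = 2 ** 53             # 9007199254740992, the bound in the error message
exact = float(limit)        # exactly representable
rounded = float(limit + 1)  # 2**53 + 1 rounds back to 2**53
```

So the behaviour differs from numpy by design: numpy's astype rounds silently, while pyarrow's default safe cast reports the potential precision loss as an error.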