[jira] [Created] (ARROW-12827) [C++] [Dataset] Review error pass-through in the datasets API

2021-05-18 Thread Weston Pace (Jira)
Weston Pace created ARROW-12827:
---

 Summary: [C++] [Dataset] Review error pass-through in the datasets 
API
 Key: ARROW-12827
 URL: https://issues.apache.org/jira/browse/ARROW-12827
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


There is at least one place (and I think there are actually several) in the 
datasets API where we are bubbling up errors without attaching the necessary 
context.  For example, in the discussion here 
[https://github.com/apache/arrow/pull/10326#pullrequestreview-662095548] a call 
to "DatasetFactory::Create" (where the user incorrectly assigned a default file 
format of parquet) returns "Parquet magic bytes not found in footer" 
instead of something like "Dataset creation failed. The fragment 
'/2019/July/myfile.csv' did not match the expected 'parquet' format: Parquet 
magic bytes not found in footer".
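The desired wrapping can be sketched as follows (a minimal illustration in Python for brevity; "DatasetError" and "inspect_fragment" are hypothetical names, not Arrow's actual C++ API): the low-level format error is re-raised with the fragment path and expected format attached.

```python
# Hypothetical sketch of error pass-through with context attached.
class DatasetError(Exception):
    pass

def inspect_fragment(path, expected_format, parse):
    """Run a format-specific parser, wrapping any failure with dataset context."""
    try:
        return parse(path)
    except Exception as exc:
        raise DatasetError(
            f"Dataset creation failed. The fragment '{path}' did not match "
            f"the expected '{expected_format}' format: {exc}"
        ) from exc
```

The original low-level error stays visible at the end of the message, so nothing is lost by adding the context.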



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12826) [R] [CI] Add caching to revdepchecks

2021-05-18 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-12826:
--

 Summary: [R] [CI] Add caching to revdepchecks
 Key: ARROW-12826
 URL: https://issues.apache.org/jira/browse/ARROW-12826
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Jonathan Keane


With ARROW-12569 we added a (manual) reverse dependency check job. This runs 
fine (if slowly) for a one-off run. It should be possible to cache between runs. 
There are a few issues with this currently:

* {revdepcheck} does not (yet) [support only running new 
runs|https://github.com/r-lib/revdepcheck/issues/94]
* The cache doesn't cover some of the longest-running tasks (installing the 
reverse dependencies)
* If we cache the revdeps directory, we will need to re-add packages that 
should be re-checked.

We should investigate contributing to revdepcheck to resolve the run-only-new 
issue, and possibly also add features for caching the installations (only 
invalidating them when the crancache is invalidated or finds a new package?). 
https://github.com/HenrikBengtsson/revdepcheck.extras might also be helpful.

For posterity, the following is roughly what we would need to add to 
dev/tasks/r/github.linux.revdepcheck.yml:
```
  - name: Cache crancache and revdeps directory
    uses: actions/cache@v2
    with:
      key: {{ "r-revdep-cache-${{ some-way-to-get-arrow-version }}" }}
      path: |
        arrow/r/revdep
        arrow/.crancache
```





[jira] [Created] (ARROW-12825) [Python] PyArrow doesn't compile on upcoming Cython version

2021-05-18 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-12825:
-

 Summary: [Python] PyArrow doesn't compile on upcoming Cython 
version
 Key: ARROW-12825
 URL: https://issues.apache.org/jira/browse/ARROW-12825
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 4.0.0
Reporter: Alessandro Molina
 Fix For: 5.0.0


Trying to build PyArrow with the current master checkout of Cython results in 
some compile errors on {{for}} loops.

{code}
Error compiling Cython file:

...
def column_types(self):
"""
Explicitly map column names to column types.
"""
d = {frombytes(item.first): pyarrow_wrap_data_type(item.second)
 for item in self.options.column_types}
^


pyarrow/_csv.pyx:491:25: Cannot assign type 
'pair[string,shared_ptr[CDataType]]' to 'shared_ptr[CDataType]'
{code}

It seems that Cython is going to be less permissive about auto-detecting the 
type of iterated items; this can probably be fixed by explicitly declaring the 
types.
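One plausible fix (an untested sketch; the type names assume the cimports already present in pyarrow/_csv.pyx) is to declare the loop variable explicitly and unroll the comprehension:

{code}
cdef pair[c_string, shared_ptr[CDataType]] item
d = {}
for item in self.options.column_types:
    d[frombytes(item.first)] = pyarrow_wrap_data_type(item.second)
{code}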





[jira] [Created] (ARROW-12824) [R][CI] Upgrade builds for R 4.1 release

2021-05-18 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-12824:
---

 Summary: [R][CI] Upgrade builds for R 4.1 release
 Key: ARROW-12824
 URL: https://issues.apache.org/jira/browse/ARROW-12824
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Continuous Integration, R
Reporter: Neal Richardson
 Fix For: 5.0.0


Also add 3.6 to the test-r-versions matrix, and possibly move the Rtools35 
build to crossbow.





[jira] [Created] (ARROW-12823) [Parquet][Python] Read and write file/column metadata using pandas attrs

2021-05-18 Thread Alan Snow (Jira)
Alan Snow created ARROW-12823:
-

 Summary: [Parquet][Python] Read and write file/column metadata 
using pandas attrs
 Key: ARROW-12823
 URL: https://issues.apache.org/jira/browse/ARROW-12823
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Parquet, Python
Reporter: Alan Snow


Related: https://github.com/pandas-dev/pandas/issues/20521

What are the general thoughts on using 
[DataFrame.attrs|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.attrs.html#pandas-dataframe-attrs]
 and 
[Series.attrs|https://pandas.pydata.org/pandas-docs/stable//reference/api/pandas.Series.attrs.html#pandas-series-attrs]
 for reading and writing metadata to/from Parquet?

For example, here is how the metadata would be written:
{code:python}
pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": 
"metre"}
pdf.to_parquet("file.parquet"){code}

Then, when loading in the data:
{code:python}
pdf = pandas.read_parquet("file.parquet")
pdf.attrs{code}
{"name": "my custom dataset"}
{code:python}
pdf.a.attrs{code}
{"long_name": "Description about data", "nodata": -1, "units": "metre"}








[jira] [Created] (ARROW-12822) [CI] Consider sending the nightly build report in HTML format

2021-05-18 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-12822:
---

 Summary: [CI] Consider sending the nightly build report in HTML 
format
 Key: ARROW-12822
 URL: https://issues.apache.org/jira/browse/ARROW-12822
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs


We have an increasing number of nightly builds, which makes the nightly report 
harder to read, not to mention the long URLs, which are hard to express in a 
plaintext email.

The Apache mailing lists prefer plaintext format [1], though we could make an 
exception for the nightly build report (assuming it wouldn't bounce).

[1]: https://infra.apache.org/contrib-email-tips#nohtml





[jira] [Created] (ARROW-12821) [CI] Include the first occurrence of a task failure in the nightly report

2021-05-18 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-12821:
---

 Summary: [CI] Include the first occurrence of a task failure in 
the nightly report
 Key: ARROW-12821
 URL: https://issues.apache.org/jira/browse/ARROW-12821
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs
 Fix For: 5.0.0


It would be useful if the nightly report helped identify when a nightly build 
started to fail.
We can automate this during report generation by filtering the crossbow 
branches for the task's name, as on the GitHub UI: 
https://github.com/ursacomputing/crossbow/branches/all?query=test-conda-python-3.7-turbodbc-latest

To avoid hitting the GitHub rate limit, we could restrict this search to the 
last one or two weeks and mark older failures as persistent.
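The report-side logic could be sketched roughly like this (illustrative only; the branch names and statuses stand in for whatever the crossbow branch query actually returns):

```python
# Given a task's nightly history ordered newest -> oldest, return the branch
# where the current failing streak began (None if the latest run succeeded).
def first_failure(history):
    first = None
    for branch, status in history:
        if status == "failure":
            first = branch
        else:
            break  # an older success ends the current failing streak
    return first
```

A real implementation would page through the branch query above and stop once it leaves the one-to-two-week window.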





[jira] [Created] (ARROW-12820) [C++] Strptime ignores timezone information

2021-05-18 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-12820:
--

 Summary: [C++] Strptime ignores timezone information
 Key: ARROW-12820
 URL: https://issues.apache.org/jira/browse/ARROW-12820
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Rok Mihevc


ParseTimestampStrptime currently ignores timezone information, so timestamps 
are read as if they were all in UTC. This can be unexpected. See 
[discussion|https://github.com/apache/arrow/pull/10334#discussion_r634269138] 
for details.
It would be useful to either capture the timezone information or convert the 
timestamp to UTC when parsing.
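For comparison, Python's own strptime honors the %z directive, and both proposed behaviors can be expressed with it:

```python
from datetime import datetime, timezone

# %z captures the offset, so the parsed timestamp is timezone-aware...
ts = datetime.strptime("2021-01-01 12:00:00+0100", "%Y-%m-%d %H:%M:%S%z")

# ...and converting to UTC afterwards is then a well-defined operation.
as_utc = ts.astimezone(timezone.utc)
```

Either capturing the offset (the first step) or normalizing to UTC (the second) would avoid silently misreading offset-bearing strings.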





[jira] [Created] (ARROW-12819) [CI] Include build log's url in the nightly crossbow report

2021-05-18 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-12819:
---

 Summary: [CI] Include build log's url in the nightly crossbow 
report
 Key: ARROW-12819
 URL: https://issues.apache.org/jira/browse/ARROW-12819
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs
 Fix For: 5.0.0


The GitHub statuses/checks API contains additional context about the builds, 
including the build log's URL (though this may depend on the actual CI service).

We should extend the nightly report to include the build log's URL, but only 
for the failing tasks.





[jira] [Created] (ARROW-12818) Int64Array can not be casted to DoubleArray?

2021-05-18 Thread lf-shaw (Jira)
lf-shaw created ARROW-12818:
---

 Summary: Int64Array can not be casted to DoubleArray?
 Key: ARROW-12818
 URL: https://issues.apache.org/jira/browse/ARROW-12818
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 4.0.0
Reporter: lf-shaw


In numpy, we can cast int64 to float64. But in pyarrow, we can't.

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# timestamp
dt = pd.date_range('2021-01-01', periods=10)

# int64
arr = dt.asi8

# cast to float64
arr_double = arr.astype(np.float64)

# to arrow array
ts = pa.array(dt.asi8, type=pa.timestamp('ns'))

# to int64 array
ts_int64 = ts.cast(pa.int64())

# cast to float64
ts_double = ts_int64.cast(pa.float64())
```

The last line raises an exception:

```python
---
ArrowInvalid Traceback (most recent call last)
 in 
> 1 pa.array(dt.asi8, 
type=pa.timestamp('ns')).cast(pa.int64()).cast(pa.float64())

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/array.pxi in 
pyarrow.lib.Array.cast()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/compute.py in cast(arr, 
target_type, safe)
 287 else:
 288 options = CastOptions.unsafe(target_type)
--> 289 return call_function("cast", [arr], options)
 290 
 291

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/_compute.pyx in 
pyarrow._compute.call_function()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/_compute.pyx in 
pyarrow._compute.Function.call()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/error.pxi in 
pyarrow.lib.check_status()

ArrowInvalid: Integer value 16094592000 not in range: -9007199254740992 
to 9007199254740992
```


