[jira] [Created] (ARROW-14600) [Docs]Fix broken link in Python Development page

2021-11-04 Thread Daijiro Fukuda (Jira)
Daijiro Fukuda created ARROW-14600:
--

 Summary: [Docs]Fix broken link in Python Development page
 Key: ARROW-14600
 URL: https://issues.apache.org/jira/browse/ARROW-14600
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation
Reporter: Daijiro Fukuda


The link in the third note is broken here:
https://arrow.apache.org/docs/developers/python.html#build-and-test

It's simply a matter of incorrect rst.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14599) [Release][Java] Upload .jar to Artifacts

2021-11-04 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-14599:


 Summary: [Release][Java] Upload .jar to Artifacts
 Key: ARROW-14599
 URL: https://issues.apache.org/jira/browse/ARROW-14599
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


This doesn't upload to any Maven repository; that is out of scope for this issue.





[jira] [Created] (ARROW-14598) [C++][Flight] protoc generation for example doesn't have correct dependencies

2021-11-04 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-14598:


 Summary: [C++][Flight] protoc generation for example doesn't have 
correct dependencies
 Key: ARROW-14598
 URL: https://issues.apache.org/jira/browse/ARROW-14598
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-14597) Github actions install r-arrow with snappy compression

2021-11-04 Thread Dyfan Jones (Jira)
Dyfan Jones created ARROW-14597:
---

 Summary: Github actions install r-arrow with snappy compression
 Key: ARROW-14597
 URL: https://issues.apache.org/jira/browse/ARROW-14597
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Dyfan Jones


Hi All,

I am having difficulty installing r-arrow with snappy compression on GitHub 
Actions. I have set the environment variable `ARROW_WITH_SNAPPY: ON` 
([https://github.com/DyfanJones/noctua/blob/0079bf997737516fd3e1b61dbde7510044f79a2f/.github/workflows/R-CMD-check.yaml]
 ). However, I get the following error in my unit tests:


{code:java}
Error: Error: NotImplemented: Support for codec 'snappy' not built
In order to read this file, you will need to reinstall arrow with 
additional features enabled.
Set one of these environment variables before installing:
 * LIBARROW_MINIMAL=false (for all optional features, including 'snappy')  
 * ARROW_WITH_SNAPPY=ON (for just 'snappy')

See https://arrow.apache.org/docs/r/articles/install.html for detail{code}

arrow version: 6.0.0.2

My PR showing the GitHub Actions issue: [https://github.com/DyfanJones/noctua/pull/169]

Any advice is much appreciated.





[jira] [Created] (ARROW-14596) [Python] parquet.read_table nested fields in columns does not work for use_legacy_dataset=False

2021-11-04 Thread Tom Scheffers (Jira)
Tom Scheffers created ARROW-14596:
-

 Summary: [Python] parquet.read_table nested fields in columns does 
not work for use_legacy_dataset=False
 Key: ARROW-14596
 URL: https://issues.apache.org/jira/browse/ARROW-14596
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Tom Scheffers
 Fix For: 5.0.0


Reading a nested field does not work with use_legacy_dataset=False.

This works:

 
{code:python}
import pyarrow.parquet as pq
t = pq.read_table(
 source=filename,  # path to the Parquet file
 columns=['store_key', 'properties.country'], 
 use_legacy_dataset=True,
).to_pandas()
{code}
This does not work (for the same parquet file):

 
{code:python}
import pyarrow.parquet as pq

t = pq.read_table(
 source=filename,  # path to the Parquet file
 columns=['store_key', 'properties.country'], 
 use_legacy_dataset=False,
).to_pandas(){code}
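
To make the expected result concrete, here is a minimal pure-Python sketch of what a dotted column name such as 'properties.country' is expected to select. It does not use pyarrow: `select_columns` and the sample rows are hypothetical, and the assumption that the parent struct is kept with only the requested child is mine, based on how the legacy code path appears to behave.

```python
def select_columns(rows, columns):
    """Project each row onto the requested columns.

    A dotted name such as 'properties.country' selects the 'country'
    child of the 'properties' struct and keeps the struct nesting.
    Hypothetical helper for illustration; not part of pyarrow.
    """
    selected = []
    for row in rows:
        out = {}
        for column in columns:
            parts = column.split(".")
            src, dst = row, out
            # Walk down the struct, creating the parent levels in the output.
            for part in parts[:-1]:
                src = src[part]
                dst = dst.setdefault(part, {})
            dst[parts[-1]] = src[parts[-1]]
        selected.append(out)
    return selected


# Made-up sample data standing in for the rows of the Parquet file.
rows = [
    {"store_key": 1, "properties": {"country": "NL", "city": "Utrecht"}},
    {"store_key": 2, "properties": {"country": "BE", "city": "Ghent"}},
]
result = select_columns(rows, ["store_key", "properties.country"])
```

With this reading, both values of use_legacy_dataset should return the same two columns for the same file.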
 





[jira] [Created] (ARROW-14595) [R] Set system defaults to auto

2021-11-04 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-14595:
--

 Summary: [R] Set system defaults to auto 
 Key: ARROW-14595
 URL: https://issues.apache.org/jira/browse/ARROW-14595
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Jonathan Keane
Assignee: Jonathan Keane
 Fix For: 7.0.0


We should not include this in 6.0.1, but enable it shortly afterwards so that it 
spends a long time on CI, where we can see whether anything breaks. We should also 
ensure we have all the CI we want for it. For example, do we want this plus an 
offline build? We should also ensure that [the 
system-only|https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L1063-L1071]
 dependency job tests all of our features.





[jira] [Created] (ARROW-14594) [R] Enable snappy by default

2021-11-04 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-14594:
--

 Summary: [R] Enable snappy by default
 Key: ARROW-14594
 URL: https://issues.apache.org/jira/browse/ARROW-14594
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Jonathan Keane
Assignee: Jonathan Keane








[GitHub] [arrow-testing] pitrou merged pull request #66: ARROW-14593: Add fuzz regression files

2021-11-04 Thread GitBox


pitrou merged pull request #66:
URL: https://github.com/apache/arrow-testing/pull/66


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (ARROW-14593) [C++] Fix crashes on invalid IPC file (OSS-Fuzz)

2021-11-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-14593:
--

 Summary: [C++] Fix crashes on invalid IPC file (OSS-Fuzz)
 Key: ARROW-14593
 URL: https://issues.apache.org/jira/browse/ARROW-14593
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Fix the following issues:
- https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39978
- https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40653






[jira] [Created] (ARROW-14592) [C++] list_parent_indices output type should not depend on input type

2021-11-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-14592:
--

 Summary: [C++] list_parent_indices output type should not depend 
on input type
 Key: ARROW-14592
 URL: https://issues.apache.org/jira/browse/ARROW-14592
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


Currently, the {{list_parent_indices}} output type is the input list offset 
type. But that doesn't really make sense: the parent indices are simply indices 
in the list array, they are not bounded by the offset size (even a list array 
with 32-bit offsets can have a length larger than 2**32, for example if it's 
mostly nulls).
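
The argument can be illustrated with a small pure-Python sketch. It is not the Arrow C++ kernel: `list_parent_indices` below is a hypothetical reimplementation that only assumes the standard list layout, where list i spans child values offsets[i] up to offsets[i + 1].

```python
def list_parent_indices(offsets):
    """Return, for each child value, the index of the list that holds it.

    `offsets` follows the Arrow list layout: list i spans the child
    values offsets[i] .. offsets[i + 1] - 1. Pure-Python illustration,
    not the Arrow C++ kernel.
    """
    parents = []
    for list_index in range(len(offsets) - 1):
        length = offsets[list_index + 1] - offsets[list_index]
        parents.extend([list_index] * length)
    return parents


# Six lists, four of them empty (or null): only three child values in total.
offsets = [0, 0, 0, 2, 2, 3, 3]
parents = list_parent_indices(offsets)
print(parents)  # [2, 2, 4]
```

The largest parent index (4) already exceeds the largest offset (3), because parent indices range over the number of lists while offsets range over the number of child values.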





[jira] [Created] (ARROW-14590) [R] Implement lubridate::week

2021-11-04 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14590:


 Summary: [R] Implement lubridate::week
 Key: ARROW-14590
 URL: https://issues.apache.org/jira/browse/ARROW-14590
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14591) [R] Implement lubridate duration types

2021-11-04 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14591:


 Summary: [R] Implement lubridate duration types
 Key: ARROW-14591
 URL: https://issues.apache.org/jira/browse/ARROW-14591
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14589) [CI][Go] CGo tests crash on Windows

2021-11-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-14589:
--

 Summary: [CI][Go] CGo tests crash on Windows
 Key: ARROW-14589
 URL: https://issues.apache.org/jira/browse/ARROW-14589
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Go
Reporter: Antoine Pitrou


They seem to crash quite consistently.

See e.g. https://github.com/apache/arrow/runs/4094939443?check_suite_focus=true

{code}
+ for d in $(go list ./... | grep -v vendor)
+ go test -tags assert,test,ccalloc github.com/apache/arrow/go/arrow
exit status 3221225785
FAIL	github.com/apache/arrow/go/arrow	0.119s
FAIL
Error: Process completed with exit code 1.
{code}
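
The exit status in the log is easier to search for in hexadecimal, since Windows reports NTSTATUS codes as unsigned 32-bit integers. The snippet below only does the conversion. Microsoft's NTSTATUS list names 0xC0000139 as STATUS_ENTRYPOINT_NOT_FOUND, which would point to a missing symbol in a loaded DLL, but that reading of the crash is an assumption and is not confirmed by the issue.

```python
exit_status = 3221225785  # value reported by `go test` in the log above

# Format as an 8-digit uppercase hexadecimal NTSTATUS code.
ntstatus_hex = f"0x{exit_status:08X}"
print(ntstatus_hex)  # prints 0xC0000139
```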






[jira] [Created] (ARROW-14588) Create an arrow-specific checklist for a CRAN release

2021-11-04 Thread Dragoș Moldovan-Grünfeld (Jira)
Dragoș Moldovan-Grünfeld created ARROW-14588:


 Summary: Create an arrow-specific checklist for a CRAN release  
 Key: ARROW-14588
 URL: https://issues.apache.org/jira/browse/ARROW-14588
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Dragoș Moldovan-Grünfeld
Assignee: Dragoș Moldovan-Grünfeld


This would adapt and implement the functionality of 
{{usethis::use_release_issue()}} for {{arrow}}'s specific context.





[jira] [Created] (ARROW-14587) [CI][Crossbow] Fetch a single crossbow branch instead of the full repo on Azure

2021-11-04 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-14587:
---

 Summary: [CI][Crossbow] Fetch a single crossbow branch instead of 
the full repo on Azure
 Key: ARROW-14587
 URL: https://issues.apache.org/jira/browse/ARROW-14587
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs
 Fix For: 7.0.0


Because the crossbow repository has a lot of references, the checkout step can 
take a long time; see build 
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=14952&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=5bbb8710-d4c1-5a8b-fc80-a388730cf6ac

We should alter the Azure crossbow template to explicitly check out the task's 
branch using the 
{{ {{ task.branch }} }} Jinja variable.

See azure documentation: 
https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/multi-repo-checkout?view=azure-devops#checking-out-a-specific-ref





[jira] [Created] (ARROW-14586) [R] summarise() with nested aggregate expressions has a confusing error

2021-11-04 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-14586:


 Summary: [R] summarise() with nested aggregate expressions has a 
confusing error
 Key: ARROW-14586
 URL: https://issues.apache.org/jira/browse/ARROW-14586
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Dewey Dunnington
Assignee: Dewey Dunnington


This affects code along the lines of {{summarise(mean(mean(var)))}}, where the 
inner expression is an aggregate function. Such code is probably not useful, but 
the error it gives is not particularly helpful:

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
RecordBatch$create(x = 4) %>% 
  summarise(y = mean(mean(x)))
#> Warning: Error in mean(..temp0) : object '..temp0' not found; pulling data into R
#> # A tibble: 1 × 1
#>       y
#>   <dbl>
#> 1     4
{code}
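
One way to produce a clearer error is to detect the nested aggregate before evaluation and report it by name. The sketch below shows the idea in Python, using the standard `ast` module on an expression string that happens to be valid Python syntax. It is not how the arrow R package is implemented: the `AGGREGATES` set, the function names, and the error wording are all hypothetical.

```python
import ast

AGGREGATES = {"mean", "sum", "min", "max", "n"}  # assumed set, for illustration


def find_nested_aggregate(expression):
    """Return (outer, inner) for the first aggregate call nested inside
    another aggregate call, or None if there is none."""
    tree = ast.parse(expression, mode="eval")

    def is_aggregate(node):
        return (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in AGGREGATES
        )

    for outer in ast.walk(tree):
        if not is_aggregate(outer):
            continue
        # Look for another aggregate call anywhere inside the arguments.
        for arg in outer.args:
            for inner in ast.walk(arg):
                if is_aggregate(inner):
                    return outer.func.id, inner.func.id
    return None


def check_summarise_expression(expression):
    """Raise a descriptive error if the expression nests aggregates."""
    nested = find_nested_aggregate(expression)
    if nested is not None:
        outer, inner = nested
        raise ValueError(
            f"Aggregate {inner}() nested inside {outer}() is not supported"
        )
```

Calling `check_summarise_expression("mean(mean(x))")` raises a `ValueError` that names both functions, whereas `check_summarise_expression("mean(x + 1)")` passes silently.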





[jira] [Created] (ARROW-14585) [C++] Detect gRPC::grpc++_reflection in FindgRPCAlt

2021-11-04 Thread David Li (Jira)
David Li created ARROW-14585:


 Summary: [C++] Detect gRPC::grpc++_reflection in FindgRPCAlt
 Key: ARROW-14585
 URL: https://issues.apache.org/jira/browse/ARROW-14585
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li
Assignee: David Li


A follow-up to ARROW-14440.





[jira] [Created] (ARROW-14584) [Python][CI] Python sdist installation fails with latest setuptools 58.5

2021-11-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-14584:
-

 Summary: [Python][CI] Python sdist installation fails with latest 
setuptools 58.5
 Key: ARROW-14584
 URL: https://issues.apache.org/jira/browse/ARROW-14584
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


See the output of the nightly builds: 
https://github.com/ursacomputing/crossbow/runs/4101736097?check_suite_focus=true
 (it might also be numpy that fails installing, but since that's a build 
requirement this thus also fails installing pyarrow's sdist)

Upstream issue: https://github.com/pypa/setuptools/issues/2849 (we see the same 
error {{AttributeError: get_data_files_without_manifest}})





[jira] [Created] (ARROW-14583) RStudio IDE crash

2021-11-04 Thread Zsolt Kegyes-Brassai (Jira)
Zsolt Kegyes-Brassai created ARROW-14583:


 Summary: RStudio IDE crash
 Key: ARROW-14583
 URL: https://issues.apache.org/jira/browse/ARROW-14583
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 6.0.0
 Environment: I am using a Windows 10 machine, R 4.1.0, up-to-date R 
packages, and the latest RStudio IDE.
Reporter: Zsolt Kegyes-Brassai


I was trying the new features introduced in the latest {{arrow (6.0.2)}} package, 
based on examples from the “New Directions for Apache Arrow” talk.

The RStudio IDE crashed and the R session was aborted.

Looking closely, I found that I had downloaded only 2 years of data (2018 & 2019), 
so after the first filter ({{year == 2015}}) no data remains to be processed 
further.

After some debugging, in which I replaced the collect() function, it turns out 
that {{summarize()}} is the function causing the crash.

 
{code:R}
as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", 
partitioning = c("year", "month")) %>%
  filter(total_amount > 100 & year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 5000) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect(){code}
 

I would expect to get an error message (without crashing the IDE) that can be 
handled in code.

An alternative result would be an empty data.frame, as happens when the 
parquet file is read in as a data.frame. I simulated this situation by setting 
a high {{total_amount}} value when filtering. Note: when using an Arrow table, 
an error message is generated.

 
{code:R}
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.1
#> Warning: package 'tidyr' was built under R version 4.1.1
#> Warning: package 'readr' was built under R version 4.1.1
library(arrow)
#> Warning: package 'arrow' was built under R version 4.1.1
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#> timestamp

read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
 as_data_frame = FALSE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()

#> Error: Invalid: Must pass at least one array


read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
 as_data_frame = TRUE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()

#> # A tibble: 0 x 3
#> # ... with 3 variables: passenger_count , avg_tip_pct , n 
{code}
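
The empty-result behaviour requested above can be sketched in plain Python. This is only an illustration of the expected semantics, not Arrow's query engine: `summarize_avg_tip_pct` and the sample rows are hypothetical, with column names taken from the report.

```python
from collections import defaultdict


def summarize_avg_tip_pct(rows, min_total_amount):
    """Filter, group by passenger_count, and average tip_pct per group."""
    groups = defaultdict(list)
    for row in rows:
        if row["total_amount"] > min_total_amount:
            tip_pct = row["tip_amount"] / row["total_amount"] * 100
            groups[row["passenger_count"]].append(tip_pct)
    # With no surviving rows there are no groups, so the result is simply empty.
    return [
        {"passenger_count": key, "avg_tip_pct": sum(vals) / len(vals), "n": len(vals)}
        for key, vals in sorted(groups.items())
    ]


# Made-up rows standing in for the taxi data.
rows = [
    {"tip_amount": 5.0, "total_amount": 20.0, "passenger_count": 1},
    {"tip_amount": 6.0, "total_amount": 12.0, "passenger_count": 2},
]
empty = summarize_avg_tip_pct(rows, min_total_amount=1e10)
print(empty)  # []
```

A filter that removes every row yields an empty list here, which matches the empty tibble produced in the data.frame case.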


