[jira] [Created] (ARROW-14600) [Docs]Fix broken link in Python Development page
Daijiro Fukuda created ARROW-14600:

Summary: [Docs]Fix broken link in Python Development page
Key: ARROW-14600
URL: https://issues.apache.org/jira/browse/ARROW-14600
Project: Apache Arrow
Issue Type: Bug
Components: Documentation
Reporter: Daijiro Fukuda

The link in the third note is broken here:
https://arrow.apache.org/docs/developers/python.html#build-and-test

It's simply a matter of incorrect rst.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14599) [Release][Java] Upload .jar to Artifacts
Kouhei Sutou created ARROW-14599:

Summary: [Release][Java] Upload .jar to Artifacts
Key: ARROW-14599
URL: https://issues.apache.org/jira/browse/ARROW-14599
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

This doesn't upload to any Maven repository; that is out of scope for this issue.
[jira] [Created] (ARROW-14598) [C++][Flight] protoc generation for example doesn't have correct dependencies
Kouhei Sutou created ARROW-14598:

Summary: [C++][Flight] protoc generation for example doesn't have correct dependencies
Key: ARROW-14598
URL: https://issues.apache.org/jira/browse/ARROW-14598
Project: Apache Arrow
Issue Type: Bug
Components: C++, FlightRPC
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
[jira] [Created] (ARROW-14597) Github actions install r-arrow with snappy compression
Dyfan Jones created ARROW-14597:

Summary: Github actions install r-arrow with snappy compression
Key: ARROW-14597
URL: https://issues.apache.org/jira/browse/ARROW-14597
Project: Apache Arrow
Issue Type: New Feature
Reporter: Dyfan Jones

Hi All,

I am having difficulty installing r-arrow with snappy compression on GitHub Actions. I have set the environment variable `ARROW_WITH_SNAPPY: ON` ([https://github.com/DyfanJones/noctua/blob/0079bf997737516fd3e1b61dbde7510044f79a2f/.github/workflows/R-CMD-check.yaml]). However, I get the following error in my unit tests:

{code:java}
Error: Error: NotImplemented: Support for codec 'snappy' not built
In order to read this file, you will need to reinstall arrow with additional features enabled.
Set one of these environment variables before installing:
 * LIBARROW_MINIMAL=false (for all optional features, including 'snappy')
 * ARROW_WITH_SNAPPY=ON (for just 'snappy')
See https://arrow.apache.org/docs/r/articles/install.html for detail
{code}

arrow version: 6.0.0.2

My PR showing the GitHub Actions issue: [https://github.com/DyfanJones/noctua/pull/169]

Any advice is much appreciated.
[jira] [Created] (ARROW-14596) [Python] parquet.read_table nested fields in columns does not work for use_legacy_dataset=False
Tom Scheffers created ARROW-14596:

Summary: [Python] parquet.read_table nested fields in columns does not work for use_legacy_dataset=False
Key: ARROW-14596
URL: https://issues.apache.org/jira/browse/ARROW-14596
Project: Apache Arrow
Issue Type: Bug
Reporter: Tom Scheffers
Fix For: 5.0.0

Reading a nested field does not work with use_legacy_dataset=False.

This works:

{code:python}
import pyarrow.parquet as pq

# `filename` is the path to the Parquet file.
t = pq.read_table(
    source=filename,
    columns=['store_key', 'properties.country'],
    use_legacy_dataset=True,
).to_pandas()
{code}

This does not work (for the same Parquet file):

{code:python}
import pyarrow.parquet as pq

# Same call; only use_legacy_dataset differs.
t = pq.read_table(
    source=filename,
    columns=['store_key', 'properties.country'],
    use_legacy_dataset=False,
).to_pandas()
{code}
[jira] [Created] (ARROW-14595) [R] Set system defaults to auto
Jonathan Keane created ARROW-14595:

Summary: [R] Set system defaults to auto
Key: ARROW-14595
URL: https://issues.apache.org/jira/browse/ARROW-14595
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Jonathan Keane
Assignee: Jonathan Keane
Fix For: 7.0.0

We should not include this in 6.0.1, but we should enable it shortly afterwards so that it spends a long time on CI and we can see whether anything breaks.

We should also ensure we have all the CI we want for it. For example, do we want this plus an offline build? We should ensure that [the system-only|https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L1063-L1071] dependency job tests all of our features.
[jira] [Created] (ARROW-14594) [R] Enable snappy by default
Jonathan Keane created ARROW-14594:

Summary: [R] Enable snappy by default
Key: ARROW-14594
URL: https://issues.apache.org/jira/browse/ARROW-14594
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Jonathan Keane
Assignee: Jonathan Keane
[GitHub] [arrow-testing] pitrou merged pull request #66: ARROW-14593: Add fuzz regression files
pitrou merged pull request #66: URL: https://github.com/apache/arrow-testing/pull/66 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (ARROW-14593) [C++] Fix crashes on invalid IPC file (OSS-Fuzz)
Antoine Pitrou created ARROW-14593:

Summary: [C++] Fix crashes on invalid IPC file (OSS-Fuzz)
Key: ARROW-14593
URL: https://issues.apache.org/jira/browse/ARROW-14593
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou

Fix the following issues:
- https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39978
- https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40653
[jira] [Created] (ARROW-14592) [C++] list_parent_indices output type should not depend on input type
Antoine Pitrou created ARROW-14592:

Summary: [C++] list_parent_indices output type should not depend on input type
Key: ARROW-14592
URL: https://issues.apache.org/jira/browse/ARROW-14592
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Antoine Pitrou

Currently, the {{list_parent_indices}} output type is the input list offset type. But that doesn't really make sense: the parent indices are simply indices into the list array, so they are bounded by the array length and not by the offset size (even a list array with 32-bit offsets can have a length larger than 2**32, for example if it's mostly nulls).
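To illustrate the point about parent indices, here is a minimal pure-Python sketch. It does not call Arrow; `parent_indices_sketch` is a hypothetical stand-in for the kernel, and the input is made-up data in which `None` marks a null list slot. It shows that the values produced are positions in the list array, so their range follows the array length rather than the number of child values (which is what the offsets count).

```python
from typing import List, Optional


def parent_indices_sketch(lists: List[Optional[list]]) -> List[int]:
    """For every child value, return the index of the list slot that holds it.

    Hypothetical stand-in for the kernel discussed above, not Arrow's code.
    Null (None) and empty slots hold no child values, so they add nothing.
    """
    parents: List[int] = []
    for slot_index, slot in enumerate(lists):
        if slot is None:
            continue
        # One entry per child value, each pointing back at its parent slot.
        parents.extend([slot_index] * len(slot))
    return parents


# Made-up list array that is mostly nulls: 6 slots, only 3 child values.
example = [None, None, [10, 20], None, [], [30]]
print(parent_indices_sketch(example))  # [2, 2, 5]
```

In this example the largest parent index (5) exceeds the total number of child values (3), which is the situation the report describes at a much larger scale.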
[jira] [Created] (ARROW-14590) [R] Implement lubridate::week
Nicola Crane created ARROW-14590:

Summary: [R] Implement lubridate::week
Key: ARROW-14590
URL: https://issues.apache.org/jira/browse/ARROW-14590
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14591) [R] Implement lubridate duration types
Nicola Crane created ARROW-14591:

Summary: [R] Implement lubridate duration types
Key: ARROW-14591
URL: https://issues.apache.org/jira/browse/ARROW-14591
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14589) [CI][Go] CGo tests crash on Windows
Antoine Pitrou created ARROW-14589:

Summary: [CI][Go] CGo tests crash on Windows
Key: ARROW-14589
URL: https://issues.apache.org/jira/browse/ARROW-14589
Project: Apache Arrow
Issue Type: Bug
Components: Continuous Integration, Go
Reporter: Antoine Pitrou

They seem to crash quite consistently. See e.g. https://github.com/apache/arrow/runs/4094939443?check_suite_focus=true

{code}
+ for d in $(go list ./... | grep -v vendor)
+ go test -tags assert,test,ccalloc github.com/apache/arrow/go/arrow
exit status 3221225785
FAIL	github.com/apache/arrow/go/arrow	0.119s
FAIL
Error: Process completed with exit code 1.
{code}
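A large decimal exit status like the one in this log is easier to read in hexadecimal. The sketch below assumes, which the report does not state, that the value is a Windows NTSTATUS code. The helper name and the small lookup table are illustrative only; the table lists a few well-known codes and is far from complete.

```python
def to_ntstatus_hex(exit_status: int) -> str:
    """Format a process exit status as an 8-digit, 32-bit hexadecimal value."""
    return f"0x{exit_status & 0xFFFFFFFF:08X}"


# Illustrative subset of well-known NTSTATUS values (assumption: the exit
# status in the log is an NTSTATUS code).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000139: "STATUS_ENTRYPOINT_NOT_FOUND",
}

status = 3221225785  # value taken from the CI log above
print(to_ntstatus_hex(status), KNOWN_NTSTATUS.get(status, "unknown"))
# prints: 0xC0000139 STATUS_ENTRYPOINT_NOT_FOUND
```

If that reading is right, the test binary may be failing while it loads rather than inside a test, for instance because a DLL it links against lacks an expected symbol. That is a hypothesis to check, not a diagnosis.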
[jira] [Created] (ARROW-14588) Create an arrow-specific checklist for a CRAN release
Dragoș Moldovan-Grünfeld created ARROW-14588:

Summary: Create an arrow-specific checklist for a CRAN release
Key: ARROW-14588
URL: https://issues.apache.org/jira/browse/ARROW-14588
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Dragoș Moldovan-Grünfeld
Assignee: Dragoș Moldovan-Grünfeld

This would adapt and implement the functionality of {{usethis::use_release_issue()}} for {{arrow}}'s specific context.
[jira] [Created] (ARROW-14587) [CI][Crossbow] Fetch a single crossbow branch instead of the full repo on Azure
Krisztian Szucs created ARROW-14587:

Summary: [CI][Crossbow] Fetch a single crossbow branch instead of the full repo on Azure
Key: ARROW-14587
URL: https://issues.apache.org/jira/browse/ARROW-14587
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration
Reporter: Krisztian Szucs
Fix For: 7.0.0

Since crossbow has a lot of references, the checkout step can take a long time; see build https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=14952=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=5bbb8710-d4c1-5a8b-fc80-a388730cf6ac

We should alter the Azure crossbow template to explicitly check out the task's branch using the {{ {{ task.branch }} }} Jinja variable.

See the Azure documentation: https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/multi-repo-checkout?view=azure-devops#checking-out-a-specific-ref
[jira] [Created] (ARROW-14586) [R] summarise() with nested aggregate expressions has a confusing error
Dewey Dunnington created ARROW-14586:

Summary: [R] summarise() with nested aggregate expressions has a confusing error
Key: ARROW-14586
URL: https://issues.apache.org/jira/browse/ARROW-14586
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Dewey Dunnington
Assignee: Dewey Dunnington

This affects code along the lines of {{summarise(mean(mean(var)))}}, where the inner expression is an aggregate function. Such code is probably not useful, but the error it produces is not particularly helpful:

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

RecordBatch$create(x = 4) %>%
  summarise(y = mean(mean(x)))
#> Warning: Error in mean(..temp0) : object '..temp0' not found; pulling data into
#> R
#> # A tibble: 1 × 1
#>       y
#>   <dbl>
#> 1     4
{code}
[jira] [Created] (ARROW-14585) [C++] Detect gRPC::grpc++_reflection in FindgRPCAlt
David Li created ARROW-14585:

Summary: [C++] Detect gRPC::grpc++_reflection in FindgRPCAlt
Key: ARROW-14585
URL: https://issues.apache.org/jira/browse/ARROW-14585
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: David Li
Assignee: David Li

A follow-up to ARROW-14440.
[jira] [Created] (ARROW-14584) [Python][CI] Python sdist installation fails with latest setuptools 58.5
Joris Van den Bossche created ARROW-14584:

Summary: [Python][CI] Python sdist installation fails with latest setuptools 58.5
Key: ARROW-14584
URL: https://issues.apache.org/jira/browse/ARROW-14584
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche

See the output of the nightly builds: https://github.com/ursacomputing/crossbow/runs/4101736097?check_suite_focus=true
(it might also be numpy that fails to install, but since numpy is a build requirement, installing pyarrow's sdist then fails as well)

Upstream issue: https://github.com/pypa/setuptools/issues/2849 (we see the same error {{AttributeError: get_data_files_without_manifest}})
[jira] [Created] (ARROW-14583) RStudio IDE crash
Zsolt Kegyes-Brassai created ARROW-14583:

Summary: RStudio IDE crash
Key: ARROW-14583
URL: https://issues.apache.org/jira/browse/ARROW-14583
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 6.0.0
Environment: I am using a Windows 10 machine, R 4.1.0, up-to-date R packages, and the latest RStudio IDE.
Reporter: Zsolt Kegyes-Brassai

I was trying the new features introduced in the latest {{arrow (6.0.2)}} package, based on examples from the “New Directions for Apache Arrow” talk.

The RStudio IDE crashed and the R session was aborted.

Looking closely, I found that I had downloaded only 2 years of data (2018 & 2019), so after the first filter ({{year == 2015}}) no data remains to be processed further.

After some debugging, replacing the collect() call one step at a time, it turns out that {{summarize()}} is the function causing the crash.

{code:r}
# Packages the pipeline below relies on.
library(arrow)
library(dplyr)

as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", partitioning = c("year", "month")) %>%
  filter(total_amount > 100 & year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 5000) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()
{code}

I would expect to get an error message (without crashing the IDE) that can be handled in code.

An alternative result would be an empty data.frame, as happens when the Parquet file is read in as a data.frame. I simulated this situation by setting a high {{total_amount}} value when filtering.

Note: when using an Arrow table, an error message is generated.

{code:r}
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.1
#> Warning: package 'tidyr' was built under R version 4.1.1
#> Warning: package 'readr' was built under R version 4.1.1
library(arrow)
#> Warning: package 'arrow' was built under R version 4.1.1
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp

# Arrow table: the empty filter result produces an error.
read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", as_data_frame = FALSE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()
#> Error: Invalid: Must pass at least one array

# data.frame: the same pipeline returns an empty tibble.
read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", as_data_frame = TRUE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()
#> # A tibble: 0 x 3
#> # ... with 3 variables: passenger_count , avg_tip_pct , n
{code}