[jira] [Created] (ARROW-13171) [R] Add binding for str_pad()
Ian Cook created ARROW-13171: Summary: [R] Add binding for str_pad() Key: ARROW-13171 URL: https://issues.apache.org/jira/browse/ARROW-13171 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Ian Cook
Should work like [stringr::str_pad()|https://stringr.tidyverse.org/reference/str_pad.html]. Should call a different one of the kernels added in ARROW-12716 depending on the value of the {{side}} argument.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
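The dispatch on {{side}} described above can be sketched roughly as follows. This is a pure-Python illustration, not the R binding: the kernel names `utf8_lpad`, `utf8_rpad` and `utf8_center` are an assumption about what ARROW-12716 added, and the helpers `kernel_for_side` and `str_pad` are hypothetical.

```python
# Assumed mapping from stringr's `side` argument to Arrow padding kernels.
# The kernel names are an assumption about what ARROW-12716 added.
SIDE_TO_KERNEL = {
    "left": "utf8_lpad",
    "right": "utf8_rpad",
    "both": "utf8_center",
}


def kernel_for_side(side: str) -> str:
    """Return the (assumed) kernel name to call for a given `side`."""
    if side not in SIDE_TO_KERNEL:
        raise ValueError(f"side must be one of {sorted(SIDE_TO_KERNEL)}, got {side!r}")
    return SIDE_TO_KERNEL[side]


def str_pad(string: str, width: int, side: str = "left", pad: str = " ") -> str:
    """Emulate the padding each kernel would perform with Python's str methods."""
    kernel = kernel_for_side(side)
    if kernel == "utf8_lpad":
        return string.rjust(width, pad)   # padding goes on the left
    if kernel == "utf8_rpad":
        return string.ljust(width, pad)   # padding goes on the right
    return string.center(width, pad)      # padding goes on both sides
```

Keeping the mapping in one table means an unsupported {{side}} value fails early with a clear error instead of silently picking a kernel.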
[jira] [Created] (ARROW-13170) [C++] Reducing branching in compute/kernels/vector_selection.cc
Niranda Perera created ARROW-13170: Summary: [C++] Reducing branching in compute/kernels/vector_selection.cc Key: ARROW-13170 URL: https://issues.apache.org/jira/browse/ARROW-13170 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Niranda Perera Assignee: Niranda Perera
[~wesm] pointed out on the mailing list that the selection operations can be improved by using a non-branching method, and [~yibocai] confirmed this. [https://lists.apache.org/thread.html/rcffa661ca3526863fc5148ed3c111a72f03b2ce2626178bd83570aa6%40%3Cdev.arrow.apache.org%3E]
Evaluate the following:
# Using a branch-less approach
# Checking whether `BitmapWordReader` can achieve better performance
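As a rough illustration of what a branch-less selection looks like, the sketch below contrasts a filter that tests each mask bit with one that writes every value unconditionally and advances the output position by the mask bit. It is written in Python only to show the control flow; it is not the code in vector_selection.cc, and any speedup would have to be measured in the C++ kernels themselves. Both function names are hypothetical.

```python
from typing import List, Sequence


def filter_branching(values: Sequence[int], mask: Sequence[bool]) -> List[int]:
    """Baseline: test every mask bit and append only the selected values."""
    out = []
    for value, keep in zip(values, mask):
        if keep:
            out.append(value)
    return out


def filter_branchless(values: Sequence[int], mask: Sequence[bool]) -> List[int]:
    """Write every value unconditionally; advance the cursor by the mask bit."""
    out = [0] * len(values)  # preallocate for the worst case (all selected)
    n = 0                    # next output position
    for value, keep in zip(values, mask):
        out[n] = value       # unconditional store, overwritten if not kept
        n += int(keep)       # cursor moves only when the value is selected
    return out[:n]
```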
[jira] [Created] (ARROW-13169) [R] group_by + write_dataset skips some countries with UN COMTRADE / BACI datasets
Mauricio 'Pachá' Vargas Sepúlveda created ARROW-13169: Summary: [R] group_by + write_dataset skips some countries with UN COMTRADE / BACI datasets Key: ARROW-13169 URL: https://issues.apache.org/jira/browse/ARROW-13169 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 4.0.1 Reporter: Mauricio 'Pachá' Vargas Sepúlveda Fix For: 5.0.0
``` r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

url <- "https://ams3.digitaloceanspaces.com/uncomtrade/baci_hs92_1995.rds"
rds <- "baci_hs92_1995.rds"
if (!file.exists(rds)) try(download.file(url, rds))

d <- readRDS("baci_hs92_1995.rds")

rds_has_usa <- any(grepl("usa", unique(d$reporter_iso)))
rds_has_usa
#> [1] TRUE

dir <- "parquet/baci_hs92"
d %>%
  group_by(year, reporter_iso) %>%
  write_dataset(dir, hive_style = F)

parquet_has_usa <- any(grepl("usa", list.files(paste0(dir, "/1995"))))
parquet_has_usa
#> [1] FALSE
```
Created on 2021-06-24 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)
[jira] [Created] (ARROW-13168) [C++] Timezone database configuration and access
Rok Mihevc created ARROW-13168: Summary: [C++] Timezone database configuration and access Key: ARROW-13168 URL: https://issues.apache.org/jira/browse/ARROW-13168 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Rok Mihevc
Note: the timezone database is currently not available on Windows, so timezone-aware operations will fail.
We're using the tz.h library, which needs an up-to-date timezone database to correctly handle timezone-aware timestamps. See [installation instructions|https://howardhinnant.github.io/date/tz.html#Installation].
We have the following options for getting a timezone database:
# local (non-Windows) OS timezone database - no work required.
# Arrow bundled folder - we could bundle the database at build time for Windows. The database would slowly go stale.
# download it from the IANA Time Zone Database at runtime - tz.h gets the database at runtime, but curl (and 7-zip on Windows) are required.
# local user-provided folder - the user could provide a location at build time. Nice to have.
# allow runtime configuration - at runtime say: "the tzdata can be found at this location"
For more context see: [ARROW-12980|https://github.com/apache/arrow/pull/10457]
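One way the options above could be combined is a lookup order that prefers the most explicit source. The sketch below is only an illustration under stated assumptions: the priority (runtime setting, then build-time folder, then OS locations), the function `resolve_tzdata_dir`, its parameters, and the default OS path are all hypothetical and are not taken from Arrow or tz.h.

```python
import os
from typing import Callable, Optional, Sequence


def resolve_tzdata_dir(
    runtime_dir: Optional[str],
    build_time_dir: Optional[str],
    os_dirs: Sequence[str] = ("/usr/share/zoneinfo",),  # assumed OS location
    exists: Callable[[str], bool] = os.path.isdir,
) -> Optional[str]:
    """Return the first usable timezone database folder, or None.

    Assumed priority: runtime configuration, then a folder fixed at build
    time (user-provided or bundled), then the OS database.
    """
    candidates = [runtime_dir, build_time_dir, *os_dirs]
    for path in candidates:
        if path is not None and exists(path):
            return path
    return None  # caller decides whether to download or fail
```

Passing `exists` as a parameter keeps the lookup testable without touching the file system.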
[jira] [Created] (ARROW-13167) [C++] Type determination kernels ("type", "type_id")
Ian Cook created ARROW-13167: Summary: [C++] Type determination kernels ("type", "type_id") Key: ARROW-13167 URL: https://issues.apache.org/jira/browse/ARROW-13167 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Ian Cook
The Arrow C++ library exposes an API for determining the data type of an expression, but it is exposed as a method of the expression class and it requires that the user pass a schema as an argument to the method. This is inconvenient; for example, we have had to write some inconsistent code in the R bindings to make expression objects carry schemas along with them and then pass the schemas to derivative expressions, unifying schemas as needed for derivative expressions that take 2+ expressions as arguments.
This would be much cleaner if we could use the kernel function calling interface to call a unary {{type_id}} function that would simply determine the type of its input datum and return a scalar integer value from the data type enum indicating its data type. It would be convenient to also have a version of this that returned the string description of the data type; I think this could be named {{type}}.
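The shape of the two proposed functions can be sketched as below. This is a pure-Python stand-in, not Arrow code: the `TypeId` enum and its integer values are illustrative and do not match Arrow's data type enum, and the string-returning function is called `type_name` here only because `type` is a Python builtin.

```python
from enum import IntEnum


class TypeId(IntEnum):
    """Hypothetical stand-in for the data type enum; values are illustrative."""
    BOOL = 1
    INT64 = 2
    DOUBLE = 3
    STRING = 4


# Exact-type lookup, so that True maps to BOOL rather than INT64.
_PY_TO_TYPE_ID = {
    bool: TypeId.BOOL,
    int: TypeId.INT64,
    float: TypeId.DOUBLE,
    str: TypeId.STRING,
}


def type_id(datum) -> int:
    """Unary function: return the enum value for the input's type as an integer."""
    return int(_PY_TO_TYPE_ID[type(datum)])


def type_name(datum) -> str:
    """Unary function: return a string description of the input's type."""
    return TypeId(type_id(datum)).name.lower()
```

Neither function needs a schema: the type is read from the input itself, which is the convenience the issue asks for.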
[jira] [Created] (ARROW-13166) Java Dataset API ScanOptions expansion
Sebastiaan Alvarez Rodriguez created ARROW-13166: Summary: Java Dataset API ScanOptions expansion Key: ARROW-13166 URL: https://issues.apache.org/jira/browse/ARROW-13166 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Sebastiaan Alvarez Rodriguez
Currently, there are very few scanning options that we can set in the Java Dataset API [1]. Additionally, the options that exist must always be set from Java, with no way to fall back on sensible default values from core Arrow.
For my use case, I want to be able to set the `fragment_readahead` option from the Java side.
It would be great if:
+ `ScanOptions.java` would be expanded to allow us to set more, potentially all, options related to scanner creation.
+ Java users could omit options to use the default values, e.g. [2].
It would be good to know what others think, and whether a PR for this is useful.
[1][https://github.com/apache/arrow/blob/master/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java]
[2][https://github.com/apache/arrow/blob/ad5dc8207192abe71d3e88303252629041968508/cpp/src/arrow/dataset/scanner.h#L51-L53]
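The "omit an option to get the core default" behaviour requested above can be sketched language-neutrally as follows. This is a Python illustration rather than the Java API: the `ScanOptions` class, its fields, and the numbers in `CORE_DEFAULTS` are placeholders and are not the values defined in scanner.h.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Placeholder values standing in for the defaults defined in core Arrow.
CORE_DEFAULTS: Dict[str, int] = {"batch_size": 1024, "fragment_readahead": 4}


@dataclass(frozen=True)
class ScanOptions:
    """Hypothetical options holder: None means "use the core default"."""
    batch_size: Optional[int] = None
    fragment_readahead: Optional[int] = None

    def resolved(self) -> Dict[str, int]:
        """Fill every omitted option from CORE_DEFAULTS."""
        chosen = {
            "batch_size": self.batch_size,
            "fragment_readahead": self.fragment_readahead,
        }
        # Compare against None so that an explicit 0 is preserved.
        return {k: CORE_DEFAULTS[k] if v is None else v for k, v in chosen.items()}
```

Using `None` (rather than a sentinel number) as "unset" keeps an explicit `0` distinguishable from an omitted option.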
[jira] [Created] (ARROW-13165) [R] Add bindings for ProjectOptions
Ian Cook created ARROW-13165: Summary: [R] Add bindings for ProjectOptions Key: ARROW-13165 URL: https://issues.apache.org/jira/browse/ARROW-13165 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Ian Cook
The {{project}} kernel creates a column of struct (equivalent to a column of named lists in R). Add to {{make_compute_options}} in {{compute.cpp}} so we can pass {{ProjectOptions}} to the {{project}} kernel.
One practical application of the {{project}} kernel is to create a binding for the stringr function {{str_locate}} which returns a column of named lists.
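To make the "column of named lists" idea concrete, the sketch below emulates a struct column as a list of dicts with {{start}} and {{end}} fields. It is a pure-Python illustration, not the {{project}} kernel or the stringr implementation; the 1-based, inclusive positions and the handling of nulls and non-matches are assumptions.

```python
import re
from typing import Dict, List, Optional

Row = Optional[Dict[str, Optional[int]]]


def str_locate(strings: List[Optional[str]], pattern: str) -> List[Row]:
    """Return one {"start", "end"} record per input string.

    Positions are 1-based and inclusive (an assumption mirroring R indexing);
    a null input gives a null row, and a non-match gives null fields.
    """
    compiled = re.compile(pattern)
    rows: List[Row] = []
    for s in strings:
        if s is None:
            rows.append(None)
            continue
        match = compiled.search(s)
        if match is None:
            rows.append({"start": None, "end": None})
        else:
            rows.append({"start": match.start() + 1, "end": match.end()})
    return rows
```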
[jira] [Created] (ARROW-13164) [R] altrep vectors from Array with nulls
Romain Francois created ARROW-13164: Summary: [R] altrep vectors from Array with nulls Key: ARROW-13164 URL: https://issues.apache.org/jira/browse/ARROW-13164 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Romain Francois Assignee: Romain Francois
[jira] [Created] (ARROW-13163) [C++][Gandiva] Implement REPEAT function on Gandiva
João Pedro Antunes Ferreira created ARROW-13163: Summary: [C++][Gandiva] Implement REPEAT function on Gandiva Key: ARROW-13163 URL: https://issues.apache.org/jira/browse/ARROW-13163 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: João Pedro Antunes Ferreira Assignee: João Pedro Antunes Ferreira
Implement a REPEAT function in Gandiva that concatenates a string with itself "n" times.
- REPEAT(str, int)
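The intended semantics of REPEAT(str, int) can be illustrated with a small Python sketch. This is not the Gandiva implementation; in particular, rejecting a negative count is an assumption, since the issue does not say how that case should behave.

```python
def repeat(text: str, n: int) -> str:
    """Concatenate `text` with itself `n` times."""
    if n < 0:
        # Assumption: the issue does not specify the behaviour for negative n.
        raise ValueError("repeat count must be non-negative")
    return text * n
```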
[jira] [Created] (ARROW-13162) [C++][Gandiva] Add new alias for extract date functions in Gandiva registry
João Pedro Antunes Ferreira created ARROW-13162: Summary: [C++][Gandiva] Add new alias for extract date functions in Gandiva registry Key: ARROW-13162 URL: https://issues.apache.org/jira/browse/ARROW-13162 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: João Pedro Antunes Ferreira Assignee: João Pedro Antunes Ferreira
[jira] [Created] (ARROW-13161) Allow setting FragmentReadahead to 0 in ScannerBuilder
Jayjeet Chakraborty created ARROW-13161: Summary: Allow setting FragmentReadahead to 0 in ScannerBuilder Key: ARROW-13161 URL: https://issues.apache.org/jira/browse/ARROW-13161 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Jayjeet Chakraborty
I have an application where I need to set fragment readahead to 0. However, it looks like the ScannerBuilder for some reason does not allow setting the fragment readahead to 0 [1]. It would be very helpful to know why that is, and whether a PR lifting the restriction would be accepted, because a docstring mentions that users can set fragment readahead to 0 if they want [2].
[1]https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L864
[2]https://github.com/apache/arrow/blob/998a2a1668ea57a49d85fbb38f7f0e7eb94c29db/cpp/src/arrow/dataset/scanner.h#L93
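The difference between the reported behaviour and the requested one comes down to a single comparison. The sketch below is a Python illustration, not the code in scanner.cc: the class `ScannerBuilderSketch`, its method names, the initial value, and the assumption that the existing check is `<= 0` are all hypothetical.

```python
class ScannerBuilderSketch:
    """Hypothetical builder showing a strict and a relaxed validation rule."""

    def __init__(self) -> None:
        self.fragment_readahead = 8  # placeholder initial value

    def set_fragment_readahead_strict(self, value: int) -> None:
        # Assumed shape of the current check: zero is rejected.
        if value <= 0:
            raise ValueError("fragment readahead must be greater than 0")
        self.fragment_readahead = value

    def set_fragment_readahead_relaxed(self, value: int) -> None:
        # Requested behaviour: zero is allowed, only negatives are rejected.
        if value < 0:
            raise ValueError("fragment readahead must not be negative")
        self.fragment_readahead = value
```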
[jira] [Created] (ARROW-13160) [CI][C++] Use binary caching for vcpkg builds
Antoine Pitrou created ARROW-13160: Summary: [CI][C++] Use binary caching for vcpkg builds Key: ARROW-13160 URL: https://issues.apache.org/jira/browse/ARROW-13160 Project: Apache Arrow Issue Type: Wish Components: C++, Continuous Integration Reporter: Antoine Pitrou
Currently, the vcpkg CI builds ({{test-build-vcpkg-win}}) take 2 hours. We should try to enable binary caching: https://github.com/microsoft/vcpkg/blob/master/docs/users/binarycaching.md
[jira] [Created] (ARROW-13159) [Doc][Python] The use of IPython directive or doctest code blocks in the python user guide
Joris Van den Bossche created ARROW-13159: Summary: [Doc][Python] The use of IPython directive or doctest code blocks in the python user guide Key: ARROW-13159 URL: https://issues.apache.org/jira/browse/ARROW-13159 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Reporter: Joris Van den Bossche
From https://github.com/apache/arrow/pull/10266#discussion_r630837422
We are currently using the IPython directive in many places in the Python docs, so that something written as
{code}
.. ipython:: python

    x = 1
    x + 1
{code}
is converted during the doc build to (by running the code):
{code}
.. code-block:: ipython

    In [1]: x = 1

    In [2]: x + 1
    Out[2]: 2
{code}
Running all the code during the doc build can be costly, and the more docs we add, the slower building the docs becomes.
We could convert all of those to {{code-block}}, but personally I think we should ideally still check the code examples for correctness, where applicable. For this, we could use the doctest format instead of the IPython directive, and verify the docs using pytest's doctest support. This can be run separately as tests, and doesn't need to be part of the doc build (at least when you only change wording / rst syntax and want to verify the resulting html, you don't need to run the doctests). But maintaining examples as doctests certainly also adds some extra cost (e.g. when outputs change slightly).
Another option would be to add an option to the IPython directive to skip the execution of the code examples (I think this should be rather easy to add to the IPython directive, but then it's still a matter of passing this through from the build command invocation).
cc [~apitrou] [~amol-]
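For comparison, the same example in doctest format can be checked without building the docs. The sketch below uses only the standard-library `doctest` module; the function `add_one` and the helper `check_examples` are hypothetical names chosen for the illustration.

```python
import doctest


def add_one(x):
    """Return x + 1.

    >>> x = 1
    >>> add_one(x)
    2
    """
    return x + 1


def check_examples():
    """Run the docstring examples of add_one; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        add_one.__doc__, {"add_one": add_one}, "add_one", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted
```

In a docs tree, pytest can collect examples in this format from text files via its `--doctest-glob` option, so the check runs as a test job rather than as part of the doc build.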
[jira] [Created] (ARROW-13158) [Python] Fix repr and contains of StructScalar with duplicate field names
Joris Van den Bossche created ARROW-13158: Summary: [Python] Fix repr and contains of StructScalar with duplicate field names Key: ARROW-13158 URL: https://issues.apache.org/jira/browse/ARROW-13158 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche
Broken off from ARROW-9997.
With duplicate field names, the repr fails:
{code}
In [28]: s = pa.scalar([('a', 1), ('b', 2), ('a', 3)], pa.struct([('a', 'int64'), ('b', 'int64'), ('a', 'int64')]))

In [29]: 0 in s
Out[29]: True

In [30]: s
KeyError: 'a'
{code}
In addition, the contains ({{in}}) operation also shouldn't accept integers (this is also the case for non-duplicate fields).
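One possible shape for the fix is sketched below with a small stand-in class; it is not pyarrow's `StructScalar`. The repr walks the fields by position so duplicate names never trigger a lookup by key, and `in` only matches field names. Returning `False` for an integer key (rather than raising) is an assumption, and the class name and repr format are hypothetical.

```python
from typing import Any, List, Tuple


class StructScalarSketch:
    """Hypothetical struct scalar holding (name, value) pairs in field order."""

    def __init__(self, items: List[Tuple[str, Any]]) -> None:
        self._items = list(items)

    def __contains__(self, key: object) -> bool:
        # Only field names are valid keys; integers never match (assumption).
        if not isinstance(key, str):
            return False
        return any(name == key for name, _ in self._items)

    def __repr__(self) -> str:
        # Iterate by position instead of looking values up by name, so
        # duplicate field names cannot cause a KeyError.
        body = ", ".join(f"({name!r}, {value!r})" for name, value in self._items)
        return f"<StructScalarSketch: [{body}]>"
```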