[jira] [Created] (ARROW-13171) [R] Add binding for str_pad()

2021-06-24 Thread Ian Cook (Jira)
Ian Cook created ARROW-13171:


 Summary: [R] Add binding for str_pad()
 Key: ARROW-13171
 URL: https://issues.apache.org/jira/browse/ARROW-13171
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Ian Cook


Should work like 
[stringr::str_pad()|https://stringr.tidyverse.org/reference/str_pad.html]. 
Should call a different one of the kernels added in ARROW-12716 depending on the 
value of the {{side}} argument.
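
As a rough illustration of that dispatch (written in Python rather than R, and not the actual binding), the sketch below picks a padding routine based on {{side}}. The helpers {{lpad}}, {{rpad}} and {{center}} are hypothetical stand-ins for the kernels from ARROW-12716, and how an odd amount of padding is split for {{side = "both"}} is an assumption.

```python
def lpad(s, width, pad=" "):
    """Pad on the left until the string is at least `width` characters."""
    return pad * max(width - len(s), 0) + s


def rpad(s, width, pad=" "):
    """Pad on the right until the string is at least `width` characters."""
    return s + pad * max(width - len(s), 0)


def center(s, width, pad=" "):
    """Pad on both sides; assumption: an odd leftover goes on the right."""
    total = max(width - len(s), 0)
    left = total // 2
    return pad * left + s + pad * (total - left)


# Hypothetical mapping from the `side` argument to a padding routine.
_KERNELS = {"left": lpad, "right": rpad, "both": center}


def str_pad(strings, width, side="left", pad=" "):
    """Pad every string in `strings`; None (missing) values pass through."""
    if side not in _KERNELS:
        raise ValueError(f"side must be one of {sorted(_KERNELS)}")
    if len(pad) != 1:
        raise ValueError("pad must be a single character")
    kernel = _KERNELS[side]
    return [None if s is None else kernel(s, width, pad) for s in strings]


print(str_pad(["abc", None], 7, side="both", pad="-"))
```

Keeping the mapping in one table means an unsupported {{side}} value fails early instead of silently falling back to a default.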



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13170) [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-24 Thread Niranda Perera (Jira)
Niranda Perera created ARROW-13170:

 Summary: [C++] Reducing branching in 
compute/kernels/vector_selection.cc
 Key: ARROW-13170
 URL: https://issues.apache.org/jira/browse/ARROW-13170
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Niranda Perera
Assignee: Niranda Perera


[~wesm] pointed out on the mailing list that the selection operations can be 
improved by using a non-branching method, and [~yibocai] confirmed this.

[https://lists.apache.org/thread.html/rcffa661ca3526863fc5148ed3c111a72f03b2ce2626178bd83570aa6%40%3Cdev.arrow.apache.org%3E]

 

Evaluate the following:
 # Using a branch-less approach
 # Checking whether `BitmapWordReader` can achieve better performance
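
To make the first item concrete, here is a minimal sketch of the general branch-less filter pattern, written in Python purely for illustration. It is not the code in vector_selection.cc, and the Python interpreter itself still branches internally; the point is only the shape of the loop: store unconditionally, then advance the output position by the value of the selection bit.

```python
def filter_branching(values, mask):
    """Reference version: one data-dependent branch per element."""
    out = []
    for value, keep in zip(values, mask):
        if keep:
            out.append(value)
    return out


def filter_branchless(values, mask):
    """Same result without an `if` on the mask inside the loop."""
    out = [None] * len(values)  # worst case: every element is selected
    n = 0
    for value, keep in zip(values, mask):
        out[n] = value   # unconditional store; overwritten if not kept
        n += int(keep)   # advance only when the selection bit is set
    return out[:n]


values = [10, 20, 30, 40, 50]
mask = [True, False, True, True, False]
print(filter_branchless(values, mask))
```

Whether this pattern actually wins depends on the selectivity of the mask and on the hardware, which is why the issue asks for an evaluation rather than assuming a speedup.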





[jira] [Created] (ARROW-13169) [R] group_by + write_dataset skips some countries with UN COMTRADE / BACI datasets

2021-06-24 Thread Mauricio 'Pachá' Vargas Sepúlveda (Jira)
Mauricio 'Pachá' Vargas Sepúlveda created ARROW-13169:

 Summary: [R] group_by + write_dataset skips some countries with UN 
COMTRADE / BACI datasets
 Key: ARROW-13169
 URL: https://issues.apache.org/jira/browse/ARROW-13169
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 4.0.1
Reporter: Mauricio 'Pachá' Vargas Sepúlveda
 Fix For: 5.0.0


``` r
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#> timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#> filter, lag
#> The following objects are masked from 'package:base':
#> 
#> intersect, setdiff, setequal, union

url <- "https://ams3.digitaloceanspaces.com/uncomtrade/baci_hs92_1995.rds"
rds <- "baci_hs92_1995.rds"

if (!file.exists(rds)) try(download.file(url, rds))

d <- readRDS("baci_hs92_1995.rds")

rds_has_usa <- any(grepl("usa", unique(d$reporter_iso)))
rds_has_usa
#> [1] TRUE

dir <- "parquet/baci_hs92"

d %>% 
  group_by(year, reporter_iso) %>% 
  write_dataset(dir, hive_style = F)

parquet_has_usa <- any(grepl("usa", list.files(paste0(dir, "/1995"))))
parquet_has_usa
#> [1] FALSE
```

Created on 2021-06-24 by the [reprex 
package](https://reprex.tidyverse.org) (v2.0.0)






[jira] [Created] (ARROW-13168) [C++] Timezone database configuration and access

2021-06-24 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-13168:

 Summary: [C++] Timezone database configuration and access
 Key: ARROW-13168
 URL: https://issues.apache.org/jira/browse/ARROW-13168
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Rok Mihevc


Note: currently the timezone database is not available on Windows, so 
timezone-aware operations will fail.

We're using the tz.h library, which needs an up-to-date timezone database to 
correctly handle timestamps with timezones. See [installation 
instructions|https://howardhinnant.github.io/date/tz.html#Installation].

We have the following options for getting a timezone database:
 # local (non-Windows) OS timezone database - no work required.
 # Arrow-bundled folder - we could bundle the database at build time for 
Windows. The database would slowly go stale.
 # download it from the IANA Time Zone Database at runtime - tz.h fetches the 
database at runtime, but curl (and 7-zip on Windows) are required.
 # local user-provided folder - the user could provide a location at build time. 
Nice to have.
 # allow runtime configuration - at runtime say: "the tzdata can be found at 
this location"
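
One way the folder-based options could be combined is a simple lookup chain. The Python sketch below is only an illustration: the function name, its parameters, and the order of precedence (runtime setting first, then build-time folder, then OS locations, then a bundled copy) are all assumptions and are not specified in this issue. The download option is left out.

```python
import os


def resolve_tzdata_dir(runtime_dir=None, build_time_dir=None,
                       os_dirs=("/usr/share/zoneinfo",), bundled_dir=None):
    """Return the first candidate folder that exists, or None.

    Assumed precedence (hypothetical): runtime configuration, then the
    user-provided build-time folder, then OS locations, then a bundled copy.
    """
    candidates = [runtime_dir, build_time_dir, *os_dirs, bundled_dir]
    for path in candidates:
        if path and os.path.isdir(path):
            return path
    return None


# On systems without an OS database and without any configured folder,
# this prints None, which corresponds to the failure mode noted above.
print(resolve_tzdata_dir())
```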

For more context see: [ARROW-12980|https://github.com/apache/arrow/pull/10457]





[jira] [Created] (ARROW-13167) [C++] Type determination kernels ("type", "type_id")

2021-06-24 Thread Ian Cook (Jira)
Ian Cook created ARROW-13167:


 Summary: [C++] Type determination kernels ("type", "type_id")
 Key: ARROW-13167
 URL: https://issues.apache.org/jira/browse/ARROW-13167
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ian Cook


The Arrow C++ library exposes an API for determining the data type of an 
expression, but it is exposed as a method of the expression class and it 
requires that the user pass a schema as an argument to the method. This is 
inconvenient; for example, we have had to write some inconsistent code in the R 
bindings to make expression objects carry schemas along with them and then pass 
the schemas to derivative expressions, unifying schemas as needed for 
derivative expressions that take 2+ expressions as arguments.

This would be much cleaner if we could use the kernel function calling 
interface to call a unary {{type_id}} function that would simply determine the 
type of its input datum and return a scalar integer value from the data type 
enum indicating its data type. It would be convenient to also have a 
version of this that returns the string description of the data type; I think 
this could be named {{type}}.
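
A toy model of the proposal, in Python, may help show the intended calling pattern. Everything here is hypothetical: the {{TypeId}} enum is a tiny stand-in whose values are illustrative (not Arrow's), and {{call_function}} only mimics the idea of looking up a function by name and passing it a datum, with no schema involved.

```python
from enum import IntEnum


class TypeId(IntEnum):
    """Hypothetical stand-in for the data type enum; values are illustrative."""
    NA = 0
    BOOL = 1
    INT64 = 2
    DOUBLE = 3
    STRING = 4


def _type_id(datum):
    """Determine the type from the datum alone; no schema is passed in."""
    values = datum if isinstance(datum, list) else [datum]
    for value in values:
        if value is None:
            continue
        if isinstance(value, bool):  # check bool first: bool subclasses int
            return TypeId.BOOL
        if isinstance(value, int):
            return TypeId.INT64
        if isinstance(value, float):
            return TypeId.DOUBLE
        if isinstance(value, str):
            return TypeId.STRING
        raise TypeError(f"unsupported value: {value!r}")
    return TypeId.NA  # empty input or nulls only


# Hypothetical registry of unary functions, looked up by name.
FUNCTIONS = {
    "type_id": lambda datum: int(_type_id(datum)),
    "type": lambda datum: _type_id(datum).name.lower(),
}


def call_function(name, datum):
    return FUNCTIONS[name](datum)


print(call_function("type_id", [1, None, 3]), call_function("type", "abc"))
```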





[jira] [Created] (ARROW-13166) Java Dataset API ScanOptions expansion

2021-06-24 Thread Sebastiaan Alvarez Rodriguez (Jira)
Sebastiaan Alvarez Rodriguez created ARROW-13166:


 Summary: Java Dataset API ScanOptions expansion
 Key: ARROW-13166
 URL: https://issues.apache.org/jira/browse/ARROW-13166
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Sebastiaan Alvarez Rodriguez


Currently, there are very few scanning options which we can set in the Java 
Dataset API [1].

Additionally, the options that exist now must always be set from Java, without 
the possibility of using sensible default values from core Arrow.

For my use case, I want to be able to set the `fragment_readahead` option from 
the Java side.

 

It would be great if:
 + `ScanOptions.java` would be expanded to allow us to set more, potentially 
all options related to scanner creation.
 + Java users can omit options to use the default values, e.g. [2].
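
The second point can be sketched as "unset means use the core default". The example below is in Python for brevity and is not a proposal for the actual Java class: the field names, the {{CORE_DEFAULTS}} table and its values are placeholders chosen for illustration and are not claimed to match Arrow's real defaults.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Placeholder defaults standing in for the values defined in core Arrow.
CORE_DEFAULTS = {"batch_size": 1 << 20, "fragment_readahead": 8}


@dataclass(frozen=True)
class ScanOptions:
    """Hypothetical options object: None means 'not set by the caller'."""
    batch_size: Optional[int] = None
    fragment_readahead: Optional[int] = None

    def resolved(self, defaults):
        """Overlay explicitly set options on top of the defaults."""
        explicit = {k: v for k, v in asdict(self).items() if v is not None}
        return {**defaults, **explicit}


print(ScanOptions(fragment_readahead=2).resolved(CORE_DEFAULTS))
```

Testing against None rather than truthiness keeps an explicit 0 distinct from "not set".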

It would be good to know what others think, and whether a PR for this would be useful.


[1][https://github.com/apache/arrow/blob/master/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java]
[2][https://github.com/apache/arrow/blob/ad5dc8207192abe71d3e88303252629041968508/cpp/src/arrow/dataset/scanner.h#L51-L53]





[jira] [Created] (ARROW-13165) [R] Add bindings for ProjectOptions

2021-06-24 Thread Ian Cook (Jira)
Ian Cook created ARROW-13165:


 Summary: [R] Add bindings for ProjectOptions
 Key: ARROW-13165
 URL: https://issues.apache.org/jira/browse/ARROW-13165
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Ian Cook


The {{project}} kernel creates a struct column (equivalent to a column of 
named lists in R). Add a case to {{make_compute_options}} in {{compute.cpp}} so 
we can pass {{ProjectOptions}} to the {{project}} kernel.

One practical application of the {{project}} kernel is to create a binding for 
the stringr function {{str_locate}}, which returns a column of named lists.
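
To illustrate the shape of such a result (not the binding itself), here is a small Python sketch that returns one struct-like record per input string, with 1-based inclusive {{start}} and {{end}} positions of the first match. Treating the pattern as a fixed string, and representing each struct as a dict, are simplifying assumptions.

```python
import re


def str_locate(strings, pattern):
    """Locate the first match of a fixed string in each input string.

    Returns a list of {"start", "end"} records, a stand-in for a struct
    column. Positions are 1-based and inclusive; no match yields None.
    """
    fixed = re.compile(re.escape(pattern))  # assumption: literal pattern
    out = []
    for s in strings:
        match = None if s is None else fixed.search(s)
        if match is None:
            out.append({"start": None, "end": None})
        else:
            out.append({"start": match.start() + 1, "end": match.end()})
    return out


print(str_locate(["banana", "kiwi"], "an"))
```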





[jira] [Created] (ARROW-13164) [R] altrep vectors from Array with nulls

2021-06-24 Thread Romain Francois (Jira)
Romain Francois created ARROW-13164:

 Summary: [R] altrep vectors from Array with nulls
 Key: ARROW-13164
 URL: https://issues.apache.org/jira/browse/ARROW-13164
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Romain Francois
Assignee: Romain Francois








[jira] [Created] (ARROW-13163) [C++][Gandiva] Implement REPEAT function on Gandiva

2021-06-24 Thread João Pedro Antunes Ferreira (Jira)
João Pedro Antunes Ferreira created ARROW-13163:

 Summary: [C++][Gandiva] Implement REPEAT function on Gandiva
 Key: ARROW-13163
 URL: https://issues.apache.org/jira/browse/ARROW-13163
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: João Pedro Antunes Ferreira
Assignee: João Pedro Antunes Ferreira


Implement a REPEAT function in Gandiva that concatenates a string with itself "n" times.
- REPEAT(str, int)
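
The intended semantics can be sketched in a few lines of Python. This is not the Gandiva implementation; passing nulls through and rejecting a negative count are assumptions, since the issue does not say how those cases should behave.

```python
def repeat(s, n):
    """Return `s` concatenated with itself `n` times."""
    if s is None or n is None:
        return None  # assumption: null in, null out
    if n < 0:
        raise ValueError("repeat count must be non-negative")  # assumption
    return s * n


print(repeat("ab", 3))
```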





[jira] [Created] (ARROW-13162) [C++][Gandiva] Add new alias for extract date functions in Gandiva registry

2021-06-24 Thread João Pedro Antunes Ferreira (Jira)
João Pedro Antunes Ferreira created ARROW-13162:

 Summary: [C++][Gandiva] Add new alias for extract date functions 
in Gandiva registry
 Key: ARROW-13162
 URL: https://issues.apache.org/jira/browse/ARROW-13162
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: João Pedro Antunes Ferreira
Assignee: João Pedro Antunes Ferreira








[jira] [Created] (ARROW-13161) Allow setting FragmentReadahead to 0 in ScannerBuilder

2021-06-24 Thread Jayjeet Chakraborty (Jira)
Jayjeet Chakraborty created ARROW-13161:

 Summary: Allow setting FragmentReadahead to 0 in ScannerBuilder
 Key: ARROW-13161
 URL: https://issues.apache.org/jira/browse/ARROW-13161
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Jayjeet Chakraborty


I have an application where I need to set fragment readahead to 0. However, the 
ScannerBuilder currently does not allow setting the fragment readahead to 0 [1]. 
It would be very helpful to know why that is, and whether a PR lifting that 
restriction would be accepted, since a docstring mentions that users can set 
fragment readahead to 0 if they want [2].

[1]https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L864
[2]https://github.com/apache/arrow/blob/998a2a1668ea57a49d85fbb38f7f0e7eb94c29db/cpp/src/arrow/dataset/scanner.h#L93





[jira] [Created] (ARROW-13160) [CI][C++] Use binary caching for vcpkg builds

2021-06-24 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-13160:

 Summary: [CI][C++] Use binary caching for vcpkg builds
 Key: ARROW-13160
 URL: https://issues.apache.org/jira/browse/ARROW-13160
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


Currently, the vcpkg CI builds ({{test-build-vcpkg-win}}) take 2 hours.

We should try to enable binary caching: 
https://github.com/microsoft/vcpkg/blob/master/docs/users/binarycaching.md






[jira] [Created] (ARROW-13159) [Doc][Python] The use of IPython directive or doctest code blocks in the python user guide

2021-06-24 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-13159:

 Summary: [Doc][Python] The use of IPython directive or doctest 
code blocks in the python user guide
 Key: ARROW-13159
 URL: https://issues.apache.org/jira/browse/ARROW-13159
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Reporter: Joris Van den Bossche


From https://github.com/apache/arrow/pull/10266#discussion_r630837422

We are currently using the IPython directive in many places in the Python docs, 
so that something written as 

{code}
.. ipython:: python

  x = 1
  x + 2
{code}

is converted during the doc build to (by running the code):

{code}
.. code-block:: ipython

  In [1]: x = 1

  In [2]: x + 2
  Out[2]: 3
{code}

Running all the code during the doc build can be costly, and the more docs we 
add, the slower building the docs becomes.

We could convert all of those to {{code-block}}, but personally I think we 
should ideally still check the code examples for correctness, where applicable. 
For this, we could use the doctest format instead of the IPython directive and 
verify the docs using pytest's doctest support. 

These checks can be run separately as tests and don't need to be part of the 
doc build (when you only change wording or rst syntax and want to verify the 
resulting html, you don't need to run the doctests).
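
For illustration, the same example as above written in doctest format can be checked with the standard library alone, independently of any doc build. This is a minimal sketch: the {{check_example}} helper is a hypothetical name, and in practice pytest could collect such blocks from the rst files directly (for instance through its {{--doctest-glob}} option) instead of a hand-written helper.

```python
import doctest

# The same example as above, written in doctest format.
EXAMPLE = """
>>> x = 1
>>> x + 2
3
"""


def check_example(text, name="example"):
    """Run the doctest examples found in `text` and return the results."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, {}, name, "<docs>", 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = check_example(EXAMPLE)
print(results.failed, results.attempted)
```

Unlike the IPython directive, the expected output is stored in the source, so it has to be updated by hand whenever the printed result changes.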

But maintaining examples as doctests certainly also adds some extra cost (e.g. 
when outputs change slightly).

Another option could also be to add an option to the IPython directive to skip 
the execution of the code examples (I think this should be rather easy to add 
to the IPython directive, but then it's still a matter of passing this through 
from the build command invocation).

cc [~apitrou] [~amol-] 





[jira] [Created] (ARROW-13158) [Python] Fix repr and contains of StructScalar with duplicate field names

2021-06-24 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-13158:

 Summary: [Python] Fix repr and contains of StructScalar with 
duplicate field names
 Key: ARROW-13158
 URL: https://issues.apache.org/jira/browse/ARROW-13158
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Broken off from ARROW-9997

With duplicate field names, the repr fails:

{code}
In [28]: s = pa.scalar([('a', 1), ('b', 2), ('a', 3)], pa.struct([('a', 
'int64'), ('b', 'int64'), ('a', 'int64')]))

In [29]: 0 in s
Out[29]: True

In [30]: s

KeyError: 'a'
{code}

In addition, the contains ({{in}}) operation also shouldn't accept integers 
(this is also the case for non-duplicate fields)


