[jira] [Created] (ARROW-15668) Parameterize integration tests by language

2022-02-11 Thread Jira
Jorge Leitão created ARROW-15668:


 Summary: Parameterize integration tests by language
 Key: ARROW-15668
 URL: https://issues.apache.org/jira/browse/ARROW-15668
 Project: Apache Arrow
  Issue Type: Test
  Components: Developer Tools, Integration
Reporter: Jorge Leitão


Currently, the matrix of which integration tests to run is defined on a 
per-test basis and is written in Python.

I propose that we lift this into a common configuration file (.yaml) that 
declares which tests each language supports, and that the test harness reads 
during test execution to skip files that are not supported.

This allows for a better overview of which languages support what.
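
As a rough illustration of the proposal, the sketch below uses a Python dict as a stand-in for the parsed contents of such a .yaml file. The language names, test names, and the should_skip and plan helpers are hypothetical and are not taken from the existing integration harness.

```python
# Stand-in for the parsed contents of a hypothetical support-matrix .yaml file.
# All test and language names here are illustrative assumptions.
SUPPORT_MATRIX = {
    "cpp": {"primitive", "decimal", "dictionary"},
    "rust": {"primitive", "dictionary"},
    "js": {"primitive"},
}


def should_skip(test_name, producer, consumer, matrix=SUPPORT_MATRIX):
    """Return True if either side does not declare support for the test."""
    return not (
        test_name in matrix.get(producer, set())
        and test_name in matrix.get(consumer, set())
    )


def plan(test_names, languages, matrix=SUPPORT_MATRIX):
    """List the (test, producer, consumer) combinations that would run."""
    return [
        (test, producer, consumer)
        for test in test_names
        for producer in languages
        for consumer in languages
        if not should_skip(test, producer, consumer, matrix)
    ]
```

Keeping the matrix as plain data means the same file could also be rendered as a table of which languages support what.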



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15667) [R] Windows build can fail if building only shared libraries

2022-02-11 Thread Will Jones (Jira)
Will Jones created ARROW-15667:
--

 Summary: [R] Windows build can fail if building only shared 
libraries
 Key: ARROW-15667
 URL: https://issues.apache.org/jira/browse/ARROW-15667
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 7.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 8.0.0


This should only affect dev environments. I noticed that a build with shared 
libraries only fails because it expects arrow_bundled_dependencies, which I 
think we only build as part of static builds.





[jira] [Created] (ARROW-15666) [C++] Add format inference option to StrptimeOptions

2022-02-11 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-15666:
--

 Summary: [C++] Add format inference option to StrptimeOptions
 Key: ARROW-15666
 URL: https://issues.apache.org/jira/browse/ARROW-15666
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Rok Mihevc


We want to have an option to infer timestamp format.

See [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html] and lubridate [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html] for examples.
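
A minimal sketch of what format inference could mean, written in Python rather than C++ for brevity. It tries a short list of candidate formats and returns the first one that parses every input. The candidate list and the infer_format name are assumptions, not an existing Arrow API, and pandas and lubridate use more elaborate strategies.

```python
from datetime import datetime

# Illustrative candidates only; a real implementation would need a broader,
# carefully ordered set.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y", "%m/%d/%Y"]


def infer_format(values, candidates=CANDIDATE_FORMATS):
    """Return the first candidate format that parses every value, else None."""
    for fmt in candidates:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            # At least one value does not match; try the next candidate.
            continue
        return fmt
    return None
```

Checking the whole column rather than a single value matters for ambiguous inputs: "02/11/2022" alone matches both day-first and month-first formats, and only a value such as "12/31/2021" rules the first one out.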





[jira] [Created] (ARROW-15665) [C++] Add error handling option to StrptimeOptions

2022-02-11 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-15665:
--

 Summary: [C++] Add error handling option to StrptimeOptions
 Key: ARROW-15665
 URL: https://issues.apache.org/jira/browse/ARROW-15665
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Rok Mihevc


We want to have an option to either raise, ignore or return NA in case of 
format mismatch.

See [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html] and lubridate [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html] for examples.
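
The three behaviours can be sketched as follows, in Python for brevity. The errors parameter name and its values are assumptions loosely modelled on the pandas function linked above. None stands in for NA, and "ignore" is taken here to mean returning the input unchanged.

```python
from datetime import datetime


def parse_timestamps(values, fmt, errors="raise"):
    """Parse strings with `fmt`, handling mismatches according to `errors`.

    errors="raise"  -> propagate the ValueError
    errors="null"   -> emit None (standing in for NA) for the bad value
    errors="ignore" -> pass the original string through unchanged
    """
    if errors not in ("raise", "null", "ignore"):
        raise ValueError(f"unknown errors mode: {errors!r}")
    out = []
    for value in values:
        try:
            out.append(datetime.strptime(value, fmt))
        except ValueError:
            if errors == "raise":
                raise
            out.append(None if errors == "null" else value)
    return out
```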





[jira] [Created] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction

2022-02-11 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15664:
--

 Summary: [C++] parquet reader Segfaults with illegal SIMD 
instruction 
 Key: ARROW-15664
 URL: https://issues.apache.org/jira/browse/ARROW-15664
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 7.0.0
Reporter: Jonathan Keane
 Fix For: 7.0.1, 8.0.0


When Arrow is compiled with {{-Os}} (or with build type {{MinSizeRel}}) and we 
run the parquet tests (in R at least, though I imagine pyarrow and C++ will 
have the same issue), reading parquet files segfaults with an illegal opcode on 
systems that don't have BMI2 available. (It turns out the GitHub runners for 
macOS don't have BMI2, so this is easily testable there.)

Somehow the combination of this optimization level and the way our runtime 
detection code works causes the runtime detection we normally use for this to 
fail (though it works just fine with {{-O2}}, {{-O3}}, etc.).

When diagnosing this, I created a branch + PR that runs our R tests after 
installing from brew, which reliably triggers the failure: 
https://github.com/apache/arrow/pull/12364. Other test suites that exercise 
parquet reading would probably have the same issue (or even C++ tests built 
with {{-Os}}).

Here's a coredump:
{code}
2491 Thread_829819
+ 2491 thread_start  (in libsystem_pthread.dylib) + 15  [0x7ff801c3e00f]
+   2491 _pthread_start  (in libsystem_pthread.dylib) + 125  [0x7ff801c424f4]
+ 2491 void* 
std::__1::__thread_proxy >, 
arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*)  (in 
arrow.so) + 380  [0x109203749]
+   2491 arrow::internal::FnOnce::operator()() &&  (in arrow.so) + 
26  [0x109201f30]
+ 2491 arrow::internal::FnOnce::FnImpl >&, 
parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > 
const&, std::__1::vector > const&, 
arrow::internal::Executor*)::$_4&, unsigned long&, 
std::__1::shared_ptr > >::invoke()  (in 
arrow.so) + 98  [0x108f125c2]
+   2491 parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > 
const&, std::__1::vector > const&, 
arrow::internal::Executor*)::$_4::operator()(unsigned long, 
std::__1::shared_ptr) const  (in arrow.so) + 
47  [0x108f11ed5]
+ 2491 parquet::arrow::(anonymous 
namespace)::FileReaderImpl::ReadColumn(int, std::__1::vector > const&, parquet::arrow::ColumnReader*, 
std::__1::shared_ptr*)  (in arrow.so) + 273  [0x108f0c037]
+   2491 parquet::arrow::ColumnReaderImpl::NextBatch(long long, 
std::__1::shared_ptr*)  (in arrow.so) + 39  [0x108f0733b]
+ 2491 parquet::arrow::(anonymous 
namespace)::LeafReader::LoadBatch(long long)  (in arrow.so) + 137  [0x108f0794b]
+   2491 parquet::internal::(anonymous 
namespace)::TypedRecordReader 
>::ReadRecords(long long)  (in arrow.so) + 442  [0x108f4f53e]
+ 2491 parquet::internal::(anonymous 
namespace)::TypedRecordReader 
>::ReadRecordData(long long)  (in arrow.so) + 471  [0x108f50503]
+   2491 void 
parquet::internal::standard::DefLevelsToBitmapSimd(short const*, long 
long, parquet::internal::LevelInfo, 
parquet::internal::ValidityBitmapInputOutput*)  (in arrow.so) + 250  
[0x108fc2a5a]
+ 2491 long long 
parquet::internal::standard::DefLevelsBatchToBitmap(short const*, long 
long, long long, parquet::internal::LevelInfo, 
arrow::internal::FirstTimeBitmapWriter*)  (in arrow.so) + 63  [0x108fc34da]
+   2491 ???  (in )  [0x61354518]
+ 2491 _sigtramp  (in libsystem_platform.dylib) + 
29  [0x7ff801c57e2d]
+   2491 sigactionSegv  (in libR.dylib) + 649  
[0x1042598c9]  main.c:625
+ 2491 Rstd_ReadConsole  (in libR.dylib) + 2042 
 [0x10435160a]  sys-std.c:1044
+   2491 R_SelectEx  (in libR.dylib) + 308  
[0x104350854]  sys-std.c:178
+ 2491 __select  (in 
libsystem_kernel.dylib) + 10  [0x7ff801c0de4a]
{code}

And then a disassembly (where you can see a SHLX that shouldn't be there):

{code}
Dump of assembler code from 0x13ac6db00 to 0x13ac6db99ff:
 ...
   0x00013ac6db82:  mov    $0x8,%ecx
   0x00013ac6db87:  sub    %rax,%rcx
   0x00013ac6db8a:  lea    0xf1520b(%rip),%rdi        # 0x13bb82d9c
   0x00013ac6db91:  movzbl (%rcx,%rdi,1),%edi
   0x00013ac6db95:  mov    %esi,%ebx
   0x00013ac6db97:  and    %edi,%ebx
=> 0x00013ac6db99:  shlx   %rax,%rbx,%rax
   0x00013ac6db9e:  or     0x18(%r15),%al
   0x00013ac6dba2:  mov    %al,0x18(%r15)
   0x00013ac6dba6:  cmp    %rdx,%rcx
   0x00013ac6dba9:  jg     0x13ac6dbf5
   0x00013ac6dbab:  mov

[jira] [Created] (ARROW-15663) [Gandiva][C++] Add TRUNC function

2022-02-11 Thread Johnnathan Rodrigo Pego de Almeida (Jira)
Johnnathan Rodrigo Pego de Almeida created ARROW-15663:
--

 Summary: [Gandiva][C++] Add TRUNC function
 Key: ARROW-15663
 URL: https://issues.apache.org/jira/browse/ARROW-15663
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Johnnathan Rodrigo Pego de Almeida


Returns date truncated to the unit specified by the format. Supported formats: 
MONTH/MON/MM, YEAR/YYYY/YY. Example: trunc('2015-03-17', 'MM') = 2015-03-01.
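
The intended semantics can be sketched in Python as below. This is not the Gandiva implementation. It assumes ISO-formatted input strings, assumes "YYYY" is the middle year alias, and raises on an unsupported format, which is a choice made here for illustration only.

```python
from datetime import date

MONTH_FORMATS = {"MONTH", "MON", "MM"}
YEAR_FORMATS = {"YEAR", "YYYY", "YY"}  # "YYYY" is an assumed alias


def trunc(value, fmt):
    """Truncate an ISO date string to the first day of its month or year."""
    d = date.fromisoformat(value)
    if fmt in MONTH_FORMATS:
        return d.replace(day=1)
    if fmt in YEAR_FORMATS:
        return d.replace(month=1, day=1)
    # Assumption: unsupported formats are treated as an error in this sketch.
    raise ValueError(f"unsupported format: {fmt!r}")
```

With this sketch, trunc('2015-03-17', 'MM') gives 2015-03-01, matching the example in the description, and trunc('2015-03-17', 'YEAR') gives 2015-01-01.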





[jira] [Created] (ARROW-15662) [C++][FlightRPC] Try to upgrade bundled gRPC version

2022-02-11 Thread David Li (Jira)
David Li created ARROW-15662:


 Summary: [C++][FlightRPC] Try to upgrade bundled gRPC version
 Key: ARROW-15662
 URL: https://issues.apache.org/jira/browse/ARROW-15662
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC
Reporter: David Li


We are on v1.35; we should try to upgrade to v1.43 or later.





[jira] [Created] (ARROW-15661) [Gandiva][C++] Add Mask_Hash function

2022-02-11 Thread Johnnathan Rodrigo Pego de Almeida (Jira)
Johnnathan Rodrigo Pego de Almeida created ARROW-15661:
--

 Summary: [Gandiva][C++] Add Mask_Hash function
 Key: ARROW-15661
 URL: https://issues.apache.org/jira/browse/ARROW-15661
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Johnnathan Rodrigo Pego de Almeida


Returns a hashed value based on str. The hash is consistent and can be used to 
join masked values together across tables. 
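
A minimal Python sketch of the idea follows. The description does not name a hash algorithm, so the use of SHA-256 here is an assumption, as are the sample table contents. The point is only that a deterministic hash lets masked keys still match across tables.

```python
import hashlib


def mask_hash(value):
    """Return a deterministic hex digest for a string, or None for None."""
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Two illustrative "tables" keyed by the same sensitive string.
customers = {"alice@example.com": "Alice"}
orders = [("alice@example.com", 42)]

# Mask the key on both sides, then join on the masked value.
masked_customers = {mask_hash(k): v for k, v in customers.items()}
joined = [(masked_customers[mask_hash(email)], amount) for email, amount in orders]
```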





[jira] [Created] (ARROW-15660) Flight without reticulate

2022-02-11 Thread E. David Aja (Jira)
E. David Aja created ARROW-15660:


 Summary: Flight without reticulate
 Key: ARROW-15660
 URL: https://issues.apache.org/jira/browse/ARROW-15660
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: E. David Aja


Requiring reticulate in order to use Flight substantially increases deployment 
complexity.





[jira] [Created] (ARROW-15659) [R] strptime should return NA (not error) with format mismatch

2022-02-11 Thread Jira
Dragoș Moldovan-Grünfeld created ARROW-15659:


 Summary: [R] strptime should return NA (not error) with format 
mismatch 
 Key: ARROW-15659
 URL: https://issues.apache.org/jira/browse/ARROW-15659
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Dragoș Moldovan-Grünfeld


{{base::strptime()}} returns {{NA}} when the value passed to the {{format}} 
argument does not match the string to be parsed. The arrow binding currently 
errors in the same scenario. 

{code:r}
strptime("2022-02-11", format = "%Y-%m-%d")
#> [1] "2022-02-11 GMT"
strptime("2022-02-11", format = "%Y %m-%d")
#> [1] NA
{code}

{code:r}
suppressMessages(library(lubridate))
suppressMessages(library(arrow))
suppressMessages(library(dplyr))

df <- tibble(x = "2022-02-11")

df %>% 
  mutate(z = strptime(x, format = "%Y-%m %d"))
#> # A tibble: 1 × 2
#>   x  z 
#> 
#> 1 2022-02-11 NA

df %>% 
  record_batch() %>% 
  mutate(z = strptime(x, format = "%Y-%m %d")) %>% 
  collect()
#> Error: Invalid: Failed to parse string: '2022-02-11' as a scalar of type timestamp[ms]
{code}


