[jira] [Created] (ARROW-15668) Parameterize integration tests by language
Jorge Leitão created ARROW-15668: Summary: Parameterize integration tests by language Key: ARROW-15668 URL: https://issues.apache.org/jira/browse/ARROW-15668 Project: Apache Arrow Issue Type: Test Components: Developer Tools, Integration Reporter: Jorge Leitão Currently, the matrix of which integration tests to run is defined on a per-test basis and is written in Python. I propose that we lift it into a common configuration file (.yaml) that declares which tests each language supports; the harness would then use it during test execution to skip files that are not supported. This gives a better overview of which languages support what. -- This message was sent by Atlassian Jira (v8.20.1#820001)
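The proposal above could look roughly like the following sketch. It is not the existing Arrow harness: the dictionary stands in for the parsed contents of the proposed .yaml file, and the language names, test names, and the helpers `is_supported` and `plan_run` are all hypothetical.

```python
# Hypothetical stand-in for the parsed contents of the proposed .yaml file:
# each language maps to the set of integration tests it declares support for.
SUPPORT_MATRIX = {
    "cpp": {"primitive", "decimal", "dictionary"},
    "rust": {"primitive", "dictionary"},
    "go": {"primitive"},
}


def is_supported(language, test_name, matrix=SUPPORT_MATRIX):
    """Return True if `language` declares support for `test_name`."""
    return test_name in matrix.get(language, set())


def plan_run(producer, consumer, test_names, matrix=SUPPORT_MATRIX):
    """Split tests into (run, skip) for one producer/consumer pair.

    A test runs only when both sides declare support for it.
    """
    run, skip = [], []
    for name in test_names:
        if is_supported(producer, name, matrix) and is_supported(consumer, name, matrix):
            run.append(name)
        else:
            skip.append(name)
    return run, skip


if __name__ == "__main__":
    print(plan_run("cpp", "rust", ["primitive", "decimal", "dictionary"]))
```

Keeping the matrix as plain data means the same file can drive both the skip logic and a generated support table.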
[jira] [Created] (ARROW-15667) [R] Windows build can fail if building only shared libraries
Will Jones created ARROW-15667: Summary: [R] Windows build can fail if building only shared libraries Key: ARROW-15667 URL: https://issues.apache.org/jira/browse/ARROW-15667 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 7.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 8.0.0 This should only affect dev environments. I noticed that a build with only shared libraries fails because it expects arrow_bundled_dependencies, which I think we only build as part of static builds.
[jira] [Created] (ARROW-15666) [C++] Add format inference option to StrptimeOptions
Rok Mihevc created ARROW-15666: Summary: [C++] Add format inference option to StrptimeOptions Key: ARROW-15666 URL: https://issues.apache.org/jira/browse/ARROW-15666 Project: Apache Arrow Issue Type: Improvement Reporter: Rok Mihevc We want an option to infer the timestamp format. See [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html] and lubridate [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html] for examples.
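One simple way to think about format inference is to try a list of candidate formats and keep the first one that parses every value. The sketch below illustrates that idea in plain Python; it is not how Arrow, pandas, or lubridate implement inference, and `CANDIDATE_FORMATS` and `infer_format` are hypothetical names.

```python
from datetime import datetime

# Hypothetical candidate list; a real option would need a far richer set.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y", "%Y-%m-%dT%H:%M:%S")


def infer_format(values, candidates=CANDIDATE_FORMATS):
    """Return the first candidate format that parses every value, else None."""
    for fmt in candidates:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value did not match; try the next format
        return fmt
    return None


if __name__ == "__main__":
    print(infer_format(["2022-02-11", "2021-12-31"]))
```

Requiring one format for the whole input keeps the result unambiguous; mixed-format input yields `None` instead of a guess.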
[jira] [Created] (ARROW-15665) [C++] Add error handling option to StrptimeOptions
Rok Mihevc created ARROW-15665: Summary: [C++] Add error handling option to StrptimeOptions Key: ARROW-15665 URL: https://issues.apache.org/jira/browse/ARROW-15665 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Rok Mihevc We want an option to either raise, ignore, or return NA in case of a format mismatch. See [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html] and lubridate [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html] for examples.
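The three behaviours requested above can be sketched in plain Python as follows. This is an illustration only, not the StrptimeOptions API: the function `parse_timestamps`, its `errors` parameter, and the mode names are hypothetical, and treating "ignore" as "pass the unparsed string through" is an assumption.

```python
from datetime import datetime


def parse_timestamps(values, fmt, errors="raise"):
    """Parse strings with `fmt`, handling mismatches according to `errors`.

    errors="raise"  -> propagate the parsing error
    errors="ignore" -> keep the original, unparsed string (assumed meaning)
    errors="null"   -> emit None, standing in for NA
    """
    if errors not in ("raise", "ignore", "null"):
        raise ValueError(f"unknown errors mode: {errors!r}")
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            if errors == "raise":
                raise
            parsed.append(value if errors == "ignore" else None)
    return parsed


if __name__ == "__main__":
    print(parse_timestamps(["2022-02-11", "11/02/2022"], "%Y-%m-%d", errors="null"))
```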
[jira] [Created] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction
Jonathan Keane created ARROW-15664: Summary: [C++] parquet reader Segfaults with illegal SIMD instruction Key: ARROW-15664 URL: https://issues.apache.org/jira/browse/ARROW-15664 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Reporter: Jonathan Keane Fix For: 7.0.1, 8.0.0 When compiling with {{-Os}} (or with release type {{MinSizeRel}}) and running the Parquet tests (in R at least, though I imagine pyarrow and C++ will have the same issue!), we get a segfault with an illegal opcode when trying to read Parquet files on systems that don't have BMI2 available. (It turns out the GitHub runners for macOS don't have BMI2, so this is easily testable there!) Somehow, the optimization combined with the way our runtime detection code works makes the runtime detection we normally use for this fail (though it works just fine with {{-O2}}, {{-O3}}, etc.). When diagnosing this, I created a branch + PR that runs our R tests after installing from brew, which can reliably cause this to happen: https://github.com/apache/arrow/pull/12364 Other test suites that exercise Parquet reading would probably have the same issue (or even C++ tests built with {{-Os}}).
Here's a coredump:
{code}
2491 Thread_829819
+ 2491 thread_start (in libsystem_pthread.dylib) + 15 [0x7ff801c3e00f]
+ 2491 _pthread_start (in libsystem_pthread.dylib) + 125 [0x7ff801c424f4]
+ 2491 void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) (in arrow.so) + 380 [0x109203749]
+ 2491 arrow::internal::FnOnce::operator()() && (in arrow.so) + 26 [0x109201f30]
+ 2491 arrow::internal::FnOnce::FnImpl >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > const&, std::__1::vector > const&, arrow::internal::Executor*)::$_4&, unsigned long&, std::__1::shared_ptr > >::invoke() (in arrow.so) + 98 [0x108f125c2]
+ 2491 parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > const&, std::__1::vector > const&, arrow::internal::Executor*)::$_4::operator()(unsigned long, std::__1::shared_ptr) const (in arrow.so) + 47 [0x108f11ed5]
+ 2491 parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::__1::vector > const&, parquet::arrow::ColumnReader*, std::__1::shared_ptr*) (in arrow.so) + 273 [0x108f0c037]
+ 2491 parquet::arrow::ColumnReaderImpl::NextBatch(long long, std::__1::shared_ptr*) (in arrow.so) + 39 [0x108f0733b]
+ 2491 parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long long) (in arrow.so) + 137 [0x108f0794b]
+ 2491 parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long long) (in arrow.so) + 442 [0x108f4f53e]
+ 2491 parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecordData(long long) (in arrow.so) + 471 [0x108f50503]
+ 2491 void parquet::internal::standard::DefLevelsToBitmapSimd(short const*, long long, parquet::internal::LevelInfo, parquet::internal::ValidityBitmapInputOutput*) (in arrow.so) + 250 [0x108fc2a5a]
+ 2491 long long parquet::internal::standard::DefLevelsBatchToBitmap(short const*, long long, long long, parquet::internal::LevelInfo, arrow::internal::FirstTimeBitmapWriter*) (in arrow.so) + 63 [0x108fc34da]
+ 2491 ??? (in ) [0x61354518]
+ 2491 _sigtramp (in libsystem_platform.dylib) + 29 [0x7ff801c57e2d]
+ 2491 sigactionSegv (in libR.dylib) + 649 [0x1042598c9] main.c:625
+ 2491 Rstd_ReadConsole (in libR.dylib) + 2042 [0x10435160a] sys-std.c:1044
+ 2491 R_SelectEx (in libR.dylib) + 308 [0x104350854] sys-std.c:178
+ 2491 __select (in libsystem_kernel.dylib) + 10 [0x7ff801c0de4a]
{code}
And then a disassembly (where you can see a SHLX that shouldn't be there):
{code}
Dump of assembler code from 0x13ac6db00 to 0x13ac6db99ff:
...
--Type for more, q to quit, c to continue without paging--
   0x00013ac6db82: mov $0x8,%ecx
   0x00013ac6db87: sub %rax,%rcx
   0x00013ac6db8a: lea 0xf1520b(%rip),%rdi # 0x13bb82d9c
   0x00013ac6db91: movzbl (%rcx,%rdi,1),%edi
   0x00013ac6db95: mov %esi,%ebx
   0x00013ac6db97: and %edi,%ebx
=> 0x00013ac6db99: shlx %rax,%rbx,%rax
   0x00013ac6db9e: or 0x18(%r15),%al
   0x00013ac6dba2: mov %al,0x18(%r15)
   0x00013ac6dba6: cmp %rdx,%rcx
   0x00013ac6dba9: jg 0x13ac6dbf5
   0x00013ac6dbab: mov
[jira] [Created] (ARROW-15663) [Gandiva][C++] Add TRUNC function
Johnnathan Rodrigo Pego de Almeida created ARROW-15663: Summary: [Gandiva][C++] Add TRUNC function Key: ARROW-15663 URL: https://issues.apache.org/jira/browse/ARROW-15663 Project: Apache Arrow Issue Type: New Feature Components: C++ - Gandiva Reporter: Johnnathan Rodrigo Pego de Almeida Returns the date truncated to the unit specified by the format. Supported formats: MONTH/MON/MM, YEAR//YY. Example: trunc('2015-03-17', 'MM') = 2015-03-01.
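The semantics described above can be sketched in plain Python. This is an illustration, not the Gandiva implementation: the function name `trunc`, the ISO date-string input, and case-insensitive format matching are assumptions, as is truncating a year to January 1. Only the format names that are legible in the issue are included.

```python
from datetime import date

# Format names taken from the issue text; matching is case-insensitive here
# by assumption.
_UNITS = {"MONTH": "month", "MON": "month", "MM": "month",
          "YEAR": "year", "YY": "year"}


def trunc(value, fmt):
    """Truncate an ISO date string to the first day of its month or year."""
    unit = _UNITS.get(fmt.upper())
    if unit is None:
        raise ValueError(f"unsupported format: {fmt!r}")
    parsed = date.fromisoformat(value)
    if unit == "month":
        return parsed.replace(day=1)
    # Assumption: year truncation returns January 1 of that year.
    return parsed.replace(month=1, day=1)


if __name__ == "__main__":
    print(trunc("2015-03-17", "MM"))
```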
[jira] [Created] (ARROW-15662) [C++][FlightRPC] Try to upgrade bundled gRPC version
David Li created ARROW-15662: Summary: [C++][FlightRPC] Try to upgrade bundled gRPC version Key: ARROW-15662 URL: https://issues.apache.org/jira/browse/ARROW-15662 Project: Apache Arrow Issue Type: Improvement Components: C++, FlightRPC Reporter: David Li We are on v1.35; we should try to upgrade to v1.43 or later.
[jira] [Created] (ARROW-15661) [Gandiva][C++] Add Mask_Hash function
Johnnathan Rodrigo Pego de Almeida created ARROW-15661: Summary: [Gandiva][C++] Add Mask_Hash function Key: ARROW-15661 URL: https://issues.apache.org/jira/browse/ARROW-15661 Project: Apache Arrow Issue Type: New Feature Components: C++ - Gandiva Reporter: Johnnathan Rodrigo Pego de Almeida Returns a hashed value based on str. The hash is consistent and can be used to join masked values together across tables.
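The property that matters above is determinism: equal inputs hash to equal outputs, so masked columns can still be joined. A minimal Python sketch follows. The issue does not name a hash algorithm, so SHA-256 is an assumption, and `mask_hash` and the sample tables are hypothetical.

```python
import hashlib


def mask_hash(value):
    """Return a deterministic hex digest for `value`; None stays None.

    SHA-256 is an assumption; the issue does not specify the algorithm.
    """
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two hypothetical tables keyed by the same sensitive string.
    orders = {mask_hash("alice@example.com"): "order-1"}
    logins = {mask_hash("alice@example.com"): "2022-02-11"}
    # The masked keys still match, so the rows can be joined.
    joined = {k: (orders[k], logins[k]) for k in orders.keys() & logins.keys()}
    print(len(joined))
```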
[jira] [Created] (ARROW-15660) Flight without reticulate
E. David Aja created ARROW-15660: Summary: Flight without reticulate Key: ARROW-15660 URL: https://issues.apache.org/jira/browse/ARROW-15660 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: E. David Aja Requiring reticulate for the use of flight is a substantial increase in deployment complexity;
[jira] [Created] (ARROW-15659) [R] strptime should return NA (not error) with format mismatch
Dragoș Moldovan-Grünfeld created ARROW-15659: Summary: [R] strptime should return NA (not error) with format mismatch Key: ARROW-15659 URL: https://issues.apache.org/jira/browse/ARROW-15659 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Dragoș Moldovan-Grünfeld {{base::strptime()}} returns {{NA}} when the value passed to the {{format}} argument does not match the string to be parsed. The arrow binding currently errors in the same scenario.
{code:r}
strptime("2022-02-11", format = "%Y-%m-%d")
#> [1] "2022-02-11 GMT"
strptime("2022-02-11", format = "%Y %m-%d")
#> [1] NA
{code}
{code:r}
suppressMessages(library(lubridate))
suppressMessages(library(arrow))
suppressMessages(library(dplyr))
df <- tibble(x = "2022-02-11")
df %>% mutate(z = strptime(x, format = "%Y-%m %d"))
#> # A tibble: 1 × 2
#> x z
#>
#> 1 2022-02-11 NA
df %>% record_batch() %>% mutate(z = strptime(x, format = "%Y-%m %d")) %>% collect()
#> Error: Invalid: Failed to parse string: '2022-02-11' as a scalar of type timestamp[ms]
{code}