[jira] [Created] (ARROW-9640) [C++][Gandiva] Implement round() for integers and long integers
Sagnik Chakraborty created ARROW-9640:

Summary: [C++][Gandiva] Implement round() for integers and long integers
Key: ARROW-9640
URL: https://issues.apache.org/jira/browse/ARROW-9640
Project: Apache Arrow
Issue Type: Task
Reporter: Sagnik Chakraborty
[jira] [Created] (ARROW-9639) [Ruby] Add dependency version check
Kouhei Sutou created ARROW-9639:

Summary: [Ruby] Add dependency version check
Key: ARROW-9639
URL: https://issues.apache.org/jira/browse/ARROW-9639
Project: Apache Arrow
Issue Type: Improvement
Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
[jira] [Created] (ARROW-9638) [C++][Compute] Implement mode(most frequent number) kernel
Yibo Cai created ARROW-9638:

Summary: [C++][Compute] Implement mode(most frequent number) kernel
Key: ARROW-9638
URL: https://issues.apache.org/jira/browse/ARROW-9638
Project: Apache Arrow
Issue Type: New Feature
Reporter: Yibo Cai
Assignee: Yibo Cai
[jira] [Created] (ARROW-9637) Speed degradation with categoricals
Larry Parker created ARROW-9637:

Summary: Speed degradation with categoricals
Key: ARROW-9637
URL: https://issues.apache.org/jira/browse/ARROW-9637
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Larry Parker

I have noticed some major speed degradation when using categorical data types. For example, with a Parquet file of 1 million rows, I run a query that sums 10 float columns and groups by two columns (one a date column and one a category column). The cardinality of the category column seems to have a major effect. When grouping on a category column of cardinality 10, performance is decent (the query runs in 150 ms), but with a cardinality of 100 the query runs in 10 seconds. If I switch over to my Parquet file that does *not* have categorical columns, the same query that took 10 seconds with categoricals runs in 350 ms. I would be happy to post the Pandas code I'm using (including how I'm creating the Parquet file), but I first wanted to report this and see if it's a known issue. Thanks.
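A minimal sketch of the kind of workload described above. The reporter's actual code was not posted, so the column names, sizes, and cardinality below are hypothetical:

{code}
import numpy as np
import pandas as pd

# Hypothetical data shaped like the report: 1 million rows, 10 float columns,
# one date column and one categorical column whose cardinality can be varied.
n_rows, cardinality = 1_000_000, 100
df = pd.DataFrame(np.random.randn(n_rows, 10), columns=[f"f{i}" for i in range(10)])
df["date"] = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.random.randint(0, 365, n_rows), unit="D")
df["cat"] = pd.Series(np.random.randint(0, cardinality, n_rows).astype(str), dtype="category")
df.to_parquet("with_categorical.parquet")  # requires a Parquet engine such as pyarrow

# The reported query: read the file back, sum the float columns,
# and group by the date and category columns.
df2 = pd.read_parquet("with_categorical.parquet")
result = df2.groupby(["date", "cat"]).sum()
print(result.shape)
{code}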
[jira] [Created] (ARROW-9636) Error when using 'LZO' compression in write_table
Pierre created ARROW-9636:

Summary: Error when using 'LZO' compression in write_table
Key: ARROW-9636
URL: https://issues.apache.org/jira/browse/ARROW-9636
Project: Apache Arrow
Issue Type: Bug
Reporter: Pierre
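The report has no description, but the call that presumably triggers the error looks like the following (hypothetical table and file name; as far as I know the LZO codec is not implemented in Arrow C++, so the writer raises when it is requested):

{code}
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table; any table reproduces the call pattern.
table = pa.table({"x": [1, 2, 3]})

# Passing 'LZO' as the codec is expected to raise, since Arrow C++ does not
# ship an LZO implementation even though the Parquet format defines one.
pq.write_table(table, "data.parquet", compression="LZO")
{code}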
[jira] [Created] (ARROW-9635) Can't install red-arrow 0.1.7.1
Natasha created ARROW-9635:

Summary: Can't install red-arrow 0.1.7.1
Key: ARROW-9635
URL: https://issues.apache.org/jira/browse/ARROW-9635
Project: Apache Arrow
Issue Type: Bug
Components: Ruby
Affects Versions: 0.17.1
Reporter: Natasha

I need some help with this error. This was working OK 3 days ago; now I'm getting this:

{code}
Dependencies installed:
  libarrow17 libarrow-glib17 gir1.2-arrow-1.0 \
  libparquet17 libparquet-glib17 gir1.2-parquet-1.0 \
  libarrow-dev libarrow-glib-dev libparquet-dev libparquet-glib-dev

ERROR: Error installing red-parquet:
  ERROR: Failed to build gem native extension.

current directory: /usr/local/bundle/gems/red-arrow-0.17.1/ext/arrow
/usr/local/bin/ruby -r ./siteconf20200803-9-v6ubfi.rb extconf.rb
checking --enable-debug-build option... no
checking C++ compiler... g++
checking g++ version... 6.3 (gnu++14)
mkmf-gnome2 is deprecated. Use mkmf-gnome instead.
checking for --enable-debug-build option... no
checking for -Wall option to compiler... yes
checking for -Waggregate-return option to compiler... yes
checking for -Wcast-align option to compiler... yes
checking for -Wextra option to compiler... yes
checking for -Wformat=2 option to compiler... yes
checking for -Winit-self option to compiler... yes
checking for -Wlarger-than-65500 option to compiler... yes
checking for -Wmissing-declarations option to compiler... yes
checking for -Wmissing-format-attribute option to compiler... yes
checking for -Wmissing-include-dirs option to compiler... yes
checking for -Wmissing-noreturn option to compiler... yes
checking for -Wmissing-prototypes option to compiler... yes
checking for -Wnested-externs option to compiler... yes
checking for -Wold-style-definition option to compiler... yes
checking for -Wpacked option to compiler... yes
checking for -Wp,-D_FORTIFY_SOURCE=2 option to compiler... yes
checking for -Wpointer-arith option to compiler... yes
checking for -Wundef option to compiler... yes
checking for -Wout-of-line-declaration option to compiler... no
checking for -Wunsafe-loop-optimizations option to compiler... yes
checking for -Wwrite-strings option to compiler... yes
checking for Homebrew... no
checking for arrow... yes
checking for arrow-glib... yes
creating Makefile

current directory: /usr/local/bundle/gems/red-arrow-0.17.1/ext/arrow
make "DESTDIR=" clean

current directory: /usr/local/bundle/gems/red-arrow-0.17.1/ext/arrow
make "DESTDIR="
compiling arrow.cpp
compiling converters.cpp
In file included from converters.cpp:20:0:
converters.hpp:258:19: error: ‘arrow::Status red_arrow::ListArrayValueConverter::Visit(const arrow::UnionArray&)’ marked ‘override’, but does not override
   arrow::Status Visit(const arrow::TYPE ## Array& array) override { \
                 ^
converters.hpp:288:5: note: in expansion of macro ‘VISIT’
     VISIT(Union)
     ^
converters.hpp:360:19: error: ‘arrow::Status red_arrow::StructArrayValueConverter::Visit(const arrow::UnionArray&)’ marked ‘override’, but does not override
   arrow::Status Visit(const arrow::TYPE ## Array& array) override { \
                 ^
converters.hpp:391:5: note: in expansion of macro ‘VISIT’
     VISIT(Union)
     ^
converters.hpp: In member function ‘VALUE red_arrow::StructArrayValueConverter::convert(const arrow::StructArray&, int64_t)’:
converters.hpp:342:48: warning: ‘int arrow::DataType::num_children() const’ is deprecated: Use num_fields() [-Wdeprecated-declarations]
     const auto n = struct_type->num_children();
                                                ^
In file included from /usr/include/arrow/array/array_base.h:31:0,
                 from /usr/include/arrow/array.h:25,
                 from /usr/include/arrow/api.h:22,
                 from red-arrow.hpp:22,
                 from converters.hpp:20,
                 from converters.cpp:20:
/usr/include/arrow/type.h:139:7: note: declared here
   int num_children() const { return num_fields(); }
       ^~~~
In file included from converters.cpp:20:0:
converters.hpp:344:53: warning: ‘const std::shared_ptr& arrow::DataType::child(int) const’ is deprecated: Use field(i) [-Wdeprecated-declarations]
     const auto field_type = struct_type->child(i).get();
                                                     ^
In file included from /usr/include/arrow/array/array_base.h:31:0,
                 from /usr/include/arrow/array.h:25,
                 from /usr/include/arrow/api.h:22,
                 from red-arrow.hpp:22,
                 from converters.hpp:20,
                 from converters.cpp:20:
/usr/include/arrow/type.h:127:33: note: declared here
   const std::shared_ptr& child(int i) const { return field(i); }
                                 ^
In file included from converters.cpp:20:0:
converters.hpp: At global scope:
converters.hpp:451:19: error: ‘arrow::Status
{code}
[jira] [Created] (ARROW-9634) [C++][Python] Restore non-UTC time zones when reading Parquet file that was previously Arrow
Wes McKinney created ARROW-9634:

Summary: [C++][Python] Restore non-UTC time zones when reading Parquet file that was previously Arrow
Key: ARROW-9634
URL: https://issues.apache.org/jira/browse/ARROW-9634
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Wes McKinney
Fix For: 2.0.0

This was reported on the mailing list:

{code}
In [20]: df = pd.DataFrame({'a': pd.Series(np.arange(0, 1, 1000)).astype(pd.DatetimeTZDtype('ns', 'America/Los_Angeles'
    ...: ))})

In [21]: t = pa.table(df)

In [22]: t
Out[22]:
pyarrow.Table
a: timestamp[ns, tz=America/Los_Angeles]

In [23]: pq.write_table(t, 'test.parquet')

In [24]: pq.read_table('test.parquet')
Out[24]:
pyarrow.Table
a: timestamp[us, tz=UTC]

In [25]: pq.read_table('test.parquet')[0]
Out[25]:
[
  [
    1970-01-01 00:00:00.00,
    1970-01-01 00:00:00.01,
    1970-01-01 00:00:00.02,
    1970-01-01 00:00:00.03,
    1970-01-01 00:00:00.04,
    1970-01-01 00:00:00.05,
    1970-01-01 00:00:00.06,
    1970-01-01 00:00:00.07,
    1970-01-01 00:00:00.08,
    1970-01-01 00:00:00.09
  ]
]
{code}
[jira] [Created] (ARROW-9633) [C++] Do not toggle memory mapping globally in LocalFileSystem
Wes McKinney created ARROW-9633:

Summary: [C++] Do not toggle memory mapping globally in LocalFileSystem
Key: ARROW-9633
URL: https://issues.apache.org/jira/browse/ARROW-9633
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 2.0.0

In the context of the Datasets API, some file formats benefit greatly from memory mapping (like Arrow IPC files) while others benefit less. Additionally, in some scenarios memory mapping can fail when used on network-attached storage devices. Since a filesystem may be used to read different kinds of files, with and without memory mapping, and since the Datasets API should be able to fall back on non-memory-mapped reads if the attempt to memory map fails, it would make sense to have a non-global option for this: https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h

I would suggest adding a new filesystem API, something like {{OpenMappedInputFile}}, with options to control the behavior when memory mapping is not possible (see the sketch after this list). These options may be among:

* Falling back on a normal RandomAccessFile
* Reading the entire file into memory (or even tmpfs?) and then wrapping it in a BufferReader
* Failing
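Not the proposed C++ API, but a small Python sketch of the fallback semantics described above, using existing pyarrow calls; the function name and the `on_mmap_failure` knob are hypothetical names for illustration:

{code}
import pyarrow as pa

def open_input_file_maybe_mapped(path, on_mmap_failure="fallback"):
    """Try to memory-map `path`; apply a fallback strategy if mapping fails."""
    try:
        return pa.memory_map(path, "r")
    except OSError:
        if on_mmap_failure == "fallback":
            # Option 1: fall back on a normal random-access file.
            return pa.OSFile(path, "rb")
        if on_mmap_failure == "buffer":
            # Option 2: read the entire file into memory and wrap it in a BufferReader.
            with pa.OSFile(path, "rb") as f:
                return pa.BufferReader(f.read_buffer())
        # Option 3: fail.
        raise
{code}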
[jira] [Created] (ARROW-9632) add a func "new" for ExecutionContextSchemaProvider
qingcheng wu created ARROW-9632:

Summary: add a func "new" for ExecutionContextSchemaProvider
Key: ARROW-9632
URL: https://issues.apache.org/jira/browse/ARROW-9632
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Affects Versions: 2.0.0
Reporter: qingcheng wu

I use ExecutionContextSchemaProvider in an outside app, so I added the "pub" keyword to ExecutionContextSchemaProvider and added a "new" function for it. I also added the "pub" keyword to build_schema.
[jira] [Created] (ARROW-9631) [Rust] Arrow crate should not depend on flight
Andy Grove created ARROW-9631:

Summary: [Rust] Arrow crate should not depend on flight
Key: ARROW-9631
URL: https://issues.apache.org/jira/browse/ARROW-9631
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Andy Grove
Fix For: 2.0.0

It seems that the dependencies are inverted. The core arrow crate should contain the array data structures and compute kernels and should not depend on the flight crate, which contains protocols and brings in many dependencies. If we have code for converting between arrow types and flight types, then that code should live in the flight crate.
[jira] [Created] (ARROW-9630) [Go] Support JSON reader/writer
Ryo Okubo created ARROW-9630:

Summary: [Go] Support JSON reader/writer
Key: ARROW-9630
URL: https://issues.apache.org/jira/browse/ARROW-9630
Project: Apache Arrow
Issue Type: New Feature
Components: Go
Reporter: Ryo Okubo

Is there any plan to support a JSON reader and/or writer in the Go implementation? I would like something like the existing [CSV R/W|https://github.com/apache/arrow/blob/master/docs/source/status.rst#third-party-data-formats]. The [arrjson package|https://github.com/apache/arrow/tree/master/go/arrow/internal/arrjson] seems to support it, but it's an internal package.
[jira] [Created] (ARROW-9629) [Python] Kartothek integration tests failing due to missing freezegun module
Joris Van den Bossche created ARROW-9629:

Summary: [Python] Kartothek integration tests failing due to missing freezegun module
Key: ARROW-9629
URL: https://issues.apache.org/jira/browse/ARROW-9629
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche

See e.g. https://github.com/ursa-labs/crossbow/runs/939266052

{code}
ERRORS
ERROR collecting test session
/opt/conda/envs/arrow/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1006: in _gcd_import
    ???
<frozen importlib._bootstrap>:983: in _find_and_load
    ???
<frozen importlib._bootstrap>:967: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:677: in _load_unlocked
    ???
/opt/conda/envs/arrow/lib/python3.7/site-packages/_pytest/assertion/rewrite.py:170: in exec_module
    exec(co, module.__dict__)
tests/cli/conftest.py:11: in <module>
    from freezegun import freeze_time
E   ModuleNotFoundError: No module named 'freezegun'
{code}
[jira] [Created] (ARROW-9628) [Rust][DataFusion] Clippy PR test failing intermittently on Rust / AMD64 MacOS
Andrew Lamb created ARROW-9628:

Summary: [Rust][DataFusion] Clippy PR test failing intermittently on Rust / AMD64 MacOS
Key: ARROW-9628
URL: https://issues.apache.org/jira/browse/ARROW-9628
Project: Apache Arrow
Issue Type: Bug
Reporter: Andrew Lamb

As reported by Jorge on https://github.com/apache/arrow/commit/aa6889a74c57d6faea0d27ea8013d9b0c7ef809a#commitcomment-41124305:

"I believe that this is somehow interacting with the caching system and sometimes failing the build of clippy. E.g. this build is failing for Mac OS, and it hits the cache: https://github.com/apache/arrow/runs/937976656"

{code}
  Downloaded heck v0.3.1
  Downloaded aho-corasick v0.7.13
  Downloaded fnv v1.0.7
  Downloaded futures-io v0.3.5
  Downloaded base64 v0.11.0
  Downloaded dirs v1.0.5
  Downloaded async-stream-impl v0.2.1
  Downloaded async-stream v0.2.1
  Downloaded anyhow v1.0.32
  Downloaded atty v0.2.14
  Downloaded num-integer v0.1.43
   Compiling arrow-flight v2.0.0-SNAPSHOT (/Users/runner/work/arrow/arrow/rust/arrow-flight)
error[E0463]: can't find crate for `prost_derive` which `tonic_build` depends on
 --> arrow-flight/build.rs:36:9
  |
36 |         tonic_build::compile_protos("../../format/Flight.proto")?;
  |         ^^^ can't find crate

error: aborting due to previous error
{code}
[jira] [Created] (ARROW-9627) JVM failed when use gandiva udf with dynamic libraries
Leo89 created ARROW-9627:

Summary: JVM failed when use gandiva udf with dynamic libraries
Key: ARROW-9627
URL: https://issues.apache.org/jira/browse/ARROW-9627
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Gandiva, Java
Environment: OS: CentOS 7.4, llvm: 7.0.1, jdk: 1.8.0_162, arrow: 1.0.0
Reporter: Leo89

Hi there,

Recently I have been trying to add a UDF that uses a dynamic link library. It compiles and passes its tests fine in C++, but when I call the UDF from Java, the JVM fails with errors.

Steps to reproduce the issue:

1. Prepare the dynamic library 'libmytest.so':
{code:java}
// code placeholder
#ifndef MYTEST_H
#define MYTEST_H

#ifdef __cplusplus
extern "C" {
#endif

float testSim();

#ifdef __cplusplus
}
#endif

#endif
{code}

2. Add the code for the UDF in the file 'string_ops.cc':
{code:java}
// code placeholder
FORCE_INLINE
gdv_float32 test_sim_binary_binary(gdv_int64 context, const char* left, gdv_int32 left_len,
                                   const char* right, gdv_int32 right_len) {
  float sim = testSim();
  return sim;
}
{code}

3. Add the function details in the function registry file 'function_registry_string.cc':
{code:java}
// code placeholder
NativeFunction("test_sim", {}, DataTypeVector{binary(), binary()}, float32(), kResultNullIfNull,
               "sim_binary_binary",
               NativeFunction::kNeedsContext | NativeFunction::kCanReturnErrors),
{code}

4. Create test functions.
5. Add the link to CMakeLists.txt.
6. Compile and run the tests.
7. Write a Java demo to call the UDF.
[jira] [Created] (ARROW-9623) Performance difference between pc.multiply vs pd.multiply
H G created ARROW-9623:

Summary: Performance difference between pc.multiply vs pd.multiply
Key: ARROW-9623
URL: https://issues.apache.org/jira/browse/ARROW-9623
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 1.0.0
Environment: Windows, Pyarrow 1.0.0
Reporter: H G

Wanted to report the performance difference observed between Pandas and Pyarrow.

```
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

df = pd.DataFrame(np.random.randn(1))
%timeit -n 5 -r 5 df.multiply(df)

table = pa.Table.from_pandas(df)
%timeit -n 5 -r 5 pc.multiply(table[0], table[0])
```

Results:

```
%timeit -n 5 -r 5 df.multiply(df)
374 ms ± 15.9 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
```

```
%timeit -n 5 -r 5 pc.multiply(table[0], table[0])
698 ms ± 297 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
```