[jira] [Created] (ARROW-17296) [Python] Doctest failure in pyarrow.parquet.read_metadata after 10.0.0 dev version update
Wes McKinney created ARROW-17296: Summary: [Python] Doctest failure in pyarrow.parquet.read_metadata after 10.0.0 dev version update Key: ARROW-17296 URL: https://issues.apache.org/jira/browse/ARROW-17296 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 10.0.0 The version update caused the doctest in this function to fail -- This message was sent by Atlassian Jira (v8.20.10#820010)
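A doctest that prints a library version string breaks on every dev-version bump. One generic way to make such doctests robust is the ELLIPSIS directive, sketched here with a hypothetical `describe()` function (this is not pyarrow's actual docstring, just an illustration of the failure mode and one possible remedy):

```python
import doctest

def describe():
    """Return a version banner.

    The volatile dev-version suffix is elided with ELLIPSIS so the
    doctest keeps passing across version bumps:

    >>> describe()  # doctest: +ELLIPSIS
    'version ...'
    """
    return "version 10.0.0.dev123"

# Run just this docstring; passing globs explicitly keeps the sketch
# self-contained regardless of how the module is executed.
failed = attempted = 0
for test in doctest.DocTestFinder().find(describe, globs={"describe": describe}):
    result = doctest.DocTestRunner().run(test)
    failed += result.failed
    attempted += result.attempted
```

pyarrow may have fixed the doctest differently; the point is only that version-dependent output can be elided rather than hard-coded.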
[jira] [Created] (ARROW-17259) [C++] Use shared_ptr less throughout arrow/compute
Wes McKinney created ARROW-17259: Summary: [C++] Use shared_ptr less throughout arrow/compute Key: ARROW-17259 URL: https://issues.apache.org/jira/browse/ARROW-17259 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 10.0.0 It turns out we generate a ton of code just copying and manipulating {{shared_ptr}} throughout arrow/compute, especially in the configuration of the function/kernel registry. One function, {{RegisterScalarArithmetic}}, generates around 300kb of code, which on inspection of the disassembly contains a significant amount of inlined shared_ptr template code. I made an attempt at refactoring things to use {{const DataType*}} in function signatures, which removes quite a bit of code bloat and puts us on a path to using fewer shared_ptrs in general -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17135) [C++] Reduce code size in arrow/compute/kernels/scalar_compare.cc
Wes McKinney created ARROW-17135: Summary: [C++] Reduce code size in arrow/compute/kernels/scalar_compare.cc Key: ARROW-17135 URL: https://issues.apache.org/jira/browse/ARROW-17135 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney I had noticed the large symbol sizes in scalar_compare.cc when looking at the shared library. I had a quick hack on the plane to try to reduce the code size -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17129) [C++][Compute] Improve memory efficiency in Grouper
Wes McKinney created ARROW-17129: Summary: [C++][Compute] Improve memory efficiency in Grouper Key: ARROW-17129 URL: https://issues.apache.org/jira/browse/ARROW-17129 Project: Apache Arrow Issue Type: Improvement Reporter: Wes McKinney There are APIs in arrow::compute::Grouper (GetUniques, Consume) that could be refactored to write into preallocated memory, or otherwise given a mode that performs less mandatory allocation. We can investigate at some point -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17100) [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written prior to 3.0.0 per ARROW-10353
Wes McKinney created ARROW-17100: Summary: [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written prior to 3.0.0 per ARROW-10353 Key: ARROW-17100 URL: https://issues.apache.org/jira/browse/ARROW-17100 Project: Apache Arrow Issue Type: Bug Components: C++, Parquet Reporter: Wes McKinney Fix For: 9.0.0 As described in https://lists.apache.org/thread/xkrhgfpk9sr1mj74d4chz3r5yp3szt6c, https://github.com/apache/arrow/commit/ef0feb2c9c959681d8a105cbadc1ae6580789e69 caused some files written prior to 3.0.0 to be unreadable. Given that the patch was small, this will hopefully not be too difficult to fix -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17099) [Python] pyarrow build does not support RELWITHDEBINFO build type
Wes McKinney created ARROW-17099: Summary: [Python] pyarrow build does not support RELWITHDEBINFO build type Key: ARROW-17099 URL: https://issues.apache.org/jira/browse/ARROW-17099 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney I ran into this trying to bisect a Parquet regression that occurred between 2.0.0 and 3.0.0 -- because CMAKE_BUILD_TYPE=debug adds -Werror, this can cause builds to fail, but we need debug symbols to use gdb -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16929) [C++] Remove ExecBatchIterator
Wes McKinney created ARROW-16929: Summary: [C++] Remove ExecBatchIterator Key: ARROW-16929 Project: Apache Arrow Issue Type: Improvement Components: C++ Assignee: Unassigned Priority: Major Created: 28/Jun/22 19:48 Reporter: Wes McKinney Fix For: 9.0.0 The only place left using it is in GroupBy in arrow/compute/exec/aggregate.cc. This can be refactored to use ExecSpan. As part of this removal, we should adapt the benchmarks for ExecSpanIterator to demonstrate the performance improvement there
[jira] [Created] (ARROW-16852) [C++] Migrate SCALAR_AGGREGATE, HASH_AGGREGATE functions to use ExecSpan
Wes McKinney created ARROW-16852: Summary: [C++] Migrate SCALAR_AGGREGATE, HASH_AGGREGATE functions to use ExecSpan Key: ARROW-16852 URL: https://issues.apache.org/jira/browse/ARROW-16852 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 9.0.0 Following work in ARROW-16824 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16847) [C++] Rename or fix compute/kernels/aggregate_{mode, quantile}.cc modules to actually be aggregate functions
Wes McKinney created ARROW-16847: Summary: [C++] Rename or fix compute/kernels/aggregate_{mode, quantile}.cc modules to actually be aggregate functions Key: ARROW-16847 URL: https://issues.apache.org/jira/browse/ARROW-16847 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 9.0.0 These modules implement vector functions even though their file names state otherwise. Either they should implement aggregate functions or the files should be renamed to indicate that they are vector functions -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16845) [C++] ArraySpan::IsNull/IsValid implementations are incorrect for union types
Wes McKinney created ARROW-16845: Summary: [C++] ArraySpan::IsNull/IsValid implementations are incorrect for union types Key: ARROW-16845 URL: https://issues.apache.org/jira/browse/ARROW-16845 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 9.0.0 Because the first buffer is not a validity bitmap. Follow up work from ARROW-16756 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16837) [C++] Investigate performance regressions observed in Unique, VisitArraySpanInline
Wes McKinney created ARROW-16837: Summary: [C++] Investigate performance regressions observed in Unique, VisitArraySpanInline Key: ARROW-16837 URL: https://issues.apache.org/jira/browse/ARROW-16837 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 9.0.0 See discussion in https://github.com/apache/arrow/pull/13364 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16827) [C++] Refactor internal array sorting code to use ArraySpan
Wes McKinney created ARROW-16827: Summary: [C++] Refactor internal array sorting code to use ArraySpan Key: ARROW-16827 URL: https://issues.apache.org/jira/browse/ARROW-16827 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney I won't be tackling this in ARROW-16824 since this code will require more work to port -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16824) [C++] Migrate non-ScalarKernel implementations to use ExecSpan, ArraySpan
Wes McKinney created ARROW-16824: Summary: [C++] Migrate non-ScalarKernel implementations to use ExecSpan, ArraySpan Key: ARROW-16824 URL: https://issues.apache.org/jira/browse/ARROW-16824 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 9.0.0 ARROW-16756 handles the scalar kernels. Migrate the rest of the kernels and remove the old ExecBatch-based exec API -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16819) [C++] arrow::compute::CallFunction needs a batch length for nullary functions
Wes McKinney created ARROW-16819: Summary: [C++] arrow::compute::CallFunction needs a batch length for nullary functions Key: ARROW-16819 URL: https://issues.apache.org/jira/browse/ARROW-16819 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 9.0.0 This is a design deficiency in {{CallFunction}}. If a function is nullary, the execution machinery has no way to determine the output length from an empty vector of datums. We should change {{CallFunction}} to have variants based on {{ExecBatch}} and {{ExecSpan}} (from ARROW-16755) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16758) [C++] Rewrite ExecuteScalarExpression to not use ScalarExecutor
Wes McKinney created ARROW-16758: Summary: [C++] Rewrite ExecuteScalarExpression to not use ScalarExecutor Key: ARROW-16758 URL: https://issues.apache.org/jira/browse/ARROW-16758 Project: Apache Arrow Issue Type: Improvement Reporter: Wes McKinney {{ExecuteScalarExpression}} sets up and tears down {{ScalarExecutor}} from exec.cc for each node in the expression tree. This adds a ton of overhead from moving around non-trivial objects. After ARROW-16756, we should write a new ScalarExpressionExecutor which is careful to construct ArraySpans and execute the expression tree in a much more lightweight / less bloated fashion. Follow on work in a subsequent Jira will add a pool/stack of allocated temporary buffers to reuse during execution -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16757) [C++] Remove "scalar" output modality from array kernels
Wes McKinney created ARROW-16757: Summary: [C++] Remove "scalar" output modality from array kernels Key: ARROW-16757 URL: https://issues.apache.org/jira/browse/ARROW-16757 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Supporting scalar outputs from array kernels (where all the inputs are scalars) introduces needless complexity into the kernel implementations, causing duplication of effort and excess code generation for paltry benefit. In the scenario where all inputs are scalars, it would be better to promote them all to arrays of length 1 (either by creating the arrays or constructing an appropriate ArraySpan per ARROW-16756) and invoking the array code path. This would enable us to delete thousands of lines of code and ease the ongoing development and maintenance of the array kernels codebase -- This message was sent by Atlassian Jira (v8.20.7#820007)
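The promotion strategy described above can be sketched in Python with a toy elementwise kernel. All names here (`add_arrays`, `call_add`) are made up for illustration; Arrow's real dispatch lives in C++:

```python
def add_arrays(left, right):
    # The "array only" kernel: inputs are equal-length sequences.
    return [a + b for a, b in zip(left, right)]

def call_add(left, right):
    # If every input is a scalar, promote each to a length-1 array, run the
    # array code path, and unwrap the result; otherwise broadcast any scalar
    # input to the array length. The kernel itself never sees scalars.
    if not isinstance(left, list) and not isinstance(right, list):
        return add_arrays([left], [right])[0]
    if not isinstance(left, list):
        left = [left] * len(right)
    if not isinstance(right, list):
        right = [right] * len(left)
    return add_arrays(left, right)

print(call_add(2, 3), call_add([1, 2], 10))  # -> 5 [11, 12]
```

The kernel author then writes only the array path, which is the deletion-of-thousands-of-lines payoff the issue describes.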
[jira] [Created] (ARROW-16756) [C++] Introduce initial ArraySpan, ExecSpan non-owning / shared_ptr-free data structures for kernel execution, refactor scalar kernels
Wes McKinney created ARROW-16756: Summary: [C++] Introduce initial ArraySpan, ExecSpan non-owning / shared_ptr-free data structures for kernel execution, refactor scalar kernels Key: ARROW-16756 URL: https://issues.apache.org/jira/browse/ARROW-16756 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 9.0.0 This is essential to reduce microperformance overhead, as has been discussed and investigated in many other places. This first stage of work is to remove the use of {{Datum}} and {{ExecBatch}} from the input side of only the scalar kernels, so that we can work toward using span/view data structures as the inputs (and eventually outputs) of all kernels. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16755) [C++] Improve array expression and kernel evaluation performance on small inputs
Wes McKinney created ARROW-16755: Summary: [C++] Improve array expression and kernel evaluation performance on small inputs Key: ARROW-16755 URL: https://issues.apache.org/jira/browse/ARROW-16755 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney This is an umbrella issue for a variety of follow-up Jiras to refactor and improve the array kernels / function machinery to have less overhead and work more efficiently for parallel processing as well as small inputs (down to ~1000 elements per kernel invocation) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16643) [C++] Fix -Werror CHECKIN build with clang-14
Wes McKinney created ARROW-16643: Summary: [C++] Fix -Werror CHECKIN build with clang-14 Key: ARROW-16643 URL: https://issues.apache.org/jira/browse/ARROW-16643 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 9.0.0 With clang-14, the C++ build fails on a handful of new warnings including {{-Wreturn-stack-address}}. Will submit patch -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-15111) [C++] Implement ODBC driver "wrapper" using FlightSQL
Wes McKinney created ARROW-15111: Summary: [C++] Implement ODBC driver "wrapper" using FlightSQL Key: ARROW-15111 URL: https://issues.apache.org/jira/browse/ARROW-15111 Project: Apache Arrow Issue Type: New Feature Components: FlightRPC Reporter: Wes McKinney The ODBC analogue to ARROW-7744 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14303) [C++][Parquet] Do not duplicate Schema metadata in Parquet schema metadata and serialized ARROW:schema value
Wes McKinney created ARROW-14303: Summary: [C++][Parquet] Do not duplicate Schema metadata in Parquet schema metadata and serialized ARROW:schema value Key: ARROW-14303 URL: https://issues.apache.org/jira/browse/ARROW-14303 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 6.0.0 Metadata values are being duplicated in the Parquet file footer; we should store them only in ARROW:schema or only in the Parquet schema metadata, not both. Removing them from the Parquet schema metadata may break applications that expect that metadata to be there when serialized from Arrow, so dropping the keys from ARROW:schema is probably the safer choice -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13469) [C++] Suppress -Wmissing-field-initializers in DayMilliseconds arrow/type.h
Wes McKinney created ARROW-13469: Summary: [C++] Suppress -Wmissing-field-initializers in DayMilliseconds arrow/type.h Key: ARROW-13469 URL: https://issues.apache.org/jira/browse/ARROW-13469 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 6.0.0 The absence of default values for {{days}} and {{milliseconds}} triggers a compiler warning in some compilers. This could be resolved by setting the struct member default values to 0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13023) [Go] Upgrade "text" dependency to mitigate CVE
Wes McKinney created ARROW-13023: Summary: [Go] Upgrade "text" dependency to mitigate CVE Key: ARROW-13023 URL: https://issues.apache.org/jira/browse/ARROW-13023 Project: Apache Arrow Issue Type: Bug Components: Go Reporter: Wes McKinney See automated report https://github.com/apache/arrow/issues/10392 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13021) [C++] Add/improve documentation about employing Arrow in downstream CMake projects
Wes McKinney created ARROW-13021: Summary: [C++] Add/improve documentation about employing Arrow in downstream CMake projects Key: ARROW-13021 URL: https://issues.apache.org/jira/browse/ARROW-13021 Project: Apache Arrow Issue Type: Improvement Reporter: Wes McKinney In our C++ documentation, it may be useful to create a section about how we recommend introducing Arrow as a build / runtime dependency of downstream projects, particularly other CMake-based build systems. This would be at this level: https://arrow.apache.org/docs/cpp/index.html We have the "Minimal Build" example in the codebase which helps, but it may not cover all the various ways that people need to be able to depend on the project. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12884) [Flight] Data checksumming support
Wes McKinney created ARROW-12884: Summary: [Flight] Data checksumming support Key: ARROW-12884 URL: https://issues.apache.org/jira/browse/ARROW-12884 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Currently, there is not a built-in mechanism to allow for data integrity checks for FlightData messages. This issue is to discuss and see if there may be a way to add this to Flight without making things more complicated for the non-checksummed use case -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12849) [C++] Implement scalar kernel function that computes "isin" for each element in a List array
Wes McKinney created ARROW-12849: Summary: [C++] Implement scalar kernel function that computes "isin" for each element in a List array Key: ARROW-12849 URL: https://issues.apache.org/jira/browse/ARROW-12849 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney The type signature would look like this: {code} (Array<List<T>>, Scalar<T>) -> Array<Boolean> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
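A pure-Python sketch of the proposed semantics: one boolean per list, answering whether the scalar is an element of that list. The function name `list_isin` and the null handling (null list in, null out) are illustrative choices, not settled API:

```python
def list_isin(list_array, value):
    # None stands in for a null list slot; the result is null for null
    # inputs, False for the empty list.
    return [None if lst is None else (value in lst) for lst in list_array]

print(list_isin([[1, 2], [3], None, []], 3))  # -> [False, True, None, False]
```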
[jira] [Created] (ARROW-12530) [C++] Remove Buffer::mutable_data_ member and use const_cast on data_ only if is_mutable_ is true
Wes McKinney created ARROW-12530: Summary: [C++] Remove Buffer::mutable_data_ member and use const_cast on data_ only if is_mutable_ is true Key: ARROW-12530 URL: https://issues.apache.org/jira/browse/ARROW-12530 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 5.0.0 Proposed new implementation of mutable_data() by [~apitrou] {code}
uint8_t* mutable_data() {
  return is_mutable() ? const_cast<uint8_t*>(data()) : nullptr;
}
{code} This will help avoid various classes of bugs (initializing Buffer subclasses incorrectly) and make the object smaller on the heap -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12495) [C++][Python] NumPy buffer sets is_mutable_ to true but does not set mutable_data_ when the NumPy array is writable
Wes McKinney created ARROW-12495: Summary: [C++][Python] NumPy buffer sets is_mutable_ to true but does not set mutable_data_ when the NumPy array is writable Key: ARROW-12495 URL: https://issues.apache.org/jira/browse/ARROW-12495 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Wes McKinney Fix For: 4.0.0 The bug is evident here: {code}
NumPyBuffer::NumPyBuffer(PyObject* ao) : Buffer(nullptr, 0) {
  PyAcquireGIL lock;
  arr_ = ao;
  Py_INCREF(ao);

  if (PyArray_Check(ao)) {
    PyArrayObject* ndarray = reinterpret_cast<PyArrayObject*>(ao);
    data_ = reinterpret_cast<const uint8_t*>(PyArray_DATA(ndarray));
    size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize;
    capacity_ = size_;

    if (PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE) {
      is_mutable_ = true;
    }
  }
}
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12404) [C++] Implement "random" nullary function that generates uniform random between 0 and 1
Wes McKinney created ARROW-12404: Summary: [C++] Implement "random" nullary function that generates uniform random between 0 and 1 Key: ARROW-12404 URL: https://issues.apache.org/jira/browse/ARROW-12404 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney This is similar to PostgreSQL's random() https://www.postgresql.org/docs/8.2/functions-math.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
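A minimal sketch of what such a nullary function could produce, assuming one independent uniform draw in [0, 1) per output slot. The name `random_uniform` and the seed parameter are illustrative, not Arrow's API:

```python
import random

def random_uniform(length, seed=None):
    # One uniform draw in [0, 1) per output slot, like evaluating
    # PostgreSQL's random() once for every row of a batch.
    rng = random.Random(seed)
    return [rng.random() for _ in range(length)]

values = random_uniform(5, seed=42)
print(all(0.0 <= v < 1.0 for v in values))  # -> True
```

Note that because the function takes no array inputs, the caller has to supply the output length explicitly, which is the same design wrinkle raised for nullary functions in ARROW-16819 above.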
[jira] [Created] (ARROW-12280) [Developer] Remove @-mentions from commit messages in merge tool
Wes McKinney created ARROW-12280: Summary: [Developer] Remove @-mentions from commit messages in merge tool Key: ARROW-12280 URL: https://issues.apache.org/jira/browse/ARROW-12280 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Wes McKinney Fix For: 4.0.0 When someone @-mentions someone in their PR description, it triggers spam e-mails in GitHub's system to all the mentioned people each time someone synchronizes their fork. For example, this commit triggered an e-mail to me https://github.com/bkietz/arrow/commit/b2fa55db273d44b14814d45dae8525b065e01a91 It would be fairly easy to sanitize @-mentions by simply stripping the @-symbol, with the right regular expression of course (since the characters after the @ symbol can include hyphens or underscores, but otherwise any ASCII alphanumeric character) -- This message was sent by Atlassian Jira (v8.3.4#803005)
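The sanitization described above might look like this. This is a sketch, not the merge tool's actual code; as written it would also mangle bare e-mail addresses, which a production version would need to handle:

```python
import re

# Per the issue: characters after "@" can include hyphens or underscores,
# but are otherwise ASCII alphanumerics. Stripping only the "@" keeps the
# name readable without triggering a GitHub notification.
MENTION = re.compile(r"@([A-Za-z0-9][A-Za-z0-9_-]*)")

def strip_mentions(text: str) -> str:
    return MENTION.sub(r"\1", text)

print(strip_mentions("Thanks for the review @some-user and @another_user!"))
# prints: Thanks for the review some-user and another_user!
```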
[jira] [Created] (ARROW-11661) [C++] Compilation failure in arrow/scalar.cc on Xcode 8.3.3
Wes McKinney created ARROW-11661: Summary: [C++] Compilation failure in arrow/scalar.cc on Xcode 8.3.3 Key: ARROW-11661 URL: https://issues.apache.org/jira/browse/ARROW-11661 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney See https://gist.github.com/wesm/e3b52381de1556f2af669c7e2458afd0 It seems that this template construct is not supported so robustly across older compilers: {code}
// timestamp to string
Status CastImpl(const TimestampScalar& from, StringScalar* to) {
  to->value = FormatToBuffer(internal::StringFormatter<Int64Type>{}, from);
  return Status::OK();
}

// date to string
template <typename T>
Status CastImpl(const DateScalar<T>& from, StringScalar* to) {
  TimestampScalar ts({}, timestamp(TimeUnit::MILLI));
  RETURN_NOT_OK(CastImpl(from, &ts));
  return CastImpl(ts, to);
}

// string to any
template <typename ScalarType>
Status CastImpl(const StringScalar& from, ScalarType* to) {
  ARROW_ASSIGN_OR_RAISE(auto out,
                        Scalar::Parse(to->type, util::string_view(*from.value)));
  to->value = std::move(checked_cast<ScalarType&>(*out).value);
  return Status::OK();
}

// binary to string
Status CastImpl(const BinaryScalar& from, StringScalar* to) {
  to->value = from.value;
  return Status::OK();
}

// formattable to string
template <typename ScalarType, typename T = typename ScalarType::TypeClass,
          typename Formatter = internal::StringFormatter<T>,
          // note: Value unused but necessary to trigger SFINAE if Formatter is
          // undefined
          typename Value = typename Formatter::value_type>
Status CastImpl(const ScalarType& from, StringScalar* to) {
  to->value = FormatToBuffer(Formatter{from.type}, from);
  return Status::OK();
}
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11643) [C++] protobuf_ep failure on Xcode 8.3.3 / Apple LLVM 8.1
Wes McKinney created ARROW-11643: Summary: [C++] protobuf_ep failure on Xcode 8.3.3 / Apple LLVM 8.1 Key: ARROW-11643 URL: https://issues.apache.org/jira/browse/ARROW-11643 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney I randomly decided to see if we can still build and run on a pre-SSE4.2 machine (2009-era MacBook), but protobuf_ep fails with {code}
FAILED: CMakeFiles/libprotobuf.dir/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc.o
/Applications/Xcode8.3.3.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -DGOOGLE_PROTOBUF_CMAKE_BUILD -DHAVE_PTHREAD -DHAVE_ZLIB -I. -I/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src -Qunused-arguments -fcolor-diagnostics -O3 -DNDEBUG -O3 -DNDEBUG -fPIC -Qunused-arguments -fcolor-diagnostics -O3 -DNDEBUG -O3 -DNDEBUG -fPIC -std=c++11 -MD -MT CMakeFiles/libprotobuf.dir/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc.o -MF CMakeFiles/libprotobuf.dir/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc.o.d -o CMakeFiles/libprotobuf.dir/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc.o -c /Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc
In file included from /Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc:80:
/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/map_field.h:332:37: error: constexpr constructor never produces a constant expression [-Winvalid-constexpr]
  explicit PROTOBUF_MAYBE_CONSTEXPR MapFieldBase(ConstantInitialized)
                                    ^
/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/map_field.h:335:9: note: non-literal type 'internal::WrappedMutex' cannot be used in a constant expression
      mutex_(GOOGLE_PROTOBUF_LINKER_INITIALIZED),
      ^
1 error generated.
{code} Since this appears to be a warning ({{-Winvalid-constexpr}}) promoted to an error, perhaps it can be suppressed -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11391) [C++] HdfsOutputStream::Write unsafely truncates integers exceeding INT32_MAX
Wes McKinney created ARROW-11391: Summary: [C++] HdfsOutputStream::Write unsafely truncates integers exceeding INT32_MAX Key: ARROW-11391 URL: https://issues.apache.org/jira/browse/ARROW-11391 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 4.0.0 Originally reported on user@, see https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.cc#L277 {{tSize}} is a 32-bit integer -- This message was sent by Atlassian Jira (v8.3.4#803005)
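The usual fix for this class of bug is to issue writes in a loop, each no larger than the narrow length type can express, rather than casting the full size. A Python sketch of the idea (the real fix lives in C++ `HdfsOutputStream::Write`; `write_chunked` is a made-up name):

```python
import io

INT32_MAX = 2**31 - 1  # largest length a signed 32-bit size field can carry

def write_chunked(sink, data, max_chunk=INT32_MAX):
    # Issue writes no larger than max_chunk so a narrow (32-bit) length
    # parameter is never silently truncated on a multi-gigabyte buffer.
    view = memoryview(data)
    while view:
        n = min(len(view), max_chunk)
        sink.write(view[:n])
        view = view[n:]

sink = io.BytesIO()
write_chunked(sink, b"abcdef", max_chunk=4)  # tiny chunk size just for demo
print(sink.getvalue())  # -> b'abcdef'
```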
[jira] [Created] (ARROW-11120) [Python][R] Prove out plumbing to pass data between Python and R using rpy2
Wes McKinney created ARROW-11120: Summary: [Python][R] Prove out plumbing to pass data between Python and R using rpy2 Key: ARROW-11120 URL: https://issues.apache.org/jira/browse/ARROW-11120 Project: Apache Arrow Issue Type: Improvement Components: Python, R Reporter: Wes McKinney Per discussion on the mailing list, we should see what is required (if anything) to be able to pass data structures using the C interface between Python and R from the perspective of the Python user using rpy2. rpy2 is sort of the Python version of reticulate. Unit tests will then validate that it's working -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11009) [Python] Add environment variable to elect default usage of system memory allocator instead of jemalloc/mimalloc
Wes McKinney created ARROW-11009: Summary: [Python] Add environment variable to elect default usage of system memory allocator instead of jemalloc/mimalloc Key: ARROW-11009 URL: https://issues.apache.org/jira/browse/ARROW-11009 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 3.0.0 We routinely get reports like ARROW-11007 where there is suspicion of a memory leak (which may or may not be valid) — having an easy way (requiring no changes to application code) to toggle usage of the non-system memory allocator would help with determining whether the memory usage patterns are the result of the allocator being used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10658) [Python] Wheel builds for Apple Silicon
Wes McKinney created ARROW-10658: Summary: [Python] Wheel builds for Apple Silicon Key: ARROW-10658 URL: https://issues.apache.org/jira/browse/ARROW-10658 Project: Apache Arrow Issue Type: Improvement Components: Packaging, Python Reporter: Wes McKinney We are only able to create Intel builds at the moment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10657) [CI] Continuous integration on Apple M1 architecture
Wes McKinney created ARROW-10657: Summary: [CI] Continuous integration on Apple M1 architecture Key: ARROW-10657 URL: https://issues.apache.org/jira/browse/ARROW-10657 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Developer Tools Reporter: Wes McKinney Fix For: 3.0.0 It would be nice if we had some confidence that our next major release runs on Apple Silicon. I am looking at hooking up an M1 Mac Mini to Buildkite so that we are able to run CI jobs on one. If anyone else would like to contribute a machine to the build cluster, please be our guest -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10648) [Java] Prepare Java codebase for source release without requiring any git tags to be created or pushed
Wes McKinney created ARROW-10648: Summary: [Java] Prepare Java codebase for source release without requiring any git tags to be created or pushed Key: ARROW-10648 URL: https://issues.apache.org/jira/browse/ARROW-10648 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools, Java Reporter: Wes McKinney Fix For: 3.0.0 This makes the release process a lot more complex and makes it hard for us to create nightly source RCs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10598) [C++] Improve performance of GenerateBitsUnrolled
Wes McKinney created ARROW-10598: Summary: [C++] Improve performance of GenerateBitsUnrolled Key: ARROW-10598 URL: https://issues.apache.org/jira/browse/ARROW-10598 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 3.0.0 internal::GenerateBitsUnrolled doesn't vectorize too well; there are some improvements we can make to get better code generation -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10569) [C++][Python] Poor Table filtering performance
Wes McKinney created ARROW-10569: Summary: [C++][Python] Poor Table filtering performance Key: ARROW-10569 URL: https://issues.apache.org/jira/browse/ARROW-10569 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Wes McKinney Fix For: 3.0.0 From the mailing list: {code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

num_rows = 10_000_000
data = np.random.randn(num_rows)
df = pd.DataFrame({'data{}'.format(i): data for i in range(100)})
df['key'] = np.random.randint(0, 100, size=num_rows)
rb = pa.record_batch(df)
t = pa.table(df)
{code}
I found that the performance of filtering a record batch is very similar:
{code:java}
In [22]: %timeit df[df.key == 5]
71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}
Whereas the performance of filtering a table is absolutely abysmal (no idea what's going on here):
{code:java}
In [23]: %timeit t.filter(pc.equal(t[-1], 5))
961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
https://lists.apache.org/thread.html/r4d4ffa7935efb2902600b9024859211e53aa6552d43ba0ad83517af5%40%3Cuser.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10567) [C++] Add options to help increase precision of arrow-flight-benchmark
Wes McKinney created ARROW-10567: Summary: [C++] Add options to help increase precision of arrow-flight-benchmark Key: ARROW-10567 URL: https://issues.apache.org/jira/browse/ARROW-10567 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10474) [C++] Implement vector kernel that transforms a boolean mask into selection indices
Wes McKinney created ARROW-10474: Summary: [C++] Implement vector kernel that transforms a boolean mask into selection indices Key: ARROW-10474 URL: https://issues.apache.org/jira/browse/ARROW-10474 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney See discussion in ARROW-10423 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10465) [C++] Faster PEXT for AMD CPUs
Wes McKinney created ARROW-10465: Summary: [C++] Faster PEXT for AMD CPUs Key: ARROW-10465 URL: https://issues.apache.org/jira/browse/ARROW-10465 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 3.0.0 See [https://twitter.com/InstLatX64/status/1322503571288559617] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10380) [C++] Running tests with ASAN, UBSAN using conda-forge compiler toolchain on macOS
Wes McKinney created ARROW-10380: Summary: [C++] Running tests with ASAN, UBSAN using conda-forge compiler toolchain on macOS Key: ARROW-10380 URL: https://issues.apache.org/jira/browse/ARROW-10380 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney

I tried running the test suite with ASAN/UBSAN enabled using the conda-forge toolchain (following the instructions in the Python documentation) and found that it's horribly broken, at least with the way that I'm running it. I would guess there is some additional configuration necessary or perhaps the compiler flags are wrong.

see for example https://gist.github.com/wesm/88aa66f90a642fd0a051c4a7960de350

here is what the compiler flags look like from the CMake output

{code}
-- CMAKE_C_FLAGS: -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -isystem /Users/wesm/miniconda/envs/pyarrow-dev/include -Qunused-arguments -ggdb -O0 -Wall -Wextra -Wdocumentation -Wno-missing-braces -Wno-unused-parameter -Wno-unknown-warning-option -Wno-constant-logical-operand -Werror -Wno-unknown-warning-option -Wno-pass-failed -march=haswell -mavx2
-- CMAKE_CXX_FLAGS: -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden -std=c++14 -fmessage-length=0 -isystem /Users/wesm/miniconda/envs/pyarrow-dev/include -Qunused-arguments -fcolor-diagnostics -ggdb -O0 -Wall -Wextra -Wdocumentation -Wno-missing-braces -Wno-unused-parameter -Wno-unknown-warning-option -Wno-constant-logical-operand -Werror -Wno-unknown-warning-option -Wno-pass-failed -march=haswell -mavx2 -fsanitize=address -DADDRESS_SANITIZER -fsanitize=undefined -fno-sanitize=alignment,vptr,function,float-divide-by-zero -fno-sanitize-recover=all -fsanitize-blacklist=/Users/wesm/code/arrow/cpp/build-support/sanitizer-disallowed-entries.txt
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10351) [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously helps performance
Wes McKinney created ARROW-10351: Summary: [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously helps performance Key: ARROW-10351 URL: https://issues.apache.org/jira/browse/ARROW-10351 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney We don't use any asynchronous concepts in the way that Flight is implemented now, i.e. IPC deconstruction/reconstruction (which may include compression!) is not performed concurrently with moving FlightData objects through the gRPC machinery, which may yield suboptimal performance. It might be better to apply an actor-type approach where a dedicated thread retrieves and prepares the next raw IPC message (within a Future) while the current IPC message is being processed -- that way reading/writing to/from the gRPC stream is not blocked on the IPC code doing its thing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
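The read-ahead idea above can be sketched generically: a single background worker keeps the next message in flight while the current one is processed. Here `read_next` and `process` are hypothetical stand-ins for pulling a FlightData message off the gRPC stream and doing the IPC reconstruction (possibly including decompression):

```python
from concurrent.futures import ThreadPoolExecutor


def pipelined(read_next, process, num_messages):
    """Overlap reading message i+1 with processing message i.

    ``read_next`` should return None when the stream is exhausted; the
    final speculative read-ahead is simply discarded.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_next)          # prefetch the first message
        for _ in range(num_messages):
            msg = future.result()                # wait for the prefetched message
            future = pool.submit(read_next)      # start reading the next one...
            results.append(process(msg))         # ...while we process this one
    return results


# Demo with a plain iterator standing in for the gRPC stream:
data = iter([1, 2, 3])
results = pipelined(lambda: next(data, None), lambda m: m * 10, 3)
```

In the real implementation the roles would likely be reversed (the dedicated thread does the IPC work inside a Future), but the overlap structure is the same.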
[jira] [Created] (ARROW-10147) [Python] Constructing pandas metadata fails if an Index name is not JSON-serializable by default
Wes McKinney created ARROW-10147: Summary: [Python] Constructing pandas metadata fails if an Index name is not JSON-serializable by default Key: ARROW-10147 URL: https://issues.apache.org/jira/browse/ARROW-10147 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 2.0.0

originally reported in https://github.com/apache/arrow/issues/8270

here's a minimal reproduction:

{code}
In [24]: idx = pd.RangeIndex(0, 4, name=np.int64(6))

In [25]: df = pd.DataFrame(index=idx)

In [26]: pa.table(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in
----> 1 pa.table(df)

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.table()

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/code/arrow/python/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    604     pandas_metadata = construct_metadata(df, column_names, index_columns,
    605                                          index_descriptors, preserve_index,
--> 606                                          types)
    607     metadata = deepcopy(schema.metadata) if schema.metadata else dict()
    608     metadata.update(pandas_metadata)

~/code/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, column_names, index_levels, index_descriptors, preserve_index, types)
    243             'version': pa.__version__
    244         },
--> 245         'pandas_version': _pandas_api.version
    246     }).encode('utf8')
    247 }

~/miniconda/envs/arrow-3.7/lib/python3.7/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    229         cls is None and indent is None and separators is None and
    230         default is None and not sort_keys and not kw):
--> 231         return _default_encoder.encode(obj)
    232     if cls is None:
    233         cls = JSONEncoder

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in iterencode(self, o, _one_shot)
    255             self.key_separator, self.item_separator, self.sort_keys,
    256             self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258
    259     def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in default(self, o)
    177
    178         """
--> 179         raise TypeError(f'Object of type {o.__class__.__name__} '
    180                         f'is not JSON serializable')
    181

TypeError: Object of type int64 is not JSON serializable
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
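A possible fix along these lines would be to pass a `default=` hook to `json.dumps` that coerces integer-like scalars (such as an `np.int64` index name) to plain `int`. The sketch below uses a stand-in class instead of NumPy so the example is self-contained; `json_default` and `FakeInt64` are hypothetical names:

```python
import json


class FakeInt64:
    """Stand-in for np.int64: not JSON-serializable, but convertible to int."""

    def __init__(self, value):
        self.value = value

    def __int__(self):
        return self.value


def json_default(obj):
    """Fallback encoder: coerce integer-like scalars to plain int."""
    try:
        return int(obj)
    except (TypeError, ValueError):
        raise TypeError(
            f'Object of type {type(obj).__name__} is not JSON serializable')


# json.dumps({'index_name': FakeInt64(6)}) alone would raise TypeError
encoded = json.dumps({'index_name': FakeInt64(6)}, default=json_default)
```

The actual fix in pandas_compat.py might instead normalize index names before building the metadata dict, but the effect is the same.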
[jira] [Created] (ARROW-10121) [C++][Python] Variable dictionaries do not survive roundtrip to IPC stream
Wes McKinney created ARROW-10121: Summary: [C++][Python] Variable dictionaries do not survive roundtrip to IPC stream Key: ARROW-10121 URL: https://issues.apache.org/jira/browse/ARROW-10121 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Wes McKinney Fix For: 2.0.0

Failing test case (from dev@ https://lists.apache.org/thread.html/r338942b4e9f9316b48e87aab41ac49c7ffedd45733d4a6349523b7eb%40%3Cdev.arrow.apache.org%3E)

{code}
import pyarrow as pa
from io import BytesIO

pa.__version__

schema = pa.schema([pa.field('foo', pa.int32()),
                    pa.field('bar', pa.dictionary(pa.int32(), pa.string()))])

r1 = pa.record_batch(
    [
        [1, 2, 3, 4, 5],
        pa.array(["a", "b", "c", "d", "e"]).dictionary_encode()
    ],
    schema
)
r1.validate()

r2 = pa.record_batch(
    [
        [1, 2, 3, 4, 5],
        pa.array(["c", "c", "e", "f", "g"]).dictionary_encode()
    ],
    schema
)
r2.validate()

assert r1.column(1).dictionary != r2.column(1).dictionary

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, schema)
writer.write(r1)
writer.write(r2)

serialized = BytesIO(sink.getvalue().to_pybytes())
stream = pa.ipc.open_stream(serialized)

deserialized = []
while True:
    try:
        deserialized.append(stream.read_next_batch())
    except StopIteration:
        break

assert deserialized[1][1].to_pylist() == r2[1].to_pylist()
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10117) [C++] Implement work-stealing scheduler / multiple queues in ThreadPool
Wes McKinney created ARROW-10117: Summary: [C++] Implement work-stealing scheduler / multiple queues in ThreadPool Key: ARROW-10117 URL: https://issues.apache.org/jira/browse/ARROW-10117 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney This involves a change from a single task queue shared amongst all threads to a per-thread task queue and the ability for idle threads to take tasks from other threads' queues (work stealing). As part of this, the task submission API would need to be evolved in some fashion to allow for tasks related to a particular workload to end up in the same task queue -- This message was sent by Atlassian Jira (v8.3.4#803005)
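The queueing structure described above can be modeled in a few lines. This is a toy sketch, not Arrow's ThreadPool: real work-stealing schedulers use lock-free deques and randomized victim selection, omitted here for clarity:

```python
from collections import deque


class WorkStealingQueues:
    """Toy model of per-worker task queues with stealing.

    Each worker pops from its own deque (LIFO, for cache locality) and,
    when its deque is empty, steals from the opposite end of another
    worker's deque (FIFO) to reduce contention with the owner.
    """

    def __init__(self, num_workers):
        self.queues = [deque() for _ in range(num_workers)]

    def submit(self, worker, task):
        # Tasks related to one workload go to the same worker's queue
        self.queues[worker].append(task)

    def next_task(self, worker):
        if self.queues[worker]:
            return self.queues[worker].pop()      # own tasks: LIFO
        for victim, q in enumerate(self.queues):
            if victim != worker and q:
                return q.popleft()                # stolen tasks: FIFO
        return None                               # idle: nothing to do
```

The submission-API change mentioned above corresponds to the `worker` argument of `submit`: callers need some way to direct related tasks into one queue.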
[jira] [Created] (ARROW-10097) [C++] Persist SetLookupState in between usages of IsIn when filtering dataset batches
Wes McKinney created ARROW-10097: Summary: [C++] Persist SetLookupState in between usages of IsIn when filtering dataset batches Key: ARROW-10097 URL: https://issues.apache.org/jira/browse/ARROW-10097 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 1.0.1 Reporter: Wes McKinney Fix For: 2.0.0 Building a large hash table has a non-trivial cost. See mailing list discussion https://lists.apache.org/thread.html/rb85519cc21ffb09a836a9107919e07b076165ff81c22fb88b59a8296%40%3Cuser.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
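The intended effect -- build the hash set once and reuse it for every batch -- can be illustrated in plain Python. `SetLookupState` and `is_in` here are simplified stand-ins for the C++ kernel state, not the actual Arrow API:

```python
class SetLookupState:
    """Caches the hash set built from the IsIn value set, so it is
    constructed once rather than once per dataset batch."""

    def __init__(self, value_set):
        self.lookup = set(value_set)   # the expensive part

def is_in(batch, state):
    """Membership test against the pre-built lookup state."""
    return [v in state.lookup for v in batch]

# Build once, reuse for every batch of the dataset scan:
state = SetLookupState(range(0, 1_000_000, 2))
batches = [[1, 2, 3], [4, 5, 6]]
masks = [is_in(b, state) for b in batches]
```

Without the persisted state, each batch would pay the full cost of rebuilding the million-element set.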
[jira] [Created] (ARROW-9983) [C++][Dataset] Use larger default batch size than 32K for Datasets API
Wes McKinney created ARROW-9983: --- Summary: [C++][Dataset] Use larger default batch size than 32K for Datasets API Key: ARROW-9983 URL: https://issues.apache.org/jira/browse/ARROW-9983 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 2.0.0 Dremio uses 64K batch sizes. We could probably get away with even larger batch sizes (e.g. 256K or 1M) and allow memory-constrained users to elect a smaller batch size. See example of some performance issues related to this in ARROW-9924 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface
Wes McKinney created ARROW-9924: --- Summary: [Python] Performance regression reading individual Parquet files using Dataset interface Key: ARROW-9924 URL: https://issues.apache.org/jira/browse/ARROW-9924 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 2.0.0

I haven't investigated very deeply but this seems symptomatic of a problem:

{code}
In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})

In [28]: pq.write_table(pa.table(df), 'test.parquet')

In [29]: timeit pq.read_table('test.parquet')
79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9843) [C++] Implement Between trinary kernel
Wes McKinney created ARROW-9843: --- Summary: [C++] Implement Between trinary kernel Key: ARROW-9843 URL: https://issues.apache.org/jira/browse/ARROW-9843 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney A specialized {{between(arr, left_bound, right_bound)}} kernel would avoid multiple scans and a separate AND operation -- This message was sent by Atlassian Jira (v8.3.4#803005)
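To illustrate the saving, compare the two-pass composition users must write today with a fused single-scan version. This is a plain-Python sketch of the semantics only; the real kernel would operate on Arrow arrays:

```python
def between_two_pass(arr, lo, hi):
    """What composing existing kernels costs: two full scans plus an AND."""
    ge = [x >= lo for x in arr]
    le = [x <= hi for x in arr]
    return [a and b for a, b in zip(ge, le)]


def between_fused(arr, lo, hi):
    """What a dedicated between kernel could do: a single scan,
    no temporary boolean arrays."""
    return [lo <= x <= hi for x in arr]
```

Both produce the same mask; the fused form touches the input once and allocates no intermediates.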
[jira] [Created] (ARROW-9842) [C++] Explore alternative strategy for Compare kernel implementation for better performance
Wes McKinney created ARROW-9842: --- Summary: [C++] Explore alternative strategy for Compare kernel implementation for better performance Key: ARROW-9842 URL: https://issues.apache.org/jira/browse/ARROW-9842 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 2.0.0 The compiler may be able to vectorize comparison operations if the bitpacking of results is deferred until the end (or in chunks). Instead, a temporary bytemap can be populated on a chunk-by-chunk basis and then the bytemaps can be bitpacked into the output buffer. This may also reduce the code size of the compare kernels (which are actually quite large at the moment) -- This message was sent by Atlassian Jira (v8.3.4#803005)
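The strategy can be sketched as two passes: first compute a one-byte-per-element comparison map (a loop a C++ compiler can auto-vectorize), then bitpack that bytemap into the output bitmap. Helper names below are hypothetical:

```python
def compare_bytemap(left, right):
    """Pass 1: one byte per comparison result. In C++ this tight loop
    with no cross-iteration dependence is auto-vectorizable."""
    return bytes(1 if a < b else 0 for a, b in zip(left, right))


def bitpack(bytemap):
    """Pass 2: pack the bytemap into a little-endian, LSB-first bitmap,
    matching Arrow's validity-bitmap bit order."""
    out = bytearray((len(bytemap) + 7) // 8)
    for i, b in enumerate(bytemap):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)
```

Doing the two passes over cache-sized chunks keeps the temporary bytemap small while preserving the vectorization benefit.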
[jira] [Created] (ARROW-9761) [C++] Add experimental pull-based iterator structures to C interface implementation
Wes McKinney created ARROW-9761: --- Summary: [C++] Add experimental pull-based iterator structures to C interface implementation Key: ARROW-9761 URL: https://issues.apache.org/jira/browse/ARROW-9761 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 2.0.0 The purpose of this would be to validate some initial use cases / workflows prior to potentially formalizing the interface in the C ABI -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9740) [C++] Add minimal build option to use ExternalProject to build Arrow from CMake
Wes McKinney created ARROW-9740: --- Summary: [C++] Add minimal build option to use ExternalProject to build Arrow from CMake Key: ARROW-9740 URL: https://issues.apache.org/jira/browse/ARROW-9740 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9705) [C++] Validate that intraday time is zeroed out in Date64 data
Wes McKinney created ARROW-9705: --- Summary: [C++] Validate that intraday time is zeroed out in Date64 data Key: ARROW-9705 URL: https://issues.apache.org/jira/browse/ARROW-9705 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9634) [C++][Python] Restore non-UTC time zones when reading Parquet file that was previously Arrow
Wes McKinney created ARROW-9634: --- Summary: [C++][Python] Restore non-UTC time zones when reading Parquet file that was previously Arrow Key: ARROW-9634 URL: https://issues.apache.org/jira/browse/ARROW-9634 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Wes McKinney Fix For: 2.0.0

This was reported on the mailing list

{code}
In [20]: df = pd.DataFrame({'a': pd.Series(np.arange(0, 1, 1000)).astype(pd.DatetimeTZDtype('ns', 'America/Los_Angeles'))})

In [21]: t = pa.table(df)

In [22]: t
Out[22]:
pyarrow.Table
a: timestamp[ns, tz=America/Los_Angeles]

In [23]: pq.write_table(t, 'test.parquet')

In [24]: pq.read_table('test.parquet')
Out[24]:
pyarrow.Table
a: timestamp[us, tz=UTC]

In [25]: pq.read_table('test.parquet')[0]
Out[25]:
[
  [
    1970-01-01 00:00:00.00,
    1970-01-01 00:00:00.01,
    1970-01-01 00:00:00.02,
    1970-01-01 00:00:00.03,
    1970-01-01 00:00:00.04,
    1970-01-01 00:00:00.05,
    1970-01-01 00:00:00.06,
    1970-01-01 00:00:00.07,
    1970-01-01 00:00:00.08,
    1970-01-01 00:00:00.09
  ]
]
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9633) [C++] Do not toggle memory mapping globally in LocalFileSystem
Wes McKinney created ARROW-9633: --- Summary: [C++] Do not toggle memory mapping globally in LocalFileSystem Key: ARROW-9633 URL: https://issues.apache.org/jira/browse/ARROW-9633 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 2.0.0

In the context of the Datasets API, some file formats benefit greatly from memory mapping (like Arrow IPC files) while others less so. Additionally, in some scenarios, memory mapping could fail when used on network-attached storage devices. Since a filesystem may be used to read different kinds of files, using both memory mapping and non-memory mapping, and since the Datasets API should additionally be able to fall back on non-memory mapping if the attempt to memory map fails, it would make sense to have a non-global option for this: https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h

I would suggest adding a new filesystem API with something like {{OpenMappedInputFile}} with some options to control the behavior when memory mapping is not possible. These options may be among:

* Falling back on a normal RandomAccessFile
* Reading the entire file into memory (or even tmpfs?) and then wrapping it in a BufferReader
* Failing

-- This message was sent by Atlassian Jira (v8.3.4#803005)
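A rough sketch of what an {{OpenMappedInputFile}}-style API with a fallback policy could look like, in Python for brevity. The function name and the `on_failure` options mirror the bullets above but are hypothetical, not the eventual C++ API:

```python
import io
import mmap
import os
import tempfile


def open_mapped_input_file(path, on_failure='fallback'):
    """Try to memory-map ``path``; otherwise apply the configured policy."""
    f = open(path, 'rb')
    try:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except (ValueError, OSError):
        if on_failure == 'fallback':
            return f                      # plain random-access file
        if on_failure == 'buffer':
            data = f.read()
            f.close()
            return io.BytesIO(data)       # whole file wrapped in a buffer
        f.close()
        raise                             # 'fail' policy: propagate the error
    f.close()                             # the mmap holds its own descriptor
    return mm


# Demo on a small temporary file (mapping a nonempty regular file succeeds):
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'arrow')
tmp.close()
m = open_mapped_input_file(tmp.name)
contents = bytes(m[:5])
m.close()
os.unlink(tmp.name)
```

The key point is that the policy travels with the open call, not with global filesystem state.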
[jira] [Created] (ARROW-9612) [Python] Automatically fall back on larger IO block size when JSON parsing fails
Wes McKinney created ARROW-9612: --- Summary: [Python] Automatically fall back on larger IO block size when JSON parsing fails Key: ARROW-9612 URL: https://issues.apache.org/jira/browse/ARROW-9612 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 2.0.0 From GitHub issue https://github.com/apache/arrow/issues/7835 This seems like a less-than-ideal failure mode; perhaps when this occurs it could automatically change to processing the file as a single block? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9610) [Python] Struct types are unhandled in AppendNdarrayItem when converting from a Python sequence to Arrow array in python_to_arrow.cc
Wes McKinney created ARROW-9610: --- Summary: [Python] Struct types are unhandled in AppendNdarrayItem when converting from a Python sequence to Arrow array in python_to_arrow.cc Key: ARROW-9610 URL: https://issues.apache.org/jira/browse/ARROW-9610 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 2.0.0 See https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/python_to_arrow.cc#L759 and mailing list discussion https://lists.apache.org/thread.html/r0a42d315df94997b7a01488d8309a0ad8f3b63997b8b29fdfb23932e%40%3Cuser.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9601) [C++][Flight] IpcWriteOptions do not appear to be propagated in DoGet requests
Wes McKinney created ARROW-9601: --- Summary: [C++][Flight] IpcWriteOptions do not appear to be propagated in DoGet requests Key: ARROW-9601 URL: https://issues.apache.org/jira/browse/ARROW-9601 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Reporter: Wes McKinney Fix For: 2.0.0 I haven't fully investigated this yet, but I have found that while compression (e.g. ZSTD) is respected in DoPut requests on the client side, it does not appear to propagate through DoGet requests. This may be a bug or by design, but I think it should be possible for the client to request that compression be employed when serving a DoGet -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9525) [Python] Array.__str__ shows misleading output for timestamp types with time zone set
Wes McKinney created ARROW-9525: --- Summary: [Python] Array.__str__ shows misleading output for timestamp types with time zone set Key: ARROW-9525 URL: https://issues.apache.org/jira/browse/ARROW-9525 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Wes McKinney

The output is being shown with UTC interpretation

{code}
In [13]: arr = pa.array([0, 1, 2], type=pa.timestamp('ns', 'America/Los_Angeles'))

In [14]: arr.view('int64')
Out[14]:
[
  0,
  1,
  2
]

In [15]: arr.type
Out[15]: TimestampType(timestamp[ns, tz=America/Los_Angeles])

In [16]: arr
Out[16]:
[
  1970-01-01 00:00:00.0,
  1970-01-01 00:00:00.1,
  1970-01-01 00:00:00.2
]
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9518) [Python] Deprecate Union-based serialization implemented by pyarrow.serialization
Wes McKinney created ARROW-9518: --- Summary: [Python] Deprecate Union-based serialization implemented by pyarrow.serialization Key: ARROW-9518 URL: https://issues.apache.org/jira/browse/ARROW-9518 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 2.0.0 Per mailing list discussion -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9500) [C++] Fix segfault with std::to_string in -O3 builds on gcc 7.5.0
Wes McKinney created ARROW-9500: --- Summary: [C++] Fix segfault with std::to_string in -O3 builds on gcc 7.5.0 Key: ARROW-9500 URL: https://issues.apache.org/jira/browse/ARROW-9500 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 There seems to be a gcc bug related to {{std::to_string}} that only appears in {{-O3}} builds. It can be seen in something innocuous like

{code}
return Status::Invalid("Float value ", std::to_string(val),
                       " was truncated converting to", *output.type());
{code}

where val is NaN. I haven't found a canonical reference, but using something other than to_string for the formatting (here just letting {{std::ostringstream}} take care of it) makes the problem go away. I wasn't able to reproduce the issue with gcc-8 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9498) [C++][Parquet] Consider revamping RleDecoder based on "upstream" changes in Apache Impala
Wes McKinney created ARROW-9498: --- Summary: [C++][Parquet] Consider revamping RleDecoder based on "upstream" changes in Apache Impala Key: ARROW-9498 URL: https://issues.apache.org/jira/browse/ARROW-9498 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Since the initial code import in 2016, Impala has made some improvements to RleDecoder that we might examine to see if they are beneficial for us See https://github.com/apache/impala/blob/master/be/src/util/rle-encoding.h and history thereof -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9497) [C++][Parquet] Fix failure caused by malformed repetition/definition levels
Wes McKinney created ARROW-9497: --- Summary: [C++][Parquet] Fix failure caused by malformed repetition/definition levels Key: ARROW-9497 URL: https://issues.apache.org/jira/browse/ARROW-9497 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix a case discovered by OSS-Fuzz -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9451) [Python] Unsigned integer types will accept string values in pyarrow.array
Wes McKinney created ARROW-9451: --- Summary: [Python] Unsigned integer types will accept string values in pyarrow.array Key: ARROW-9451 URL: https://issues.apache.org/jira/browse/ARROW-9451 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 1.0.0

See

{code}
In [12]: pa.array(['5'], type='uint32')
Out[12]:
[
  5
]
{code}

Also:

{code}
In [9]: pa.scalar('5', type='uint8')
Out[9]:

In [10]: pa.scalar('5', type='uint16')
Out[10]:

In [11]: pa.scalar('5', type='uint32')
Out[11]:
{code}

But:

{code}
In [13]: pa.array(['5'], type='int32')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in
----> 1 pa.array(['5'], type='int32')

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
    267     else:
    268         # ConvertPySequence does strict conversion if type is explicitly passed
--> 269         return _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
    270
    271

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
     36
     37     with nogil:
---> 38         check_status(ConvertPySequence(sequence, mask, options, ))
     39
     40     if out.get().num_chunks() == 1:

TypeError: an integer is required (got type str)
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9450) [Python] "pytest pyarrow" takes over 10 seconds to collect tests and start executing
Wes McKinney created ARROW-9450: --- Summary: [Python] "pytest pyarrow" takes over 10 seconds to collect tests and start executing Key: ARROW-9450 URL: https://issues.apache.org/jira/browse/ARROW-9450 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 1.0.0 This seems to be a new development -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9446) [C++] Export compiler information in BuildInfo
Wes McKinney created ARROW-9446: --- Summary: [C++] Export compiler information in BuildInfo Key: ARROW-9446 URL: https://issues.apache.org/jira/browse/ARROW-9446 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 This may help improve debugging and reporting -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9442) [Python] Add pyarrow_wrap_table_no_validate API
Wes McKinney created ARROW-9442: --- Summary: [Python] Add pyarrow_wrap_table_no_validate API Key: ARROW-9442 URL: https://issues.apache.org/jira/browse/ARROW-9442 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 I have discovered that the forced validation check in pyarrow_wrap_table can add 20-30% time to a call to {{RecordBatchStreamReader.read_all}}, which should be expected to be already valid. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9441) [C++] Optimize RecordBatchReader::ReadAll
Wes McKinney created ARROW-9441: --- Summary: [C++] Optimize RecordBatchReader::ReadAll Key: ARROW-9441 URL: https://issues.apache.org/jira/browse/ARROW-9441 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney

Based on perf reports, more time is spent manipulating C++ data structures than reconstructing record batches from IPC messages, which strikes me as not what we want. Here is a perf report based on the Python code

{code}
for i in range(100):
    pa.ipc.open_stream('nyctaxi.arrow').read_all()
{code}

{code}
- 50.40%  0.06%  python  libarrow.so.100.0.0  [.] arrow::RecordBatchReader::ReadAll
   - 50.34% arrow::RecordBatchReader::ReadAll
      - 25.86% arrow::Table::FromRecordBatches
         - 18.41% arrow::SimpleRecordBatch::column
            - 16.00% arrow::MakeArray
               - 10.49% arrow::VisitTypeInline
                    7.71% arrow::PrimitiveArray::SetData
                 1.87% arrow::StringArray::StringArray
           1.54% __pthread_mutex_lock
           0.88% __pthread_mutex_unlock
           0.67% std::_Hash_bytes
           0.60% arrow::ChunkedArray::ChunkedArray
      - 22.30% arrow::RecordBatchReader::ReadAll
         - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext
            - 15.91% arrow::ipc::ReadRecordBatchInternal
               - 15.15% arrow::ipc::LoadRecordBatch
                  - 14.45% arrow::ipc::ArrayLoader::Load
                     + 13.15% arrow::VisitTypeInline
            + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage
              1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch
{code}

Perhaps {{ChunkedArray}} internally should be changed to contain a vector of {{ArrayData}} instead of boxed Arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9424) [C++][Parquet] Disable writing files with LZ4 codec
Wes McKinney created ARROW-9424: --- Summary: [C++][Parquet] Disable writing files with LZ4 codec Key: ARROW-9424 URL: https://issues.apache.org/jira/browse/ARROW-9424 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9419) [C++] Test that "fill_null" function works with sliced inputs, expand tests
Wes McKinney created ARROW-9419: --- Summary: [C++] Test that "fill_null" function works with sliced inputs, expand tests Key: ARROW-9419 URL: https://issues.apache.org/jira/browse/ARROW-9419 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 I observed some bugs in the implementation that I did yesterday so adding tests to cover them -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9412) [C++] Add non-BUNDLED dependencies to exported INSTALL_INTERFACE_LIBS of arrow_static and test that it works
Wes McKinney created ARROW-9412: --- Summary: [C++] Add non-BUNDLED dependencies to exported INSTALL_INTERFACE_LIBS of arrow_static and test that it works Key: ARROW-9412 URL: https://issues.apache.org/jira/browse/ARROW-9412 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney As a companion project to ARROW-7605, we must document and test a workflow for statically linking with external static dependencies. When a dependency is not built as BUNDLED, it can be added to "ARROW_STATIC_INSTALL_INTERFACE_LIBS" so that it's included in ArrowTargets-*.cmake. The third-party project, of course, must configure the dependent CMake targets. Prior to the patch for ARROW-7605, toolchain libraries were added unconditionally to ARROW_STATIC_INSTALL_INTERFACE_LIBS whether BUNDLED or not (including our private jemalloc), creating a broken CMake "arrow_static" target. So this patch is to partially revert those changes to enable static linking with external toolchain libraries without breaking the BUNDLED static builds. Finally, this must be tested similarly to cpp/examples/minimal_build/run_static.sh so that we can verify that each of the build/link scenarios works correctly -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9400) [Python] Do not depend on conda-forge static libraries in Windows wheel builds
Wes McKinney created ARROW-9400: --- Summary: [Python] Do not depend on conda-forge static libraries in Windows wheel builds Key: ARROW-9400 URL: https://issues.apache.org/jira/browse/ARROW-9400 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Based on https://github.com/conda-forge/cfep/blob/e9bb3f58eca79107baede71cb9b05311705a10f2/cfep-18.md it appears that static libraries may not be included in the future in many packages that we use for building the Windows Python wheels. We should change the build to use BUNDLED builds so we don't have this issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9399) [C++] Add forward compatibility checks for unrecognized future MetadataVersion
Wes McKinney created ARROW-9399: --- Summary: [C++] Add forward compatibility checks for unrecognized future MetadataVersion Key: ARROW-9399 URL: https://issues.apache.org/jira/browse/ARROW-9399 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 We should have no need of these checks in theory, but they provide a safeguard should it become necessary, some years in the future, to increment the MetadataVersion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
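The check itself is simple; a sketch, assuming V5 is the newest MetadataVersion recognized as of 1.0.0 (function names hypothetical):

```python
# Versions this build understands; a hypothetical future V6 would be
# unrecognized and must be rejected rather than misinterpreted.
SUPPORTED_METADATA_VERSIONS = {'V1', 'V2', 'V3', 'V4', 'V5'}


def is_supported(version):
    """True if this build recognizes the message's MetadataVersion."""
    return version in SUPPORTED_METADATA_VERSIONS


def check_metadata_version(version):
    """Forward-compatibility guard: fail loudly on messages written by
    a newer format revision than this library knows about."""
    if not is_supported(version):
        raise ValueError(f'Unrecognized future MetadataVersion: {version}')
    return version
```

Placed at message-decode time, the check turns a silent misread into a clear error for readers that predate the bump.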
[jira] [Created] (ARROW-9396) [Python] Expose CpuInfo for informational / debugging purposes
Wes McKinney created ARROW-9396: --- Summary: [Python] Expose CpuInfo for informational / debugging purposes Key: ARROW-9396 URL: https://issues.apache.org/jira/browse/ARROW-9396 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney This would help to see what CpuInfo says about the current processor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9395) [Python] Provide configurable MetadataVersion in IPC API and environment variable to set default to V4 when needed
Wes McKinney created ARROW-9395: --- Summary: [Python] Provide configurable MetadataVersion in IPC API and environment variable to set default to V4 when needed Key: ARROW-9395 URL: https://issues.apache.org/jira/browse/ARROW-9395 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 1.0.0 This is a follow up to ARROW-9265 and must be implemented in order to release 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9379) [Rust] Support unsigned dictionary indices
Wes McKinney created ARROW-9379: --- Summary: [Rust] Support unsigned dictionary indices Key: ARROW-9379 URL: https://issues.apache.org/jira/browse/ARROW-9379 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Wes McKinney -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9378) [Go] Support unsigned dictionary indices
Wes McKinney created ARROW-9378: --- Summary: [Go] Support unsigned dictionary indices Key: ARROW-9378 URL: https://issues.apache.org/jira/browse/ARROW-9378 Project: Apache Arrow Issue Type: Improvement Components: Go Reporter: Wes McKinney -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9377) [Java] Support unsigned dictionary indices
Wes McKinney created ARROW-9377: --- Summary: [Java] Support unsigned dictionary indices Key: ARROW-9377 URL: https://issues.apache.org/jira/browse/ARROW-9377 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Wes McKinney child of ARROW-9259 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9348) [C++] Replace usages of TestBase::MakeRandomArray in testing/gtest_util.h with RandomArrayGenerator
Wes McKinney created ARROW-9348: --- Summary: [C++] Replace usages of TestBase::MakeRandomArray in testing/gtest_util.h with RandomArrayGenerator Key: ARROW-9348 URL: https://issues.apache.org/jira/browse/ARROW-9348 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney This would be good for code cleanliness
[jira] [Created] (ARROW-9326) [Python] Setuptools 49.1.0 appears to break our Python 3.6 builds
Wes McKinney created ARROW-9326: --- Summary: [Python] Setuptools 49.1.0 appears to break our Python 3.6 builds Key: ARROW-9326 URL: https://issues.apache.org/jira/browse/ARROW-9326 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 1.0.0 Not sure who thought it was a good idea to release setuptools on July 3, a holiday in the United States, but it appears to be breaking some of our builds https://github.com/apache/arrow/pull/7539/checks?check_run_id=835994558

{code}
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 297, in run
    self.find_sources()
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 304, in find_sources
    mm.run()
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 535, in run
    self.add_defaults()
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 571, in add_defaults
    sdist.add_defaults(self)
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/_distutils/command/sdist.py", line 228, in add_defaults
    self._add_defaults_ext()
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/_distutils/command/sdist.py", line 312, in _add_defaults_ext
    self.filelist.extend(build_ext.get_source_files())
  File "/opt/conda/envs/arrow/lib/python3.6/distutils/command/build_ext.py", line 420, in get_source_files
    self.check_extensions_list(self.extensions)
  File "/opt/conda/envs/arrow/lib/python3.6/distutils/command/build_ext.py", line 362, in check_extensions_list
    "each element of 'ext_modules' option must be an "
distutils.errors.DistutilsSetupError: each element of 'ext_modules' option must be an Extension instance or 2-tuple
{code}
[jira] [Created] (ARROW-9305) [Python] Dependency load failure in Windows wheel build
Wes McKinney created ARROW-9305: --- Summary: [Python] Dependency load failure in Windows wheel build Key: ARROW-9305 URL: https://issues.apache.org/jira/browse/ARROW-9305 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 1.0.0 The Windows wheels are experiencing a DLL load failure probably due to one of the dependencies
[jira] [Created] (ARROW-9304) [C++] Add "AppendEmptyValue" builder APIs for use inside StructBuilder::AppendNull
Wes McKinney created ARROW-9304: --- Summary: [C++] Add "AppendEmptyValue" builder APIs for use inside StructBuilder::AppendNull Key: ARROW-9304 URL: https://issues.apache.org/jira/browse/ARROW-9304 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 StructBuilder should probably also add "UnsafeAppendNull" so that there is the option of using the Unsafe* operations on the children
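The idea can be shown with toy builders (class and method names are modeled loosely on the C++ builder API, but this Python is a sketch, not the Arrow implementation): appending a null to a struct must still append an empty, non-null placeholder to every child so the child lengths stay aligned with the parent.

```python
class Int64Builder:
    def __init__(self):
        self.values, self.valid = [], []

    def append(self, v):
        self.values.append(v)
        self.valid.append(True)

    def append_empty_value(self):
        # Placeholder slot: keeps the child the same length as the parent
        # without marking the child slot itself null.
        self.values.append(0)
        self.valid.append(True)

class StructBuilder:
    def __init__(self, children):
        self.children = children
        self.valid = []

    def append_null(self):
        # The parent slot is null, but each child still needs a slot.
        self.valid.append(False)
        for child in self.children:
            child.append_empty_value()

    def __len__(self):
        return len(self.valid)
```

An `UnsafeAppendNull` variant would do the same without capacity checks, mirroring the existing `Unsafe*` methods on the children.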
[jira] [Created] (ARROW-9287) [C++] Implement support for unsigned dictionary indices
Wes McKinney created ARROW-9287: --- Summary: [C++] Implement support for unsigned dictionary indices Key: ARROW-9287 URL: https://issues.apache.org/jira/browse/ARROW-9287 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 Follow on work from ARROW-9259
[jira] [Created] (ARROW-9286) [C++] Add function "aliases" to compute::FunctionRegistry
Wes McKinney created ARROW-9286: --- Summary: [C++] Add function "aliases" to compute::FunctionRegistry Key: ARROW-9286 URL: https://issues.apache.org/jira/browse/ARROW-9286 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney The purpose of aliases would be to avoid breaking APIs when/if functions are renamed between releases
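A minimal sketch of what such a registry could look like (method names here are assumptions for illustration, not the `compute::FunctionRegistry` API): an alias is just a second name that resolves to the canonical one before lookup, so renaming a function only requires leaving the old name behind as an alias.

```python
class FunctionRegistry:
    def __init__(self):
        self._functions = {}
        self._aliases = {}

    def add_function(self, name, fn):
        self._functions[name] = fn

    def add_alias(self, alias, target_name):
        # Aliases may only point at functions that are already registered.
        if target_name not in self._functions:
            raise KeyError(f"no function registered as {target_name!r}")
        self._aliases[alias] = target_name

    def get_function(self, name):
        # Resolve an alias to its canonical name, then look up the function.
        return self._functions[self._aliases.get(name, name)]
```

Because resolution happens at lookup time, both names always return the identical function object, and a later rename is invisible to callers using the old name.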
[jira] [Created] (ARROW-9285) [C++] Detect unauthorized memory allocations in function kernels
Wes McKinney created ARROW-9285: --- Summary: [C++] Detect unauthorized memory allocations in function kernels Key: ARROW-9285 URL: https://issues.apache.org/jira/browse/ARROW-9285 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney If a function has been configured to preallocate space, then executing the kernel should not replace the preallocated buffer during execution -- this is an implementation error. Detecting this would be relatively easy and improve debugging for kernel implementers
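One way the check could work, sketched with hypothetical names (none of these correspond to actual Arrow executor APIs): the executor remembers which output buffer it preallocated, runs the kernel, and verifies the kernel wrote into that buffer instead of swapping in a freshly allocated one.

```python
class KernelImplementationError(RuntimeError):
    pass

def execute_preallocated(kernel, inputs, out):
    """Run `kernel`, verifying it filled the preallocated `out` buffer
    rather than replacing it with a new allocation."""
    result = kernel(inputs, out)
    if result is not out:
        raise KernelImplementationError(
            "kernel replaced the preallocated output buffer")
    return result

def good_kernel(inputs, out):
    # Writes into the provided output, as a well-behaved kernel should.
    for i, v in enumerate(inputs):
        out[i] = v + 1
    return out

def bad_kernel(inputs, out):
    # Allocates a new output: exactly the implementation error to detect.
    return [v + 1 for v in inputs]
```

In C++ the identity check would compare buffer addresses before and after execution, which is cheap enough to enable in debug builds.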
[jira] [Created] (ARROW-9278) [C++] Implement Union validity bitmap changes from ARROW-9222
Wes McKinney created ARROW-9278: --- Summary: [C++] Implement Union validity bitmap changes from ARROW-9222 Key: ARROW-9278 URL: https://issues.apache.org/jira/browse/ARROW-9278 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0
[jira] [Created] (ARROW-9270) [C++] Create Buffer to represent a slice of a ResizableBuffer that may yet reallocate
Wes McKinney created ARROW-9270: --- Summary: [C++] Create Buffer to represent a slice of a ResizableBuffer that may yet reallocate Key: ARROW-9270 URL: https://issues.apache.org/jira/browse/ARROW-9270 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney One problem with slicing ResizableBuffer is that slices of a buffer that is still changing may be invalidated after a resize operation. I'd be interested in having a way to slice a ResizableBuffer where the slice is still usable after a resize. This would presume, of course, that the code responsible for the parent and child buffers behaves appropriately (e.g. if you call {{child->data()}} and then resize the parent, the pointer may become invalid).
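The shape of the idea, sketched in Python with hypothetical names (`LazySlice` is not an Arrow type): instead of capturing a raw pointer at slice time, the slice stores `(parent, offset, length)` and re-resolves the parent's current storage on every access, so it survives a reallocation.

```python
class ResizableBuffer:
    def __init__(self, data: bytes):
        self._data = bytearray(data)

    def resize(self, new_size: int):
        # May "reallocate": any raw view handed out earlier is now stale.
        new = bytearray(new_size)
        n = min(len(self._data), new_size)
        new[:n] = self._data[:n]
        self._data = new

    def data(self) -> bytearray:
        return self._data

class LazySlice:
    """A slice that keeps (parent, offset, length) and re-resolves the
    parent's storage on each access, so it stays usable after resizes."""
    def __init__(self, parent, offset, length):
        self.parent, self.offset, self.length = parent, offset, length

    def data(self) -> bytes:
        return bytes(self.parent.data()[self.offset:self.offset + self.length])
```

As the issue notes, any pointer obtained from the slice before a resize is still invalidated; only fetching the data again through the slice is safe.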
[jira] [Created] (ARROW-9265) [C++] Add support for writing MetadataVersion::V4-compatible IPC messages for compatibility with library versions <= 0.17.1
Wes McKinney created ARROW-9265: --- Summary: [C++] Add support for writing MetadataVersion::V4-compatible IPC messages for compatibility with library versions <= 0.17.1 Key: ARROW-9265 URL: https://issues.apache.org/jira/browse/ARROW-9265 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 While we need to increment the MetadataVersion, we should not strand old library versions since V4 is backward compatible with V5.
[jira] [Created] (ARROW-9260) [CI] "ARM64v8 Ubuntu 20.04 C++" fails
Wes McKinney created ARROW-9260: --- Summary: [CI] "ARM64v8 Ubuntu 20.04 C++" fails Key: ARROW-9260 URL: https://issues.apache.org/jira/browse/ARROW-9260 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Wes McKinney Fix For: 1.0.0 This GHA build should be disabled until it is passing reliably, e.g. https://github.com/apache/arrow/runs/816007838. This seems to be similar to the Travis CI failure
[jira] [Created] (ARROW-9259) [Format] Permit unsigned dictionary indices in Columnar.rst
Wes McKinney created ARROW-9259: --- Summary: [Format] Permit unsigned dictionary indices in Columnar.rst Key: ARROW-9259 URL: https://issues.apache.org/jira/browse/ARROW-9259 Project: Apache Arrow Issue Type: Improvement Components: Format Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0
[jira] [Created] (ARROW-9258) [Format] Add V5 MetadataVersion
Wes McKinney created ARROW-9258: --- Summary: [Format] Add V5 MetadataVersion Key: ARROW-9258 URL: https://issues.apache.org/jira/browse/ARROW-9258 Project: Apache Arrow Issue Type: Improvement Components: Format Reporter: Wes McKinney Fix For: 1.0.0 Per mailing list discussion
[jira] [Created] (ARROW-9257) [CI] Fix Travis CI builds
Wes McKinney created ARROW-9257: --- Summary: [CI] Fix Travis CI builds Key: ARROW-9257 URL: https://issues.apache.org/jira/browse/ARROW-9257 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Wes McKinney These are being allowed to fail on master. I am not sure what's wrong with them
[jira] [Created] (ARROW-9254) [C++] Factor out some integer casting internals so they can be reused with temporal casts
Wes McKinney created ARROW-9254: --- Summary: [C++] Factor out some integer casting internals so they can be reused with temporal casts Key: ARROW-9254 URL: https://issues.apache.org/jira/browse/ARROW-9254 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 The "CastNumberToNumberUnsafe" function can be shared outside of scalar_cast_numeric.cc
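The factoring described above can be sketched in Python (the function names mirror `CastNumberToNumberUnsafe` loosely; the temporal-cast wrapper and its constants are hypothetical): a temporal cast is just the shared unchecked numeric conversion applied to the underlying integers.

```python
def cast_number_to_number_unsafe(values, out_ctor):
    # Shared core: element-wise conversion with no bounds/overflow checks.
    return [out_ctor(v) for v in values]

MS_PER_SECOND = 1000

def cast_timestamp_ms_to_s(values_ms):
    # A temporal cast expressed via the shared numeric core: scale the
    # underlying integers, then narrow with the same unchecked conversion.
    return cast_number_to_number_unsafe(
        (v / MS_PER_SECOND for v in values_ms), int)
```

Sharing the core avoids reimplementing (and re-instantiating) the numeric conversion templates in each temporal kernel.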
[jira] [Created] (ARROW-9253) [C++] Add vectorized "IntegersMultipleOf" to arrow/util/int_util.h
Wes McKinney created ARROW-9253: --- Summary: [C++] Add vectorized "IntegersMultipleOf" to arrow/util/int_util.h Key: ARROW-9253 URL: https://issues.apache.org/jira/browse/ARROW-9253 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney There are various places where we check whether the integers in an array are all multiples of another number (e.g. multiples of 86400000, the number of milliseconds per day). It would be better to factor out this data check into a reusable function similar to the {{CheckIntegersInRange}} function
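The contract of such a helper is simple; here is a sketch in Python (the snake_case name is just an illustration of the proposed `IntegersMultipleOf`):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds per day

def integers_multiple_of(values, divisor):
    """Return True if every integer in `values` is a multiple of `divisor`."""
    return all(v % divisor == 0 for v in values)
```

A date32-to-timestamp cast, for instance, could use this to validate that every millisecond value falls on a day boundary before converting.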
[jira] [Created] (ARROW-9252) [Integration] GitHub Actions integration test job does not test against "gold" 0.14.1 files in apache/arrow-testing
Wes McKinney created ARROW-9252: --- Summary: [Integration] GitHub Actions integration test job does not test against "gold" 0.14.1 files in apache/arrow-testing Key: ARROW-9252 URL: https://issues.apache.org/jira/browse/ARROW-9252 Project: Apache Arrow Issue Type: Bug Components: Developer Tools Reporter: Wes McKinney Fix For: 1.0.0 I'm not sure when and why this was dropped but it is critical that these tests from https://github.com/apache/arrow/commit/26d72f328b82bcff4e074109a5f905ebf069a416#diff-776ea3bf11df5829827f7afb43c37174 are restored
[jira] [Created] (ARROW-9251) [C++] Move JSON testing code for integration tests to libarrow_testing
Wes McKinney created ARROW-9251: --- Summary: [C++] Move JSON testing code for integration tests to libarrow_testing Key: ARROW-9251 URL: https://issues.apache.org/jira/browse/ARROW-9251 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 This code contributes over 700KB to release builds and is never used

{code}
-rw--- 1 wesm wesm  34104 Jun 27 11:14 dictionary.cc.o
-rw--- 1 wesm wesm 199592 Jun 27 11:14 feather.cc.o
-rw--- 1 wesm wesm  63448 Jun 27 11:14 json_integration.cc.o
-rw--- 1 wesm wesm 727336 Jun 27 11:14 json_internal.cc.o
-rw--- 1 wesm wesm 828056 Jun 27 11:14 json_simple.cc.o
-rw--- 1 wesm wesm 185344 Jun 27 11:14 message.cc.o
-rw--- 1 wesm wesm 223592 Jun 27 11:14 metadata_internal.cc.o
-rw--- 1 wesm wesm   3416 Jun 27 11:14 options.cc.o
-rw--- 1 wesm wesm 557960 Jun 27 11:14 reader.cc.o
-rw--- 1 wesm wesm 285744 Jun 27 11:14 writer.cc.o
{code}
[jira] [Created] (ARROW-9250) [C++] Compact generated code in compute/kernels/scalar_set_lookup.cc using same method as vector_hash.cc
Wes McKinney created ARROW-9250: --- Summary: [C++] Compact generated code in compute/kernels/scalar_set_lookup.cc using same method as vector_hash.cc Key: ARROW-9250 URL: https://issues.apache.org/jira/browse/ARROW-9250 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 This module can be made to compile smaller and faster by using common kernels for types having the same binary representation
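The sharing technique can be illustrated with a toy dispatch table (all names and the width mapping here are assumptions for illustration, not the scalar_set_lookup.cc implementation): the kernel is specialized only on the physical byte width, so all logical types with the same binary representation reuse one instantiation.

```python
# One kernel per physical byte width, shared by every logical type with
# that representation (e.g. int32, uint32, float32, date32 are all 4 bytes).
PHYSICAL_WIDTH = {
    "int32": 4, "uint32": 4, "float32": 4, "date32": 4,
    "int64": 8, "uint64": 8, "float64": 8, "timestamp[ms]": 8,
}

_kernel_cache = {}

def make_width_kernel(width):
    # Stand-in for a compiled kernel specialized only on byte width.
    return ("set-lookup-kernel", width)

def get_kernel(type_name):
    width = PHYSICAL_WIDTH[type_name]
    if width not in _kernel_cache:
        _kernel_cache[width] = make_width_kernel(width)
    return _kernel_cache[width]
```

In C++ this translates to instantiating the hash/set-lookup template per width rather than per logical type, which is what shrinks both compile time and binary size.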