[jira] [Updated] (ARROW-6388) [C++] Consider implementing BufferOutputStream using BufferBuilder internally
[ https://issues.apache.org/jira/browse/ARROW-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6388: Fix Version/s: (was: 1.0.0) > [C++] Consider implementing BufferOutputStream using BufferBuilder internally > > > Key: ARROW-6388 > URL: https://issues.apache.org/jira/browse/ARROW-6388 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > See discussion in ARROW-6381 https://github.com/apache/arrow/pull/5222 > We should be careful that this doesn't introduce any performance regression. -- This message was sent by Atlassian Jira (v8.3.4#803005)
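The shape of the proposed refactoring can be sketched in a few lines. This is a hypothetical pure-Python illustration, not Arrow's C++ code: the stream stops managing its own buffer and capacity, and instead forwards all writes to a builder that owns the growth logic in one place.

```python
# Hypothetical sketch of the ARROW-6388 idea (illustration only, not Arrow's
# actual C++ implementation): BufferOutputStream delegates byte accumulation
# and resizing to an internal BufferBuilder.

class BufferBuilder:
    """Accumulates bytes; all resize/capacity logic lives here."""
    def __init__(self):
        self._data = bytearray()

    def append(self, data: bytes) -> None:
        self._data.extend(data)

    def finish(self) -> bytes:
        return bytes(self._data)


class BufferOutputStream:
    """Output stream that forwards every write to a BufferBuilder."""
    def __init__(self):
        self._builder = BufferBuilder()

    def write(self, data: bytes) -> None:
        self._builder.append(data)

    def finish(self) -> bytes:
        return self._builder.finish()
```

The appeal is a single implementation of the append/resize path; the performance caveat in the issue applies because the extra indirection sits on the write hot path.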
[jira] [Updated] (ARROW-8970) [C++] Reduce shared library / binary code size (umbrella issue)
[ https://issues.apache.org/jira/browse/ARROW-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8970: Summary: [C++] Reduce shared library / binary code size (umbrella issue) (was: [C++] Reduce shared library code size (umbrella issue)) > [C++] Reduce shared library / binary code size (umbrella issue) > --- > > Key: ARROW-8970 > URL: https://issues.apache.org/jira/browse/ARROW-8970 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > We're reaching a point where we may need to be careful about decisions that > increase code size: > * Instantiating too many templates for code that isn't performance sensitive, > or where some templates may do the same thing (e.g. Int32Type kernels may do > the same thing as a Date32Type kernel) > * Inlining functions that don't need to be inline > Code size tends to correlate also with compilation times, but not always. > I'll use this umbrella issue to organize issues related to reducing compiled > code size > At this moment (2020-05-27), here are the 25 largest object files in a -O2 > build > {code} > 524896 src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o > 531920 src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o > 552000 src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o > 575920 src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o > 595112 > src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o > 645728 src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o > 683040 > src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o > 702232 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o > 729912 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o > 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o > 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o > 
877680 src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o > 885624 src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o > 919072 src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o > 941776 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o > 1055248 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o > 1233304 > src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_compare.cc.o > 1265160 src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o > 1343480 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csf_converter.cc.o > 1346928 src/arrow/CMakeFiles/arrow_objlib.dir/array.cc.o > 1502568 > src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_hash.cc.o > 1609760 > src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_numeric.cc.o > 1794416 src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o > 2759552 > src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_filter.cc.o > 7609432 > src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_take.cc.o > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8901) [C++] Reduce number of take kernels
[ https://issues.apache.org/jira/browse/ARROW-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8901: --- Assignee: Wes McKinney > [C++] Reduce number of take kernels > --- > > Key: ARROW-8901 > URL: https://issues.apache.org/jira/browse/ARROW-8901 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > > After ARROW-8792 we can observe that we are generating 312 take kernels > {code} > In [1]: import pyarrow.compute as pc > > In [2]: reg = pc.function_registry() > > In [3]: reg.get_function('take') > > Out[3]: > arrow.compute.Function > kind: vector > num_kernels: 312 > {code} > You can see them all here: > https://gist.github.com/wesm/c3085bf40fa2ee5e555204f8c65b4ad5 > It's probably going to be sufficient to only support int16, int32, and int64 > index types for almost all types and insert implicit casts (once we implement > implicit-cast-insertion into the execution code) for other index types. If we > determine that there is some performance hot path where we need to specialize > for other index types, then we can always do that. > Additionally, we should be able to collapse the date/time kernels since we're > just moving memory. -- This message was sent by Atlassian Jira (v8.3.4#803005)
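The reduction argument rests on "take" having the same semantics regardless of the index type's width, so narrower index types can be implicitly cast to int16/int32/int64 rather than each getting a dedicated kernel. A pure-Python sketch of those semantics (illustration only, not Arrow's implementation):

```python
# Pure-Python sketch of "take" semantics: output[i] = values[indices[i]],
# with a null (None) index producing a null output slot. The index width
# (int8 vs int64, etc.) never changes this behavior, which is why most of
# the 312 generated kernels are redundant.
def take(values, indices):
    return [None if i is None else values[i] for i in indices]

take(["a", "b", "c", "d"], [3, 1, None, 0])  # -> ["d", "b", None, "a"]
```

The same observation supports collapsing the date/time kernels: for fixed-width types that are "just moving memory", one kernel per physical layout suffices.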
[jira] [Comment Edited] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer
[ https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123933#comment-17123933 ] Wes McKinney edited comment on ARROW-4333 at 6/2/20, 3:30 PM: -- I'd suggest we resolve this and pursue answers to some of the unanswered questions as more specific followups. In particular, I plan to be building multi-kernel expression evaluation in the near future so some of the pipelining/memory reuse questions must be addressed as a part of this was (Author: wesmckinn): I'd suggest we close this and pursue answers to some of the unanswered questions as more specific followups. In particular, I plan to be building multi-kernel expression evaluation in the near future so some of the pipelining/memory reuse questions must be addressed as a part of this > [C++] Sketch out design for kernels and "query" execution in compute layer > -- > > Key: ARROW-4333 > URL: https://issues.apache.org/jira/browse/ARROW-4333 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Micah Kornfield >Priority: Major > Labels: analytics > > It would be good to formalize the design of kernels and the controlling query > execution layer (e.g. volcano batch model?) to understand the following: > Contracts for kernels: > * Thread safety of kernels? > * When kernels should allocate memory vs expect preallocated memory? How to > communicate requirements for a kernel's memory allocation? > * How to communicate whether a kernel's execution is parallelizable > across a ChunkedArray? How to determine if the order of execution across a > ChunkedArray is important? > * How to communicate when it is safe to re-use the same buffers as input > and output to the same kernel? > What does the threading model look like for the higher level of control? > Where should synchronization happen? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer
[ https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123933#comment-17123933 ] Wes McKinney commented on ARROW-4333: - I'd suggest we close this and pursue answers to some of the unanswered questions as more specific followups. In particular, I plan to be building multi-kernel expression evaluation in the near future so some of the pipelining/memory reuse questions must be addressed as a part of this > [C++] Sketch out design for kernels and "query" execution in compute layer > -- > > Key: ARROW-4333 > URL: https://issues.apache.org/jira/browse/ARROW-4333 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Micah Kornfield >Priority: Major > Labels: analytics > > It would be good to formalize the design of kernels and the controlling query > execution layer (e.g. volcano batch model?) to understand the following: > Contracts for kernels: > * Thread safety of kernels? > * When kernels should allocate memory vs expect preallocated memory? How to > communicate requirements for a kernel's memory allocation? > * How to communicate whether a kernel's execution is parallelizable > across a ChunkedArray? How to determine if the order of execution across a > ChunkedArray is important? > * How to communicate when it is safe to re-use the same buffers as input > and output to the same kernel? > What does the threading model look like for the higher level of control? > Where should synchronization happen? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8966) [C++] Move arrow::ArrayData to a separate header file
[ https://issues.apache.org/jira/browse/ARROW-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8966. - Resolution: Fixed This was resolved in https://github.com/apache/arrow/commit/94a5026edb652d060110cac170380edf3d856f05 > [C++] Move arrow::ArrayData to a separate header file > - > > Key: ARROW-8966 > URL: https://issues.apache.org/jira/browse/ARROW-8966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > There are code modules (such as compute kernels) that only require ArrayData > for doing computations, so pulling in all the code in array.h is not > necessary. There are probably other code paths that might benefit from this > also. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6052) [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to builder files
[ https://issues.apache.org/jira/browse/ARROW-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6052. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7310 [https://github.com/apache/arrow/pull/7310] > [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to > builder files > > > Key: ARROW-6052 > URL: https://issues.apache.org/jira/browse/ARROW-6052 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Since these files are getting larger, this would improve codebase > navigability. Probably should use the same naming scheme as builder_* e.g. > {{arrow/array/array_dict.h}} > I recommend also putting the unit test files related to these in there for > better semantic organization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9008) [C++] jemalloc_set_decay_ms precedence
[ https://issues.apache.org/jira/browse/ARROW-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-9008: Summary: [C++] jemalloc_set_decay_ms precedence (was: jemalloc_set_decay_ms precedence) > [C++] jemalloc_set_decay_ms precedence > -- > > Key: ARROW-9008 > URL: https://issues.apache.org/jira/browse/ARROW-9008 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Remi Dettai >Priority: Major > Labels: jemalloc > > I've noticed that the jemalloc const configuration [je_arrow_malloc_conf > |https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/memory_pool.h#L169] > overrides the arrow public function > [jemalloc_set_decay_ms()|https://github.com/apache/arrow/blob/e4bf4297585e1d0723957833d012aaf5c119f6b0/cpp/src/arrow/memory_pool.cc#L69]. > > Is there a way to call jemalloc_set_decay_ms so that it has the right > precedence? > -> if yes, I believe this should be specified in the comments > -> if no, the function should be deprecated -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-4530) [C++] Review Aggregate kernel state allocation/ownership semantics
[ https://issues.apache.org/jira/browse/ARROW-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-4530. --- Resolution: Later > [C++] Review Aggregate kernel state allocation/ownership semantics > -- > > Key: ARROW-4530 > URL: https://issues.apache.org/jira/browse/ARROW-4530 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > Labels: analytics > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8955) [C++] Use kernels for casting Scalar values instead of bespoke implementation
[ https://issues.apache.org/jira/browse/ARROW-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-8955. --- Fix Version/s: (was: 1.0.0) Resolution: Duplicate duplicate of ARROW-9006 > [C++] Use kernels for casting Scalar values instead of bespoke implementation > - > > Key: ARROW-8955 > URL: https://issues.apache.org/jira/browse/ARROW-8955 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > See details of casting in arrow/scalar.cc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8844) [C++] Optimize TransferBitmap unaligned case
[ https://issues.apache.org/jira/browse/ARROW-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8844. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7300 [https://github.com/apache/arrow/pull/7300] > [C++] Optimize TransferBitmap unaligned case > > > Key: ARROW-8844 > URL: https://issues.apache.org/jira/browse/ARROW-8844 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yibo Cai >Assignee: Yibo Cai >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 4.5h > Remaining Estimate: 0h > > TransferBitmap(CopyBitmap, InvertBitmap) unaligned case is processed > bit-by-bit[1]. Similar trick in this PR[2] may also be helpful here to > improve performance by processing in words. > [1] > https://github.com/apache/arrow/blob/e5a33f1220705aec6a224b55d2a6f47fbd957603/cpp/src/arrow/util/bit_util.cc#L121-L134 > [2] https://github.com/apache/arrow/pull/7135 -- This message was sent by Atlassian Jira (v8.3.4#803005)
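The "bit-by-bit" baseline the issue wants to optimize can be illustrated in pure Python (illustration only, not Arrow's C++ code; Arrow bitmaps are LSB-first within each byte):

```python
# Pure-Python illustration of the bit-by-bit bitmap copy that ARROW-8844
# proposes replacing with word-at-a-time processing for unaligned offsets.
def get_bit(buf: bytes, i: int) -> int:
    """Read bit i of an LSB-first bitmap."""
    return (buf[i // 8] >> (i % 8)) & 1

def copy_bitmap_bitwise(src: bytes, src_offset: int, length: int) -> bytearray:
    """Copy `length` bits starting at `src_offset` into a fresh bitmap."""
    out = bytearray((length + 7) // 8)
    for i in range(length):
        if get_bit(src, src_offset + i):
            out[i // 8] |= 1 << (i % 8)
    return out
```

The optimization referenced from PR 7135 processes the bitmap in machine words instead: read two adjacent words, shift and OR them to realign the offset, and emit one whole word per iteration, falling back to the bit loop only for the ragged head and tail.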
[jira] [Created] (ARROW-9006) [C++] Use Cast kernels to implement Scalar::Parse and Scalar::CastTo
Wes McKinney created ARROW-9006: --- Summary: [C++] Use Cast kernels to implement Scalar::Parse and Scalar::CastTo Key: ARROW-9006 URL: https://issues.apache.org/jira/browse/ARROW-9006 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 We should not maintain distinct (and possibly differently behaving) implementations of elementwise array casting and scalar casting. The new kernels framework makes it relatively easy to generate kernels that can process arrays or scalars. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8929) [C++] Change compute::Arity:VarArgs min_args default to 0
[ https://issues.apache.org/jira/browse/ARROW-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8929. - Resolution: Fixed Issue resolved by pull request 7322 [https://github.com/apache/arrow/pull/7322] > [C++] Change compute::Arity:VarArgs min_args default to 0 > - > > Key: ARROW-8929 > URL: https://issues.apache.org/jira/browse/ARROW-8929 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The issue of minimum number of arguments is separate from providing an > {{InputType}} for input type checking. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-3520) [C++] Implement List Flatten kernel
[ https://issues.apache.org/jira/browse/ARROW-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3520: --- Assignee: Wes McKinney > [C++] Implement List Flatten kernel > --- > > Key: ARROW-3520 > URL: https://issues.apache.org/jira/browse/ARROW-3520 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 2.0.0 > > > see also ARROW-45 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9003) [C++] Add VectorFunction wrapping arrow::Concatenate
Wes McKinney created ARROW-9003: --- Summary: [C++] Add VectorFunction wrapping arrow::Concatenate Key: ARROW-9003 URL: https://issues.apache.org/jira/browse/ARROW-9003 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This would be a varargs function -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8929) [C++] Change compute::Arity:VarArgs min_args default to 0
[ https://issues.apache.org/jira/browse/ARROW-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8929: --- Assignee: Wes McKinney > [C++] Change compute::Arity:VarArgs min_args default to 0 > - > > Key: ARROW-8929 > URL: https://issues.apache.org/jira/browse/ARROW-8929 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > The issue of minimum number of arguments is separate from providing an > {{InputType}} for input type checking. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8896) [C++] Reimplement dictionary unpacking in Cast kernels using Take
[ https://issues.apache.org/jira/browse/ARROW-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8896: --- Assignee: Wes McKinney > [C++] Reimplement dictionary unpacking in Cast kernels using Take > - > > Key: ARROW-8896 > URL: https://issues.apache.org/jira/browse/ARROW-8896 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > As suggested by [~apitrou] this should yield less code to maintain -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9001) [R] Box outputs as correct type in call_function
Wes McKinney created ARROW-9001: --- Summary: [R] Box outputs as correct type in call_function Key: ARROW-9001 URL: https://issues.apache.org/jira/browse/ARROW-9001 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Wes McKinney Fix For: 1.0.0 This would prevent segfaults caused by putting the SEXP in the wrong kind of R6 container -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7009) [C++] Refactor filter/take kernels to use Datum instead of overloads
[ https://issues.apache.org/jira/browse/ARROW-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-7009: --- Assignee: Wes McKinney (was: Ben Kietzman) > [C++] Refactor filter/take kernels to use Datum instead of overloads > > > Key: ARROW-7009 > URL: https://issues.apache.org/jira/browse/ARROW-7009 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Minor > Fix For: 1.0.0 > > > Followup to ARROW-6784. See discussion on > [https://github.com/apache/arrow/pull/5686] > as well as ARROW-6959. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8917) [C++][Compute] Formalize "metafunction" concept
[ https://issues.apache.org/jira/browse/ARROW-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8917: --- Assignee: Wes McKinney > [C++][Compute] Formalize "metafunction" concept > --- > > Key: ARROW-8917 > URL: https://issues.apache.org/jira/browse/ARROW-8917 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > A metafunction is a function that provides the {{Execute}} API but does not > contain any kernels. Such functions can also handle non-Array/Scalar inputs > like RecordBatch or Table. > This will enable bindings to invoke such functions (like take, filter) like > {code} > call_function('take', [table, indices]) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
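The dispatch a metafunction performs can be sketched in pure Python. This is a hypothetical illustration of the concept, not Arrow's code; the dict stand-in for a Table and the helper names are invented for the example:

```python
# Hypothetical sketch of a "metafunction": it exposes an Execute-style entry
# point but owns no kernels, instead rewriting RecordBatch/Table-like inputs
# into per-column calls to an ordinary kernel-backed function.
def take_column(values, indices):
    """Stand-in for the kernel-backed array-level take."""
    return [values[i] for i in indices]

def take_meta(data, indices):
    if isinstance(data, dict):  # dict of name -> column, standing in for a Table
        return {name: take_column(col, indices) for name, col in data.items()}
    return take_column(data, indices)  # plain Array input passes straight through
```

This is what lets bindings call `call_function('take', [table, indices])` uniformly: the metafunction absorbs the structural inputs that no individual kernel can handle.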
[jira] [Updated] (ARROW-8992) [CI][C++] march not passing correctly for docker-compose run
[ https://issues.apache.org/jira/browse/ARROW-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8992: Summary: [CI][C++] march not passing correctly for docker-compose run (was: march not passing correctly for docker-compose run) > [CI][C++] march not passing correctly for docker-compose run > > > Key: ARROW-8992 > URL: https://issues.apache.org/jira/browse/ARROW-8992 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Affects Versions: 0.17.0, 0.17.1 > Environment: Mendel Linux 4.0 >Reporter: Elliott Kipp >Priority: Critical > Attachments: CMakeError.log, CMakeOutput.log > > > [https://github.com/apache/arrow/issues/7307] > Building on the new ASUS Tinker Edge T, running Mendel Linux 4.0 (Day). > docker-compose build commands work fine with no errors: > DEBIAN=10 ARCH=arm64v8 docker-compose build debian-cpp && DEBIAN=10 > ARCH=arm64v8 docker-compose build debian-python > DEBIAN=10 ARCH=arm64v8 docker-compose run debian-python - fails with the > following: > -- Running cmake for pyarrow > cmake -DPYTHON_EXECUTABLE=/usr/local/bin/python -G Ninja > -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_GANDIVA=on > -DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=on -DPYARROW_BUILD_PARQUET=on > -DPYARROW_BUILD_PLASMA=on -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off > -DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off > -DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off > -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on > -DCMAKE_BUILD_TYPE=debug /arrow/python > -- The C compiler identification is GNU 8.3.0 > -- The CXX compiler identification is GNU 8.3.0 > -- Check for working C compiler: /usr/lib/ccache/gcc > -- Check for working C compiler: /usr/lib/ccache/gcc -- works > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Detecting C compile features > -- Detecting C compile features - done > -- Check for working CXX compiler: 
/usr/lib/ccache/g++ > -- Check for working CXX compiler: /usr/lib/ccache/g++ -- works > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- System processor: aarch64 > -- Performing Test CXX_SUPPORTS_ARMV8_ARCH > -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed > -- Arrow build warning level: PRODUCTION > CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message): > Unsupported arch flag: -march=. > Call Stack (most recent call first): > CMakeLists.txt:100 (include) > -- Configuring incomplete, errors occurred! > See also "/build/python/temp.linux-aarch64-3.7/CMakeFiles/CMakeOutput.log". > See also "/build/python/temp.linux-aarch64-3.7/CMakeFiles/CMakeError.log". > error: command 'cmake' failed with exit status 1 > Tried the tarball release for both 0.17.0 and 0.17.1, same result. Also tried > compiling manually (following these instructions: > [https://dzone.com/articles/building-pyarrow-with-cuda-support]) with the > same result. > Only modifications I made to source are editing the docker-compose volumes, > as described here: [https://github.com/apache/arrow/pull/6907] > Jira opened, per request at: [https://github.com/apache/arrow/issues/7307] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8995) [C++] Scalar formatting code used in array/diff.cc should be reusable
[ https://issues.apache.org/jira/browse/ARROW-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121145#comment-17121145 ] Wes McKinney commented on ARROW-8995: - Since there is already {{Scalar::ToString}}, let's use that. Does not seem justified to have more than one way to format scalar values. > [C++] Scalar formatting code used in array/diff.cc should be reusable > - > > Key: ARROW-8995 > URL: https://issues.apache.org/jira/browse/ARROW-8995 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > Formatting Array values as strings is not specific to the diff.cc code, so it > may make sense to move this code elsewhere where it can be used generally > (perhaps a method like {{Array::FormatValue}}?). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-5854) [Python] Expose compare kernels on Array class
[ https://issues.apache.org/jira/browse/ARROW-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5854. - Resolution: Fixed Issue resolved by pull request 7273 [https://github.com/apache/arrow/pull/7273] > [Python] Expose compare kernels on Array class > -- > > Key: ARROW-5854 > URL: https://issues.apache.org/jira/browse/ARROW-5854 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Expose the compare kernel for comparing with scalar or array (ARROW-3087, > ARROW-4990) on the python Array class. > This can implement the {{\_\_eq\_\_}} et al dunder methods on the Array class. -- This message was sent by Atlassian Jira (v8.3.4#803005)
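The dunder wiring mentioned in the issue can be sketched in plain Python. This is an illustrative sketch of the pattern, not pyarrow's actual implementation: `__eq__` delegates to an element-wise comparison, broadcasting a scalar operand against the array.

```python
# Illustrative sketch of exposing a compare kernel via __eq__ (not pyarrow's
# real code): array-array inputs compare element-wise, while a non-array
# operand is broadcast as a scalar.
class Array:
    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        if isinstance(other, Array):
            # element-wise comparison of two arrays
            return [a == b for a, b in zip(self.values, other.values)]
        # scalar broadcast: compare every element against the scalar
        return [v == other for v in self.values]
```

In the real binding the returned value is a boolean Arrow array rather than a Python list, and nulls propagate through the comparison.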
[jira] [Created] (ARROW-8999) [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build
Wes McKinney created ARROW-8999: --- Summary: [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build Key: ARROW-8999 URL: https://issues.apache.org/jira/browse/ARROW-8999 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 1.0.0 I've been seeing this segfault periodically the last week, does anyone have an idea what might be wrong? https://github.com/apache/arrow/pull/7273/checks?check_run_id=717249862 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8998) [Python] Make NumPy an optional runtime dependency
Wes McKinney created ARROW-8998: --- Summary: [Python] Make NumPy an optional runtime dependency Key: ARROW-8998 URL: https://issues.apache.org/jira/browse/ARROW-8998 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Wes McKinney Since in the relatively near future, one will be able to do non-trivial analytical operations and query processing natively on Arrow data structures through pyarrow, it does not make sense to require users to always install NumPy when they install pyarrow. I propose to split the NumPy-dependent parts of libarrow_python into a libarrow_numpy (which also must be bundled) and to move this part of the codebase into a separate Cython module. This refactoring should be relatively painless though there may be a number of packaging details to chase up since this would introduce a new shared library to be installed in various packaging targets. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8998) [Python] Make NumPy an optional runtime dependency
[ https://issues.apache.org/jira/browse/ARROW-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8998: Description: Since in the relatively near future, one will be able to do non-trivial analytical operations and query processing natively on Arrow data structures through pyarrow, it does not make sense to require users to always install NumPy when they install pyarrow. I propose to split the NumPy-depending parts of libarrow_python into a libarrow_numpy (which also must be bundled) and moving this part of the codebase into a separate Cython module. This refactoring should be relatively painless though there may be a number of packaging details to chase up since this would introduce a new shared library to be installed in various packaging targets. was: Since in the relatively near future, one will be able to do non-trivial analytical operations and query processing natively on Arrow data structures through pyarrow, it does not make sense to require users to always install NumPy when that install pyarrow. I propose to split the NumPy-depending parts of libarrow_python into a libarrow_numpy (which also must be bundled) and moving this part of the codebase into a separate Cython module. This refactoring should be relatively painless though there may be a number of packaging details to chase up since this would introduce a new shared library to be installed in various packaging targets. > [Python] Make NumPy an optional runtime dependency > -- > > Key: ARROW-8998 > URL: https://issues.apache.org/jira/browse/ARROW-8998 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > > Since in the relatively near future, one will be able to do non-trivial > analytical operations and query processing natively on Arrow data structures > through pyarrow, it does not make sense to require users to always install > NumPy when they install pyarrow. 
I propose to split the NumPy-depending parts > of libarrow_python into a libarrow_numpy (which also must be bundled) and > moving this part of the codebase into a separate Cython module. > This refactoring should be relatively painless though there may be a number > of packaging details to chase up since this would introduce a new shared > library to be installed in various packaging targets. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8937) [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework
[ https://issues.apache.org/jira/browse/ARROW-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8937: --- Assignee: Wes McKinney > [C++] Add "parse_strptime" function for string to timestamp conversions using > the kernels framework > --- > > Key: ARROW-8937 > URL: https://issues.apache.org/jira/browse/ARROW-8937 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > > This should be relatively straightforward to implement using the new kernels > framework -- This message was sent by Atlassian Jira (v8.3.4#803005)
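The per-value behavior the proposed kernel would generalize is ordinary strptime-style parsing; Python's stdlib shows it for a single string (the kernel would apply the same format to every element of a string array):

```python
# Per-value strptime parsing, shown with Python's stdlib. The proposed
# "parse_strptime" kernel would apply one such format string across a
# whole string array, producing a timestamp array.
from datetime import datetime

ts = datetime.strptime("2020-06-02 15:30:00", "%Y-%m-%d %H:%M:%S")
```

Kernel-level questions the scalar case hides include how parse failures map to nulls versus errors, and which timestamp unit the output column carries.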
[jira] [Assigned] (ARROW-7784) [C++] diff.cc is extremely slow to compile
[ https://issues.apache.org/jira/browse/ARROW-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-7784: --- Assignee: Wes McKinney > [C++] diff.cc is extremely slow to compile > -- > > Key: ARROW-7784 > URL: https://issues.apache.org/jira/browse/ARROW-7784 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Minor > Fix For: 1.0.0 > > > This comes up especially when doing an optimized build. {{diff.cc}} is always > enabled even if all components are disabled, and it takes multiple seconds to > compile. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8995) [C++] Scalar formatting code used in array/diff.cc should be reusable
Wes McKinney created ARROW-8995: --- Summary: [C++] Scalar formatting code used in array/diff.cc should be reusable Key: ARROW-8995 URL: https://issues.apache.org/jira/browse/ARROW-8995 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Formatting Array values as strings is not specific to the diff.cc code, so it may make sense to move this code elsewhere where it can be used generally (perhaps a method like {{Array::FormatValue}}?). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8993) [Rust] Support gzipped json files
[ https://issues.apache.org/jira/browse/ARROW-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8993: Summary: [Rust] Support gzipped json files (was: Support gzipped json files) > [Rust] Support gzipped json files > - > > Key: ARROW-8993 > URL: https://issues.apache.org/jira/browse/ARROW-8993 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Mohamed Zenadi >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > It would be interesting to be able to read already compressed json files. > This is regularly done, with many storing their files as json.gz (we do > the same). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8994) [C++] Disable include-what-you-use cpplint lint checks
[ https://issues.apache.org/jira/browse/ARROW-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8994: --- Assignee: Wes McKinney > [C++] Disable include-what-you-use cpplint lint checks > -- > > Key: ARROW-8994 > URL: https://issues.apache.org/jira/browse/ARROW-8994 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > If we want to be serious about IWYU, it would be better to use IWYU directly. > The minimal checks that IWYU does can be a nuisance rather than addressing > the problem holistically -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6052) [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to builder files
[ https://issues.apache.org/jira/browse/ARROW-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6052: --- Assignee: Wes McKinney > [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to > builder files > > > Key: ARROW-6052 > URL: https://issues.apache.org/jira/browse/ARROW-6052 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > > Since these files are getting larger, this would improve codebase > navigability. Probably should use the same naming scheme as builder_* e.g. > {{arrow/array/array_dict.h}} > I recommend also putting the unit test files related to these in there for > better semantic organization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8994) [C++] Disable include-what-you-use cpplint lint checks
Wes McKinney created ARROW-8994: --- Summary: [C++] Disable include-what-you-use cpplint lint checks Key: ARROW-8994 URL: https://issues.apache.org/jira/browse/ARROW-8994 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 If we want to be serious about IWYU, it would be better to use IWYU directly. The minimal checks that IWYU does can be a nuisance rather than addressing the problem holistically -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8991) [C++][Compute] Add scalar_hash function
Wes McKinney created ARROW-8991: --- Summary: [C++][Compute] Add scalar_hash function Key: ARROW-8991 URL: https://issues.apache.org/jira/browse/ARROW-8991 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 The purpose of this function is to compute 32- or 64-bit hash values for each cell in an Array. Hashes for nested types can be computed recursively by combining the hash values of their children -- This message was sent by Atlassian Jira (v8.3.4#803005)
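The recursive scheme described here (hash each cell, combine child hashes for nested types) can be sketched in Python. The 64-bit mixing step follows the common Boost-style hash_combine and is only an illustrative choice, not Arrow's actual algorithm; the null convention is likewise assumed:

```python
MASK64 = (1 << 64) - 1

def hash_combine(seed, h):
    # Boost-style mixing step, widened to 64 bits.
    return (seed ^ (h + 0x9E3779B97F4A7C15 + ((seed << 6) & MASK64) + (seed >> 2))) & MASK64

def scalar_hash(value):
    """Hash one cell; a nested (list) cell combines its children's
    hashes recursively, seeded with the child count."""
    if value is None:
        return 0  # assumed convention: nulls hash to a fixed value
    if isinstance(value, list):
        seed = len(value)
        for child in value:
            seed = hash_combine(seed, scalar_hash(child))
        return seed
    return hash(value) & MASK64
```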
[jira] [Updated] (ARROW-8990) [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library
[ https://issues.apache.org/jira/browse/ARROW-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8990: Description: While we have our own hash table implementation, it would be worthwhile to set up some benchmarks so that we can compare against std::unordered_map and some other thirdparty libraries for hash tables to know whether we should possibly use a thirdparty library. See e.g. https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html Libraries to consider: * https://github.com/sparsehash/sparsehash was: While we have our own hash table implementation, it would be worthwhile to set up some benchmarks so that we can compare against std::unordered_map and some other thirdparty libraries for hash tables to know whether we should possibly use a thirdparty library. See e.g. https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html > [C++] Benchmark hash table against thirdparty options, possibly vendor a > thirdparty hash table library > -- > > Key: ARROW-8990 > URL: https://issues.apache.org/jira/browse/ARROW-8990 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > While we have our own hash table implementation, it would be worthwhile to > set up some benchmarks so that we can compare against std::unordered_map and > some other thirdparty libraries for hash tables to know whether we should > possibly use a thirdparty library. See e.g. > https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html > Libraries to consider: > * https://github.com/sparsehash/sparsehash -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8990) [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library
Wes McKinney created ARROW-8990: --- Summary: [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library Key: ARROW-8990 URL: https://issues.apache.org/jira/browse/ARROW-8990 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney While we have our own hash table implementation, it would be worthwhile to set up some benchmarks so that we can compare against std::unordered_map and some other thirdparty libraries for hash tables to know whether we should possibly use a thirdparty library. See e.g. https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
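The benchmarking methodology (identical pseudo-random key set, timed inserts, one run per candidate table type) can be sketched generically. This models only the shape of such a harness in Python; the real comparison would be a C++ benchmark over std::unordered_map and the thirdparty libraries:

```python
import random
import time

def bench_inserts(make_table, n=100_000, seed=42):
    """Time inserting n pseudo-random 64-bit keys into a fresh mapping
    produced by make_table(); returns (seconds, table) for sanity checks."""
    rng = random.Random(seed)
    keys = [rng.getrandbits(64) for _ in range(n)]
    table = make_table()
    start = time.perf_counter()
    for k in keys:
        table[k] = k
    return time.perf_counter() - start, table

# Compare candidate mapping types on an identical workload.
baseline_secs, _ = bench_inserts(dict)
```

Fixing the seed keeps the key distribution identical across candidates, so measured differences reflect the table implementation rather than the input.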
[jira] [Updated] (ARROW-8989) [C++] Document available functions in compute::FunctionRegistry
[ https://issues.apache.org/jira/browse/ARROW-8989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8989: Summary: [C++] Document available functions in compute::FunctionRegistry (was: [C++] Document available functions in FunctionRegistry) > [C++] Document available functions in compute::FunctionRegistry > --- > > Key: ARROW-8989 > URL: https://issues.apache.org/jira/browse/ARROW-8989 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Create a compute page in the C++ section of the Sphinx docs and make a list > of the available functions and what they do -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8989) [C++] Document available functions in FunctionRegistry
Wes McKinney created ARROW-8989: --- Summary: [C++] Document available functions in FunctionRegistry Key: ARROW-8989 URL: https://issues.apache.org/jira/browse/ARROW-8989 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Create a compute page in the C++ section of the Sphinx docs and make a list of the available functions and what they do -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8922) [C++] Implement example string scalar kernel function to assist with string kernels buildout per ARROW-555
[ https://issues.apache.org/jira/browse/ARROW-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8922. - Resolution: Fixed Issue resolved by pull request 7278 [https://github.com/apache/arrow/pull/7278] > [C++] Implement example string scalar kernel function to assist with string > kernels buildout per ARROW-555 > -- > > Key: ARROW-8922 > URL: https://issues.apache.org/jira/browse/ARROW-8922 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > I will write a patch to provide an example of creating a string-input > string-output kernel for executing scalar-valued string functions -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8988) [Python] After upgrade pyarrow from 0.15 to 0.17.1 connect to hdfs don`t work with libdfs jni
[ https://issues.apache.org/jira/browse/ARROW-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8988: Summary: [Python] After upgrade pyarrow from 0.15 to 0.17.1 connect to hdfs don`t work with libdfs jni (was: Help! After upgrade pyarrow from 0.15 to 0.17.1 connect to hdfs don`t work with libdfs jni) > [Python] After upgrade pyarrow from 0.15 to 0.17.1 connect to hdfs don`t work > with libdfs jni > - > > Key: ARROW-8988 > URL: https://issues.apache.org/jira/browse/ARROW-8988 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1 >Reporter: Pavel Dourugyan >Priority: Major > Labels: beginners, hdfs, hortonworks, libhdfs, pyarrow, python3 > Attachments: 1.txt, 2.txt > > > h2. Problem > After upgrading pyarrow from 0.15 to 0.17, I have some troubles. I > understand that libhdfs3 is no longer supported. However, in my case, libhdfs does not > work either. See below. > My experience in the Hadoop ecosystem is limited, so I may have made some > mistakes. I installed Hortonworks HDP via the Ambari service on a virtual > machine on my PC. > I tried the following: > 1. Just connect: > %xmode Verbose > import pyarrow as pa > hdfs = pa.hdfs.connect(host='hdp.test.com', port=8020, user='hdfs') > --- > FileNotFoundError: [Errno 2] No such file or directory: 'hadoop': 'hadoop' > ([#1.txt]) > 2. To bypass the if driver == 'libhdfs' check: > %xmode Verbose > import pyarrow as pa > hdfs = pa.HadoopFileSystem(host='hdp.test.com', port=8020, user='hdfs', > driver=None) > --- > OSError: Unable to load libjvm: /usr/java/latest//lib/server/libjvm.so: > cannot open shared object file: No such file or directory ([#2.txt]) > 3. With libhdfs3 it works: > import hdfs3 > hdfs = hdfs3.HDFileSystem(host='hdp.test.com', port=8020, user='hdfs') > #ls remote folder > hdfs.ls('/data/', detail=False) > ['/data/TimeSheet.2020-04-11', '/data/test', '/data/test.json'] > h2. Environment. > h4. +Client PC:+ > OS: Debian 10.
Dev.: Anaconda3 (python 3.7.6), Jupyter Lab 2, pyarrow 0.17.1 > (from conda-forge) > +Hadoop+ (on VM – Oracle VirtualBox): > OS: Oracle Linux 7.6. Distr.: Hortonworks HDP 3.1.4 > libhdfs.so: > [root@hdp /]# find / -name libhdfs.so > /usr/lib/ams-hbase/lib/hadoop-native/libhdfs.so > /usr/hdp/3.1.4.0-315/usr/lib/libhdfs.so > > Java path: > [root@hdp /]# sudo alternatives --config java > > --- > *+ 1 java-1.8.0-openjdk.x86_64 > (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/bin/java) > > libjvm: > [root@hdp /]# find / -name libjvm.* > > /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/server/libjvm.so > /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so > > I tried many settings; below is the last one: > # etc/profile. > ... > export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac) > export JRE_HOME=$JAVA_HOME/jre > export > JAVA_CLASSPATH=$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar > export HADOOP_HOME=/usr/hdp/3.1.4.0-315/hadoop > export HADOOP_CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | xargs echo | tr ' > ' ':') > export ARROW_LIBHDFS_DIR=/usr/lib/ams-hbase/lib/hadoop-native > export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin > export CLASSPATH==.:$CLASSPATH:$JAVA_CLASSPATH:$HADOOP_CLASSPATH > export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JRE_HOME/lib/amd64/server > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8916) [Python] Add relevant glue for implementing each kind of FunctionOptions
[ https://issues.apache.org/jira/browse/ARROW-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-8916. --- Fix Version/s: (was: 1.0.0) Resolution: Fixed I think it's fine that we deal with the options on a case by case basis. Not that many functions will require options anyhow > [Python] Add relevant glue for implementing each kind of FunctionOptions > > > Key: ARROW-8916 > URL: https://issues.apache.org/jira/browse/ARROW-8916 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8917) [C++][Compute] Formalize "metafunction" concept
[ https://issues.apache.org/jira/browse/ARROW-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8917: Description: A metafunction is a function that provides the {{Execute}} API but does not contain any kernels. Such functions can also handle non-Array/Scalar inputs like RecordBatch or Table. This will enable bindings to invoke such functions (like take, filter) like {code} call_function('take', [table, indices]) {code} was: This will enable bindings to invoke such functions (like take, filter) like {code} call_function('take', [table, indices]) {code} > [C++][Compute] Formalize "metafunction" concept > --- > > Key: ARROW-8917 > URL: https://issues.apache.org/jira/browse/ARROW-8917 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > A metafunction is a function that provides the {{Execute}} API but does not > contain any kernels. Such functions can also handle non-Array/Scalar inputs > like RecordBatch or Table. > This will enable bindings to invoke such functions (like take, filter) like > {code} > call_function('take', [table, indices]) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8917) [C++][Compute] Formalize "metafunction" concept
[ https://issues.apache.org/jira/browse/ARROW-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8917: Summary: [C++][Compute] Formalize "metafunction" concept (was: [C++] Add compute::Function subclass for invoking certain kernels on RecordBatch/Table-valued inputs) > [C++][Compute] Formalize "metafunction" concept > --- > > Key: ARROW-8917 > URL: https://issues.apache.org/jira/browse/ARROW-8917 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > This will enable bindings to invoke such functions (like take, filter) like > {code} > call_function('take', [table, indices]) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8971) [Python] Upgrade pip version in manylinux* builds
[ https://issues.apache.org/jira/browse/ARROW-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-8971. --- Resolution: Won't Fix The pip in the manylinux images is not at risk of experiencing this security issue > [Python] Upgrade pip version in manylinux* builds > - > > Key: ARROW-8971 > URL: https://issues.apache.org/jira/browse/ARROW-8971 > Project: Apache Arrow > Issue Type: Bug >Reporter: bindu >Assignee: bindu >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > Could you please update pip to the latest version, 20.1? > [https://github.com/apache/arrow/blob/2688a62f8179f20c20c06a10fcd22fe8a714ae48/python/manylinux1/scripts/requirements.txt] > CVE-2018-20225 > An issue was discovered in pip (all versions) because it installs the version > with the highest version number, even if the user had intended to obtain a > private package from a private index. This only affects use of the > --extra-index-url option, and exploitation requires that the package does not > already exist in the public index (and thus the attacker can put the package > there with an arbitrary version number). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package
[ https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8394: Summary: [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package (was: Typescript compiler errors for arrow d.ts files, when using es2015-esm package) > [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm > package > --- > > Key: ARROW-8394 > URL: https://issues.apache.org/jira/browse/ARROW-8394 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.16.0 >Reporter: Shyamal Shukla >Priority: Blocker > > Attempting to use apache-arrow within a web application, but typescript > compiler throws the following errors in some of arrow's .d.ts files > import \{ Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow"; > export class SomeClass { > . > . > constructor() { > const t = Table.from(''); > } > *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: > Class static side 'typeof Column' incorrectly extends base class static side > 'typeof Chunked'. Types of property 'new' are incompatible. > *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: > Subsequent property declarations must have the same type. Property 'schema' > must be of type 'Schema', but here has type 'Schema'. > 238 schema: Schema; > *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error > TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. > The types of 'slice(...).clone' are incompatible between these types. > the tsconfig.json file looks like > { > "compilerOptions": { > "target":"ES6", > "outDir": "dist", > "baseUrl": "src/" > }, > "exclude": ["dist"], > "include": ["src/*.ts"] > } -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8964) [Python][Parquet] improve reading of partitioned parquet datasets whose schema changed
[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8964: Summary: [Python][Parquet] improve reading of partitioned parquet datasets whose schema changed (was: Pyarrow: improve reading of partitioned parquet datasets whose schema changed) > [Python][Parquet] improve reading of partitioned parquet datasets whose > schema changed > -- > > Key: ARROW-8964 > URL: https://issues.apache.org/jira/browse/ARROW-8964 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.1 > Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow > 0.17.1 >Reporter: Ira Saktor >Priority: Major > > Hi there, I'm encountering the following issue when reading from HDFS: > > *My situation:* > I have a partitioned parquet dataset in HDFS, whose recent partitions contain > parquet files with more columns than the older ones. When I try to read data > using pyarrow.dataset.dataset and filter on recent data, I still get only the > columns that are also contained in the old parquet files. I'd like to somehow > merge the schema or use the schema from the parquet files from which data ends up > being loaded. > *when using:* > `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', > filters = my_filter_expression).to_table().to_pandas()` > Is there a way to handle schema changes so that the read data > contains all columns? > Everything works fine when I copy the needed parquet files into a separate > folder, however it is a very inconvenient way of working. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
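The "merge the schema" behavior the reporter asks for amounts to unifying the per-file schemas and null-padding rows from files that lack the newer columns. A minimal sketch of that unification step, operating on plain column-name lists and dict rows rather than real Arrow schemas (the helper names are hypothetical):

```python
def unify_columns(per_file_columns):
    """Union of column names across files, preserving first-seen order,
    so older files' rows can be padded with nulls for newer columns."""
    unified = []
    for columns in per_file_columns:
        for name in columns:
            if name not in unified:
                unified.append(name)
    return unified

def pad_row(row, unified):
    # Fill columns missing from this file's schema with None (null).
    return {name: row.get(name) for name in unified}
```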
[jira] [Resolved] (ARROW-8982) [CI] Remove allow_failures for s390x in TravisCI
[ https://issues.apache.org/jira/browse/ARROW-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8982. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7301 [https://github.com/apache/arrow/pull/7301] > [CI] Remove allow_failures for s390x in TravisCI > > > Key: ARROW-8982 > URL: https://issues.apache.org/jira/browse/ARROW-8982 > Project: Apache Arrow > Issue Type: Bug > Components: CI >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Now, all of existing tests except Parquet pass on s390x. It is good time to > remove {{allow_failures}} for s390x on TravisCI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8985) [Format] Add "byte width" field with default of 16 to Decimal Flatbuffers type for forward compatibility
Wes McKinney created ARROW-8985: --- Summary: [Format] Add "byte width" field with default of 16 to Decimal Flatbuffers type for forward compatibility Key: ARROW-8985 URL: https://issues.apache.org/jira/browse/ARROW-8985 Project: Apache Arrow Issue Type: Improvement Components: Format Reporter: Wes McKinney Fix For: 1.0.0 This will permit larger or smaller decimals to be added to the format later without having to add a new Type union value -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8966) [C++] Move arrow::ArrayData to a separate header file
[ https://issues.apache.org/jira/browse/ARROW-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8966: --- Assignee: Wes McKinney > [C++] Move arrow::ArrayData to a separate header file > - > > Key: ARROW-8966 > URL: https://issues.apache.org/jira/browse/ARROW-8966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > There are code modules (such as compute kernels) that only require ArrayData > for doing computations, so pulling in all the code in array.h is not > necessary. There are probably other code paths that might benefit from this > also. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary
[ https://issues.apache.org/jira/browse/ARROW-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6856: --- Assignee: Wes McKinney > [C++] Use ArrayData instead of Array for ArrayData::dictionary > -- > > Key: ARROW-6856 > URL: https://issues.apache.org/jira/browse/ARROW-6856 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > This would be helpful for consistency. {{DictionaryArray}} may want to cache > a "boxed" version of this to return from {{DictionaryArray::dictionary}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8983) [Python] Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0
[ https://issues.apache.org/jira/browse/ARROW-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8983: Summary: [Python] Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0 (was: Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0) > [Python] Downloading sources of pyarrow and its requirements from pypi takes > several minutes starting from 0.16.0 > - > > Key: ARROW-8983 > URL: https://issues.apache.org/jira/browse/ARROW-8983 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.16.0, 0.17.0, 0.17.1 >Reporter: Valentyn Tymofieiev >Priority: Minor > > It appears that > python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: > takes several minutes to execute. > There seems to be an increase in runtime starting from 0.16.0: on Python 2 > python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: > appears to be somewhat faster, but the same command is still slow on Py3. > The command is stuck for a while with "Installing build dependencies ... ", > and increased CPU usage. > The intent of this command is to download source tarball for a package and > its dependencies. > Some investigation was started on the mailing list: > https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119554#comment-17119554 ] Wes McKinney commented on ARROW-8976: - I plan to add Take and Filter "metafunctions" that deal with this and also Table/RecordBatch inputs > [C++] compute::CallFunction can't Filter/Take with ChunkedArray > --- > > Key: ARROW-8976 > URL: https://issues.apache.org/jira/browse/ARROW-8976 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Followup to ARROW-8938 > {{Invalid: Kernel does not support chunked array arguments}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8976: --- Assignee: Wes McKinney > [C++] compute::CallFunction can't Filter/Take with ChunkedArray > --- > > Key: ARROW-8976 > URL: https://issues.apache.org/jira/browse/ARROW-8976 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Followup to ARROW-8938 > {{Invalid: Kernel does not support chunked array arguments}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8926) [C++] Improve docstrings in new public APIs in arrow/compute and fix miscellaneous typos
[ https://issues.apache.org/jira/browse/ARROW-8926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8926. - Resolution: Fixed Issue resolved by pull request 7264 [https://github.com/apache/arrow/pull/7264] > [C++] Improve docstrings in new public APIs in arrow/compute and fix > miscellaneous typos > > > Key: ARROW-8926 > URL: https://issues.apache.org/jira/browse/ARROW-8926 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > I've noticed some imprecise language while reading the headers and some other > opportunities for improvement -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8978) [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" Valgrind warning
[ https://issues.apache.org/jira/browse/ARROW-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8978: --- Assignee: Wes McKinney > [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" > Valgrind warning > > > Key: ARROW-8978 > URL: https://issues.apache.org/jira/browse/ARROW-8978 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kouhei Sutou >Assignee: Wes McKinney >Priority: Major > > https://github.com/ursa-labs/crossbow/runs/715700830#step:6:4277 > {noformat} > [ RUN ] TestCallScalarFunction.PreallocationCases > ==5357== Conditional jump or move depends on uninitialised value(s) > ==5357==at 0x51D69A6: void arrow::internal::TransferBitmap true>(unsigned char const*, long, long, long, unsigned char*) > (bit_util.cc:176) > ==5357==by 0x51CE866: arrow::internal::CopyBitmap(unsigned char const*, > long, long, unsigned char*, long, bool) (bit_util.cc:208) > ==5357==by 0x52B6325: > arrow::compute::detail::NullPropagator::PropagateSingle() (exec.cc:295) > ==5357==by 0x52B36D1: Execute (exec.cc:378) > ==5357==by 0x52B36D1: > arrow::compute::detail::PropagateNulls(arrow::compute::KernelContext*, > arrow::compute::ExecBatch const&, arrow::ArrayData*) (exec.cc:412) > ==5357==by 0x52BA7F3: ExecuteBatch (exec.cc:586) > ==5357==by 0x52BA7F3: > arrow::compute::detail::ScalarExecutor::Execute(std::vector std::allocator > const&, arrow::compute::detail::ExecListener*) > (exec.cc:542) > ==5357==by 0x52BC21F: > arrow::compute::Function::Execute(std::vector std::allocator > const&, arrow::compute::FunctionOptions > const*, arrow::compute::ExecContext*) const (function.cc:94) > ==5357==by 0x52B141C: > arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) > (exec.cc:937) > ==5357==by 0x52B16F2: > 
arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::ExecContext*) (exec.cc:942) > ==5357==by 0x155515: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody()::{lambda(std::__cxx11::basic_string std::char_traits, std::allocator > >)#1}::operator()(std::__cxx11::basic_string, > std::allocator >) const (exec_test.cc:756) > ==5357==by 0x156AF2: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody() > (exec_test.cc:786) > ==5357==by 0x5BE4862: void > testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357==by 0x5BDEDE2: void > testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357== > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8918) [C++] Add cast "metafunction" to FunctionRegistry that addresses dispatching to appropriate type-specific CastFunction
[ https://issues.apache.org/jira/browse/ARROW-8918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8918. - Resolution: Fixed Issue resolved by pull request 7258 [https://github.com/apache/arrow/pull/7258] > [C++] Add cast "metafunction" to FunctionRegistry that addresses dispatching > to appropriate type-specific CastFunction > -- > > Key: ARROW-8918 > URL: https://issues.apache.org/jira/browse/ARROW-8918 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > By setting the output type in {{CastOptions}}, we can write > {code} > call_function("cast", [arg], cast_options) > {code} > This simplifies use of casting for binding developers. This mimics the > standard SQL > {code} > CAST(expr AS target_type) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
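The dispatch pattern behind this cast "metafunction" can be illustrated in a small Python sketch: one registered function reads the target type out of its options and forwards to a type-specific cast. Everything here (the dict-based registry, the options dict standing in for CastOptions) is a hypothetical stand-in, not Arrow's implementation:

```python
# Hypothetical registry/metafunction machinery; only the dispatch
# pattern mirrors the design described above.
registry = {}

def register(name, func):
    registry[name] = func

def call_function(name, args, options=None):
    return registry[name](args, options)

# Type-specific casts the "cast" metafunction dispatches to.
TYPE_CASTS = {
    "int": lambda column: [None if v is None else int(v) for v in column],
    "str": lambda column: [None if v is None else str(v) for v in column],
}

def cast_metafunction(args, options):
    # options plays the role of CastOptions: it carries the target type,
    # so callers need only one entry point for every cast.
    return TYPE_CASTS[options["to_type"]](args[0])

register("cast", cast_metafunction)

ints = call_function("cast", [["1", "2", None]], {"to_type": "int"})
```

This mirrors how `CAST(expr AS target_type)` needs a single function name even though a different kernel runs per target type.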
[jira] [Resolved] (ARROW-8968) [C++][Gandiva] Show link warning message on s390x
[ https://issues.apache.org/jira/browse/ARROW-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8968. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7295 [https://github.com/apache/arrow/pull/7295] > [C++][Gandiva] Show link warning message on s390x > - > > Key: ARROW-8968 > URL: https://issues.apache.org/jira/browse/ARROW-8968 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > When execute gandiva test, the warning message is shown as follows > {code} > ~/arrow/cpp/src/gandiva$ ../../build/debug/gandiva-binary-test -V > Running main() from > /home/ishizaki/arrow/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from TestBinary > [ RUN ] TestBinary.TestSimple > warning: Linking two modules of different data layouts: 'precompiled' is > 'E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64' whereas 'codegen' is > 'E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64' > [ OK ] TestBinary.TestSimple (41 ms) > [--] 1 test from TestBinary (41 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (41 ms total) > [ PASSED ] 1 test. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8793) [C++] BitUtil::SetBitsTo probably doesn't need to be inline
[ https://issues.apache.org/jira/browse/ARROW-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8793. - Resolution: Fixed Issue resolved by pull request 7296 [https://github.com/apache/arrow/pull/7296] > [C++] BitUtil::SetBitsTo probably doesn't need to be inline > --- > > Key: ARROW-8793 > URL: https://issues.apache.org/jira/browse/ARROW-8793 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Inlining this function probably does not yield meaningful performance benefits -- This message was sent by Atlassian Jira (v8.3.4#803005)
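The tradeoff behind ARROW-8793, defining a bit-setting routine out-of-line in a .cc file instead of inline in a header, can be illustrated with a simplified version of such a function (a naive sketch, not Arrow's actual BitUtil::SetBitsTo):

```cpp
#include <cassert>
#include <cstdint>

// Naive bit-by-bit sketch of a SetBitsTo-style routine, defined out-of-line
// so each caller gets a call instead of an inlined copy of the body. The
// real implementation masks the partial edge bytes and fills whole interior
// bytes at once; only that fast path would make inlining tempting.
void SetBitsTo(uint8_t* bits, int64_t start, int64_t length, bool value) {
  for (int64_t i = start; i < start + length; ++i) {
    if (value) {
      bits[i / 8] |= static_cast<uint8_t>(1 << (i % 8));
    } else {
      bits[i / 8] &= static_cast<uint8_t>(~(1 << (i % 8)));
    }
  }
}
```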
[jira] [Updated] (ARROW-8971) [Python] Upgrade pip version in manylinux* builds
[ https://issues.apache.org/jira/browse/ARROW-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8971: Summary: [Python] Upgrade pip version in manylinux* builds (was: [Python] Upgrade pip) > [Python] Upgrade pip version in manylinux* builds > - > > Key: ARROW-8971 > URL: https://issues.apache.org/jira/browse/ARROW-8971 > Project: Apache Arrow > Issue Type: Bug >Reporter: bindu >Assignee: bindu >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Could you please update the pip latest version 20.1 > [https://github.com/apache/arrow/blob/2688a62f8179f20c20c06a10fcd22fe8a714ae48/python/manylinux1/scripts/requirements.txt] > CVE-2018-20225 > An issue was discovered in pip (all versions) because it installs the version > with the highest version number, even if the user had intended to obtain a > private package from a private index. This only affects use of the > --extra-index-url option, and exploitation requires that the package does not > already exist in the public index (and thus the attacker can put the package > there with an arbitrary version number). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8960) [MINOR] [FORMAT] Fix typos in comments
[ https://issues.apache.org/jira/browse/ARROW-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8960. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7274 [https://github.com/apache/arrow/pull/7274] > [MINOR] [FORMAT] Fix typos in comments > -- > > Key: ARROW-8960 > URL: https://issues.apache.org/jira/browse/ARROW-8960 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Chen >Assignee: Chen >Priority: Trivial > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8960) [MINOR] [FORMAT] fix typo
[ https://issues.apache.org/jira/browse/ARROW-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8960: --- Assignee: Chen > [MINOR] [FORMAT] fix typo > - > > Key: ARROW-8960 > URL: https://issues.apache.org/jira/browse/ARROW-8960 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Chen >Assignee: Chen >Priority: Trivial > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8960) [MINOR] [FORMAT] Fix typos in comments
[ https://issues.apache.org/jira/browse/ARROW-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8960: Summary: [MINOR] [FORMAT] Fix typos in comments (was: [MINOR] [FORMAT] fix typo) > [MINOR] [FORMAT] Fix typos in comments > -- > > Key: ARROW-8960 > URL: https://issues.apache.org/jira/browse/ARROW-8960 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Chen >Assignee: Chen >Priority: Trivial > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8960) [MINOR] [FORMAT] fix typo
[ https://issues.apache.org/jira/browse/ARROW-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8960: Component/s: Documentation > [MINOR] [FORMAT] fix typo > - > > Key: ARROW-8960 > URL: https://issues.apache.org/jira/browse/ARROW-8960 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Chen >Assignee: Chen >Priority: Trivial > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8970) [C++] Reduce shared library code size (umbrella issue)
[ https://issues.apache.org/jira/browse/ARROW-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8970: Description: We're reaching a point where we may need to be careful about decisions that increase code size: * Instantiating too many templates for code that isn't performance sensitive, or where some templates may do the same thing (e.g. Int32Type kernels may do the same thing as a Date32Type kernel) * Inlining functions that don't need to be inline Code size tends to correlate also with compilation times, but not always. I'll use this umbrella issue to organize issues related to reducing compiled code size At this moment (2020-05-27), here are the 25 largest object files in a -O2 build {code} 524896 src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o 531920 src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o 552000 src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o 575920 src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o 595112 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o 645728 src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o 683040 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o 702232 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o 729912 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o 877680 src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o 885624 src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o 919072 src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o 941776 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o 1055248 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o 1233304 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_compare.cc.o 1265160 src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o 1343480 
src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csf_converter.cc.o 1346928 src/arrow/CMakeFiles/arrow_objlib.dir/array.cc.o 1502568 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_hash.cc.o 1609760 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_numeric.cc.o 1794416 src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o 2759552 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_filter.cc.o 7609432 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_take.cc.o {code} was: We're reaching a point where we may need to be careful about decisions that increase code size: * Instantiating too many templates for code that isn't performance sensitive * Inlining functions that don't need to be inline Code size tends to correlate also with compilation times, but not always. I'll use this umbrella issue to organize issues related to reducing compiled code size At this moment (2020-05-27), here are the 25 largest object files in a -O2 build {code} 524896 src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o 531920 src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o 552000 src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o 575920 src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o 595112 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o 645728 src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o 683040 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o 702232 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o 729912 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o 877680 src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o 885624 src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o 919072 src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o 941776 
src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o 1055248 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o 1233304 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_compare.cc.o 1265160 src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o 1343480 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csf_converter.cc.o 1346928 src/arrow/CMakeFiles/arrow_objlib.dir/array.cc.o 1502568 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_hash.cc.o 1609760 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_numeric.cc.o 1794416 src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o 2759552 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_filter.cc.o 7609432 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_take.cc.o {code} > [C++] Reduce shared library code size (umbrella issue) > -- > > Key: ARROW-8970 >
[jira] [Updated] (ARROW-8970) [C++] Reduce shared library code size (umbrella issue)
[ https://issues.apache.org/jira/browse/ARROW-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8970: Description: We're reaching a point where we may need to be careful about decisions that increase code size: * Instantiating too many templates for code that isn't performance sensitive * Inlining functions that don't need to be inline Code size tends to correlate also with compilation times, but not always. I'll use this umbrella issue to organize issues related to reducing compiled code size At this moment (2020-05-27), here are the 25 largest object files in a -O2 build {code} 524896 src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o 531920 src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o 552000 src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o 575920 src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o 595112 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o 645728 src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o 683040 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o 702232 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o 729912 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o 877680 src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o 885624 src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o 919072 src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o 941776 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o 1055248 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o 1233304 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_compare.cc.o 1265160 src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o 1343480 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csf_converter.cc.o 1346928 src/arrow/CMakeFiles/arrow_objlib.dir/array.cc.o 
1502568 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_hash.cc.o 1609760 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_numeric.cc.o 1794416 src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o 2759552 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_filter.cc.o 7609432 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_take.cc.o {code} was: We're reaching a point where we may need to be careful about decisions that increase code size: * Instantiating too many templates for code that isn't performance sensitive * Inlining functions that don't need to be inline Code size tends to correlate also with compilation times, but not always. I'll use this umbrella issue to organize issues related to reducing compiled code size > [C++] Reduce shared library code size (umbrella issue) > -- > > Key: ARROW-8970 > URL: https://issues.apache.org/jira/browse/ARROW-8970 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > We're reaching a point where we may need to be careful about decisions that > increase code size: > * Instantiating too many templates for code that isn't performance sensitive > * Inlining functions that don't need to be inline > Code size tends to correlate also with compilation times, but not always. 
> I'll use this umbrella issue to organize issues related to reducing compiled > code size > At this moment (2020-05-27), here are the 25 largest object files in a -O2 > build > {code} > 524896 src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o > 531920 src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o > 552000 src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o > 575920 src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o > 595112 > src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o > 645728 src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o > 683040 > src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o > 702232 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o > 729912 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o > 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o > 752776 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o > 877680 src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o > 885624 src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o > 919072 src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o > 941776 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o > 1055248 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o > 1233304 >
[jira] [Created] (ARROW-8970) [C++] Reduce shared library code size (umbrella issue)
Wes McKinney created ARROW-8970: --- Summary: [C++] Reduce shared library code size (umbrella issue) Key: ARROW-8970 URL: https://issues.apache.org/jira/browse/ARROW-8970 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney We're reaching a point where we may need to be careful about decisions that increase code size: * Instantiating too many templates for code that isn't performance sensitive * Inlining functions that don't need to be inline Code size tends to correlate also with compilation times, but not always. I'll use this umbrella issue to organize issues related to reducing compiled code size -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7784) [C++] diff.cc is extremely slow to compile
[ https://issues.apache.org/jira/browse/ARROW-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-7784: Issue Type: Improvement (was: Bug) > [C++] diff.cc is extremely slow to compile > -- > > Key: ARROW-7784 > URL: https://issues.apache.org/jira/browse/ARROW-7784 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 1.0.0 > > > This comes up especially when doing an optimized build. {{diff.cc}} is always > enabled even if all components are disabled, and it takes multiple seconds to > compile. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8793) [C++] BitUtil::SetBitsTo probably doesn't need to be inline
[ https://issues.apache.org/jira/browse/ARROW-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8793: --- Assignee: Wes McKinney > [C++] BitUtil::SetBitsTo probably doesn't need to be inline > --- > > Key: ARROW-8793 > URL: https://issues.apache.org/jira/browse/ARROW-8793 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Inlining this function probably does not yield meaningful performance benefits -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8957) [FlightRPC][C++] Fail to build due to IpcOptions
[ https://issues.apache.org/jira/browse/ARROW-8957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8957. - Resolution: Fixed Issue resolved by pull request 7277 [https://github.com/apache/arrow/pull/7277] > [FlightRPC][C++] Fail to build due to IpcOptions > > > Key: ARROW-8957 > URL: https://issues.apache.org/jira/browse/ARROW-8957 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8969) [C++] Reduce generated code in compute/kernels/scalar_compare.cc
[ https://issues.apache.org/jira/browse/ARROW-8969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8969: Description: We are instantiating multiple versions of templates in this module for cases that, byte-wise, do the exact same comparison. For example: * For equals, not_equals, we can use the same 32-bit/64-bit comparison kernels for signed int / unsigned int / floating point types of the same byte width * TimestampType can reuse int64 kernels, similarly for other date/time types * BinaryType/StringType can share kernels etc. was: We are instantiating templates in this module for cases that, byte-wise, do the exact same comparison. For example: * For equals, not_equals, we can use the same 32-bit/64-bit comparison kernels for signed int / unsigned int / floating point types of the same byte width * TimestampType can reuse int64 kernels, similarly for other date/time types * BinaryType/StringType can share kernels etc. > [C++] Reduce generated code in compute/kernels/scalar_compare.cc > > > Key: ARROW-8969 > URL: https://issues.apache.org/jira/browse/ARROW-8969 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > We are instantiating multiple versions of templates in this module for cases > that, byte-wise, do the exact same comparison. For example: > * For equals, not_equals, we can use the same 32-bit/64-bit comparison > kernels for signed int / unsigned int / floating point types of the same byte > width > * TimestampType can reuse int64 kernels, similarly for other date/time types > * BinaryType/StringType can share kernels > etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8969) [C++] Reduce generated code in compute/kernels/scalar_compare.cc
Wes McKinney created ARROW-8969: --- Summary: [C++] Reduce generated code in compute/kernels/scalar_compare.cc Key: ARROW-8969 URL: https://issues.apache.org/jira/browse/ARROW-8969 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 We are instantiating templates in this module for cases that, byte-wise, do the exact same comparison. For example: * For equals, not_equals, we can use the same 32-bit/64-bit comparison kernels for signed int / unsigned int / floating point types of the same byte width * TimestampType can reuse int64 kernels, similarly for other date/time types * BinaryType/StringType can share kernels etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
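The reuse described in ARROW-8969 works for integer equality because signed and unsigned values of the same byte width compare equal exactly when their bytes are equal (floating point is subtler: NaN and -0.0 make bitwise and arithmetic equality differ). A minimal sketch of a width-based shared kernel, with hypothetical names, not Arrow's actual kernel machinery:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// One equality kernel instantiated per byte width, reused across logical
// types: int32/uint32/date32 equality is the same byte comparison, so a
// single uint32-based instantiation can serve all of them, shrinking the
// number of emitted template bodies.
template <typename Physical, typename Logical>
bool EqualsViaPhysical(Logical a, Logical b) {
  static_assert(sizeof(Physical) == sizeof(Logical), "same byte width");
  Physical pa, pb;
  std::memcpy(&pa, &a, sizeof(pa));
  std::memcpy(&pb, &b, sizeof(pb));
  return pa == pb;
}
```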
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118061#comment-17118061 ] Wes McKinney commented on ARROW-8961: - Ah great. I see that utf8proc includes a 1.5 MB data file, so we shouldn't be too cavalier about vendoring it. If utf8proc is only required when {{-DARROW_COMPUTE=ON}} then perhaps we can just add it as a normal thirdparty toolchain library > [C++] Vendor utf8proc library > - > > Key: ARROW-8961 > URL: https://issues.apache.org/jira/browse/ARROW-8961 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > This is a minimal MIT-licensed library for UTF-8 data processing originally > developed for use in Julia > https://github.com/JuliaStrings/utf8proc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8967) [Python] [Parquet] pyarrow.Table.to_pandas() fails to convert valid TIMESTAMP_MILLIS to pandas timestamp
[ https://issues.apache.org/jira/browse/ARROW-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118060#comment-17118060 ] Wes McKinney commented on ARROW-8967: - I'm not sure if this is fixable, since pandas datetime64 data uses the nanosecond unit. [~jorisvandenbossche] do you know? [~markwaddle] you can read this file fine into Arrow format, so it isn't true that "there is no way to read this file". You just can't convert out of bounds timestamps to pandas format at the moment. > [Python] [Parquet] pyarrow.Table.to_pandas() fails to convert valid > TIMESTAMP_MILLIS to pandas timestamp > > > Key: ARROW-8967 > URL: https://issues.apache.org/jira/browse/ARROW-8967 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.0 >Reporter: Mark Waddle >Priority: Major > > reading a parquet file with a valid TIMESTAMP_MILLIS value of -6155291520 > (0019-06-20) results in the following error > {noformat} > File "pyarrow/array.pxi", line 587, in > pyarrow.lib._PandasConvertible.to_pandas > File "pyarrow/table.pxi", line 1640, in pyarrow.lib.Table._to_pandas > File > "/Users/mark/.local/share/virtualenvs/parquetpy-BNIqCtDj/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 766, in table_to_blockmanager > blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes) > File > "/Users/mark/.local/share/virtualenvs/parquetpy-BNIqCtDj/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 1102, in _table_to_blocks > list(extension_columns.keys())) > File "pyarrow/table.pxi", line 1107, in pyarrow.lib.table_to_blocks > File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Casting from timestamp[ms] to timestamp[ns] would > result in out of bounds timestamp: -6155291520 > {noformat} > as it stands there is no way to read this file > i would like to be able to choose the timestamp unit when reading, much like > you can when writing. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
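The out-of-bounds failure in ARROW-8967 comes from the ms-to-ns conversion: casting timestamp[ms] to timestamp[ns] multiplies by 1,000,000, so any millisecond value whose product does not fit in int64 cannot be represented. A minimal sketch of such a bounds check (a hypothetical helper, not pyarrow's actual code):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Converting timestamp[ms] to timestamp[ns] multiplies by 1,000,000.
// Reject inputs whose product would overflow int64_t instead of wrapping.
bool MillisToNanos(int64_t ms, int64_t* out_ns) {
  constexpr int64_t kFactor = 1000000;
  constexpr int64_t kMax = std::numeric_limits<int64_t>::max() / kFactor;
  constexpr int64_t kMin = std::numeric_limits<int64_t>::min() / kFactor;
  if (ms > kMax || ms < kMin) return false;  // would be out of bounds
  *out_ns = ms * kFactor;
  return true;
}
```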
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117932#comment-17117932 ] Wes McKinney commented on ARROW-8961: - [~uwe] I would say it would be worth going ahead and adding utf8proc to conda-forge if it is not there already. > [C++] Vendor utf8proc library > - > > Key: ARROW-8961 > URL: https://issues.apache.org/jira/browse/ARROW-8961 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > This is a minimal MIT-licensed library for UTF-8 data processing originally > developed for use in Julia > https://github.com/JuliaStrings/utf8proc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8963) [C++][Parquet] Parquet cpp optimize allocate memory
[ https://issues.apache.org/jira/browse/ARROW-8963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8963: Summary: [C++][Parquet] Parquet cpp optimize allocate memory (was: Parquet cpp optimize allocate memory) > [C++][Parquet] Parquet cpp optimize allocate memory > --- > > Key: ARROW-8963 > URL: https://issues.apache.org/jira/browse/ARROW-8963 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Affects Versions: 0.17.1 >Reporter: yiming.xu >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > LeafReader::NextBatch should Reset memory first, otherwise Reserve will > allocate memory twice -- This message was sent by Atlassian Jira (v8.3.4#803005)
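The double allocation described in ARROW-8963, calling Reserve on a builder that still holds the previous batch so the buffer grows instead of being reused, can be illustrated with a hypothetical builder (not the actual parquet-cpp LeafReader):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical builder: Reserve() grows capacity relative to the current
// size. Without a Reset() between batches, reserving room for the next
// batch allocates on top of the old data instead of reusing it.
class Builder {
 public:
  void Reserve(int64_t n) { data_.reserve(data_.size() + n); }
  void Reset() {
    data_.clear();
    data_.shrink_to_fit();
  }
  void Append(int v) { data_.push_back(v); }
  int64_t capacity() const { return static_cast<int64_t>(data_.capacity()); }

 private:
  std::vector<int> data_;
};
```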
[jira] [Created] (ARROW-8966) [C++] Move arrow::ArrayData to a separate header file
Wes McKinney created ARROW-8966: --- Summary: [C++] Move arrow::ArrayData to a separate header file Key: ARROW-8966 URL: https://issues.apache.org/jira/browse/ARROW-8966 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 There are code modules (such as compute kernels) that only require ArrayData for doing computations, so pulling in all the code in array.h is not necessary. There are probably other code paths that might benefit from this also. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7384) [Website] Fix search indexing warning reported by Google
[ https://issues.apache.org/jira/browse/ARROW-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117256#comment-17117256 ] Wes McKinney commented on ARROW-7384: - I would guess it's still a problem. I think it's something to do with our Jekyll website configuration > [Website] Fix search indexing warning reported by Google > > > Key: ARROW-7384 > URL: https://issues.apache.org/jira/browse/ARROW-7384 > Project: Apache Arrow > Issue Type: Bug > Components: Website >Reporter: Wes McKinney >Priority: Major > > I received the following e-mail from Google regarding arrow.apache.org (since > I'm an admin on the Analytics account) > {code} > Top Warnings > Warnings are suggestions for improvement. Some warnings can affect your > appearance on Search; some might be reclassified as errors in the future. The > following warnings were found on your site: > Indexed, though blocked by robots.txt > We recommend that you fix these issues when possible to enable the best > experience and coverage in Google Search. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null
[ https://issues.apache.org/jira/browse/ARROW-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117254#comment-17117254 ] Wes McKinney edited comment on ARROW-8956 at 5/27/20, 2:43 AM: --- The only options for this function at the moment are true or false https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.h#L114 This is data structure equality, not elementwise analytic equality was (Author: wesmckinn): The only options for this function at the moment are true or false https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.h#L114 > [C++] arrow::ScalarEquals returns false when values are both null > - > > Key: ARROW-8956 > URL: https://issues.apache.org/jira/browse/ARROW-8956 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > > I wasn't sure if this was deliberate but it appeared while writing unit tests > and so wanted to check what was the intention before changing it. Arrays > compare equal when null slots are respectively null so this seems > inconsistent. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null
[ https://issues.apache.org/jira/browse/ARROW-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117254#comment-17117254 ] Wes McKinney commented on ARROW-8956: - The only options for this function at the moment are true or false https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.h#L114 > [C++] arrow::ScalarEquals returns false when values are both null > - > > Key: ARROW-8956 > URL: https://issues.apache.org/jira/browse/ARROW-8956 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > > I wasn't sure if this was deliberate but it appeared while writing unit tests > and so wanted to check what was the intention before changing it. Arrays > compare equal when null slots are respectively null so this seems > inconsistent. -- This message was sent by Atlassian Jira (v8.3.4#803005)
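The distinction drawn above, structural equality (two nulls are equal, as Array comparison does) versus SQL-style elementwise equality (null = null is unknown), can be sketched with std::optional (illustrative only, not Arrow's API):

```cpp
#include <cassert>
#include <optional>

// Structural equality: two absent values compare equal. This is what Array
// equality does, and what the issue argues ScalarEquals should match.
bool StructuralEquals(std::optional<int> a, std::optional<int> b) {
  return a == b;  // std::optional: both empty -> true
}

// SQL-style three-valued equality: comparing with null yields null
// (unknown), represented here as an empty optional<bool>.
std::optional<bool> SqlEquals(std::optional<int> a, std::optional<int> b) {
  if (!a.has_value() || !b.has_value()) return std::nullopt;
  return *a == *b;
}
```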
[jira] [Created] (ARROW-8961) [C++] Vendor utf8proc library
Wes McKinney created ARROW-8961: --- Summary: [C++] Vendor utf8proc library Key: ARROW-8961 URL: https://issues.apache.org/jira/browse/ARROW-8961 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This is a minimal MIT-licensed library for UTF-8 data processing originally developed for use in Julia https://github.com/JuliaStrings/utf8proc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8922) [C++] Implement example string scalar kernel function to assist with string kernels buildout per ARROW-555
[ https://issues.apache.org/jira/browse/ARROW-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8922: --- Assignee: Wes McKinney > [C++] Implement example string scalar kernel function to assist with string > kernels buildout per ARROW-555 > -- > > Key: ARROW-8922 > URL: https://issues.apache.org/jira/browse/ARROW-8922 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > I will write a patch to provide an example of creating a string-input > string-output kernel for executing scalar-valued string functions -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null
[ https://issues.apache.org/jira/browse/ARROW-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8956: Description: I wasn't sure if this was deliberate but it appeared while writing unit tests and so wanted to check what was the intention before changing it. Arrays compare equal when null slots are respectively null so this seems inconsistent. (was: I wasn't sure if this was deliberate but it appeared while writing unit tests and so wanted to check what was the intention before changing it. ) > [C++] arrow::ScalarEquals returns false when values are both null > - > > Key: ARROW-8956 > URL: https://issues.apache.org/jira/browse/ARROW-8956 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > > I wasn't sure if this was deliberate but it appeared while writing unit tests > and so wanted to check what was the intention before changing it. Arrays > compare equal when null slots are respectively null so this seems > inconsistent. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null
Wes McKinney created ARROW-8956: --- Summary: [C++] arrow::ScalarEquals returns false when values are both null Key: ARROW-8956 URL: https://issues.apache.org/jira/browse/ARROW-8956 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney I wasn't sure if this was deliberate but it appeared while writing unit tests and so wanted to check what was the intention before changing it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
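The inconsistency ARROW-8956 points out can be sketched in a few lines: arrays already treat two null slots as equal, so scalar equality arguably should too. A hedged Python model of the proposed semantics (not Arrow's implementation):

```python
def scalar_equals(a, b):
    """Null-aware scalar equality matching Arrow's array semantics:
    two null scalars of the same type compare equal."""
    if a is None and b is None:
        return True   # both null -> equal (the behavior the issue asks for)
    if a is None or b is None:
        return False  # null vs non-null -> not equal
    return a == b
```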
[jira] [Created] (ARROW-8955) [C++] Use kernels for casting Scalar values instead of bespoke implementation
Wes McKinney created ARROW-8955: --- Summary: [C++] Use kernels for casting Scalar values instead of bespoke implementation Key: ARROW-8955 URL: https://issues.apache.org/jira/browse/ARROW-8955 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 See details of casting in arrow/scalar.cc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-1329) [C++] Define "virtual table" interface
[ https://issues.apache.org/jira/browse/ARROW-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-1329. --- Resolution: Duplicate Closing in favor of ARROW-8939 > [C++] Define "virtual table" interface > -- > > Key: ARROW-1329 > URL: https://issues.apache.org/jira/browse/ARROW-1329 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: dataframe > > The idea is that a virtual table may reference Arrow data that is not yet > available in memory. The implementation will define the semantics of how > columns are loaded into memory. > A virtual column interface will need to accompany this. For example: > {code:language=c++} > std::shared_ptr<VirtualTable> vtable = ...; > std::shared_ptr<VirtualColumn> vcolumn = vtable->column(i); > std::shared_ptr<Column> column = vcolumn->Materialize(); > std::shared_ptr<Table> table = vtable->Materialize(); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8949) [Java] Flight - getInfo() returning 0.0.0.0:47470
[ https://issues.apache.org/jira/browse/ARROW-8949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8949: Summary: [Java] Flight - getInfo() returning 0.0.0.0:47470 (was: Java Flight - getInfo() returning 0.0.0.0:47470) > [Java] Flight - getInfo() returning 0.0.0.0:47470 > - > > Key: ARROW-8949 > URL: https://issues.apache.org/jira/browse/ARROW-8949 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Java >Affects Versions: 0.17.1 >Reporter: Bryce Brooks >Priority: Major > > The code below is incomplete but I thought it would be good to show how I am > connecting. The server is Dremio. The python client works fine when I attempt > a simple test. I am not sure what is going on with the Java client but the > getInfo returns an endpoint location of 0.0.0.0:47470. > The info object i get back is: > info: FlightInfo\{schema=Schema, descriptor=53 45 4C 45 43 > 54 20 54 41 42 4C 45 5F 4E 41 4D 45 20 46 52 4F 4D 20 49 4E 46 4F 52 4D 41 54 > 49 4F 4E 5F 53 43 48 45 4D 41 2E 56 49 45 57 53 , > endpoints=[FlightEndpoint{locations=[Location{uri=grpc+tcp://0.0.0.0:47470}], > ticket=org.apache.arrow.flight.Ticket@1ad0dd97}], bytes=-1, records=-1} > > {code:java} > // > private void testConnect() { > final String host = "somehost"; // removed for security > final int port = 1234; // removed for security > final Location location = Location.forGrpcInsecure(host, port); > try (FlightClient c = flightClient(allocator, location)) { > c.authenticate(new BasicClientAuthHandler("username", "password")); > > String sql = "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.VIEWS"; > > FlightInfo info = c.getInfo(FlightDescriptor.command(sql.getBytes())); > > log.info("info: " + info.toString()); > log.info(" " + info.getDescriptor().toString()); > log.info(" " + info.getSchema().toString()); > log.info(" " + info.getEndpoints().size()); > long total = info.getEndpoints().stream() > .map(this::submit) > .map(DataApiApplication::get) > 
.mapToLong(Long::longValue) > .sum(); > log.info("" + total); >} catch (Exception e) { > log.error("ERROR DURING GET"); > log.error(e.getMessage()); > log.error(e.getLocalizedMessage()); >} > } > private Future submit(FlightEndpoint e) { >int thisEndpoint = endpointsSubmitted.incrementAndGet(); >log.debug("submitting flight endpoint {} with ticket {} to {}", >thisEndpoint, >new String(e.getTicket().getBytes()), >e.getLocations().get(0).getUri()); >RunnableReader reader = new RunnableReader(allocator, e); >Future f = tp.submit(reader); >log.debug("submitted flight endpoint {} with ticket {} to {}", > thisEndpoint, > new String(e.getTicket().getBytes()), > e.getLocations().get(0).getUri()); >return f; > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
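A common client-side workaround for wildcard endpoint addresses like the one in this report (hedged: this does not fix the server-side configuration, it only patches the returned URI) is to substitute the host the client originally connected to whenever an endpoint reports 0.0.0.0. A stdlib sketch:

```python
# If a Flight endpoint reports the wildcard address 0.0.0.0, substitute the
# host the client originally dialed. Scheme mirrors the report (grpc+tcp).
from urllib.parse import urlsplit, urlunsplit

def fix_endpoint_uri(uri, connect_host):
    parts = urlsplit(uri)
    if parts.hostname == "0.0.0.0":
        netloc = f"{connect_host}:{parts.port}" if parts.port else connect_host
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)

fixed = fix_endpoint_uri("grpc+tcp://0.0.0.0:47470", "somehost")
```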
[jira] [Created] (ARROW-8951) [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc
Wes McKinney created ARROW-8951: --- Summary: [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc Key: ARROW-8951 URL: https://issues.apache.org/jira/browse/ARROW-8951 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 The kernel functor can return an uninitialized value on errors {code} ../src/arrow/compute/kernels/scalar_cast_temporal.cc: In member function ‘OUT arrow::compute::internal::ParseTimestamp::Call(arrow::compute::KernelContext*, ARG0) const [with OUT = long int; ARG0 = nonstd::sv_lite::basic_string_view]’: ../src/arrow/compute/kernels/scalar_cast_temporal.cc:267:12: warning: ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized] return result; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8782) [Rust] [DataFusion] Add benchmarks based on NYC Taxi data set
[ https://issues.apache.org/jira/browse/ARROW-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8782. - Resolution: Fixed Issue resolved by pull request 7205 [https://github.com/apache/arrow/pull/7205] > [Rust] [DataFusion] Add benchmarks based on NYC Taxi data set > - > > Key: ARROW-8782 > URL: https://issues.apache.org/jira/browse/ARROW-8782 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > I plan on adding a new benchmarks folder beneath the datafusion crate, > containing benchmarks based on the NYC Taxi data set. The benchmark will be a > CLI and will support running a number of different queries against CSV and > Parquet. > The README will contain instructions for downloading the data set. > The benchmark will produce CSV files containing results. > These benchmarks will allow us to manually verify performance before major > releases and on an ongoing basis as we make changes to > Arrow/Parquet/DataFusion. > I will be basing this on existing benchmarks I recently built in Ballista [1] > (I am the only contributor to these benchmarks so far). > A dockerfile will be provided, making it easy to restrict CPU and RAM when > running these benchmarks. > [1] https://github.com/ballista-compute/ballista/tree/master/rust/benchmarks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8297) [FlightRPC][C++] Implement Flight DoExchange for C++
[ https://issues.apache.org/jira/browse/ARROW-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8297. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6656 [https://github.com/apache/arrow/pull/6656] > [FlightRPC][C++] Implement Flight DoExchange for C++ > > > Key: ARROW-8297 > URL: https://issues.apache.org/jira/browse/ARROW-8297 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > As described in the mailing list vote. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8945) [Python] An independent Cython package for projects that want to program against the C data interface
Wes McKinney created ARROW-8945: --- Summary: [Python] An independent Cython package for projects that want to program against the C data interface Key: ARROW-8945 URL: https://issues.apache.org/jira/browse/ARROW-8945 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Wes McKinney I've been thinking it would be useful to have a minimal Cython package, call it "cyarrow", containing some pxd files and a small amount of compiled pyx code (using a C compiler only) that enables projects written in Cython to interact with Arrow datasets in minimal ways (for example, iterating over their values, interacting with dictionary-encoded/categorical arrays) that don't amount to reimplementation of the "hard stuff" where they would want to utilize pyarrow or the C++ library instead. Otherwise, every Python project that has compiled code in Cython and wants to use the C interface would have to create their own minimal implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8945) [Python] An independent Cython package for Cython-based projects that want to program against the C data interface
[ https://issues.apache.org/jira/browse/ARROW-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8945: Description: I've been thinking it would be useful to have a minimal Cython package, call it "cyarrow", containing some pxd files and a small amount of compiled pyx code (using a C compiler only) that enables projects written in Cython to interact with Arrow datasets in minimal ways (for example, iterating over their values, interacting with dictionary-encoded/categorical arrays) that don't amount to reimplementation of the "hard stuff" where they would want to utilize pyarrow or the C++ library instead. Otherwise, every Python project that has compiled code in Cython and wants to use the C interface (https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst) would have to create their own minimal implementation. Target user for this project would be Python projects like scikit-learn that are mostly written in Cython was: I've been thinking it would be useful to have a minimal Cython package, call it "cyarrow", containing some pxd files and a small amount of compiled pyx code (using a C compiler only) that enables projects written in Cython to interact with Arrow datasets in minimal ways (for example, iterating over their values, interacting with dictionary-encoded/categorical arrays) that don't amount to reimplementation of the "hard stuff" where they would want to utilize pyarrow or the C++ library instead. Otherwise, every Python project that has compiled code in Cython and wants to use the C interface would have to create their own minimal implementation. 
Target user for this project would be Python projects like scikit-learn that are mostly written in Cython > [Python] An independent Cython package for Cython-based projects that want to > program against the C data interface > -- > > Key: ARROW-8945 > URL: https://issues.apache.org/jira/browse/ARROW-8945 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > > I've been thinking it would be useful to have a minimal Cython package, call > it "cyarrow", containing some pxd files and a small amount of compiled pyx > code (using a C compiler only) that enables projects written in Cython to > interact with Arrow datasets in minimal ways (for example, iterating over > their values, interacting with dictionary-encoded/categorical arrays) that > don't amount to reimplementation of the "hard stuff" where they would want to > utilize pyarrow or the C++ library instead. Otherwise, every Python project > that has compiled code in Cython and wants to use the C interface > (https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst) > would have to create their own minimal implementation. > Target user for this project would be Python projects like scikit-learn that > are mostly written in Cython -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8945) [Python] An independent Cython package for Cython-based projects that want to program against the C data interface
[ https://issues.apache.org/jira/browse/ARROW-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8945: Description: I've been thinking it would be useful to have a minimal Cython package, call it "cyarrow", containing some pxd files and a small amount of compiled pyx code (using a C compiler only) that enables projects written in Cython to interact with Arrow datasets in minimal ways (for example, iterating over their values, interacting with dictionary-encoded/categorical arrays) that don't amount to reimplementation of the "hard stuff" where they would want to utilize pyarrow or the C++ library instead. Otherwise, every Python project that has compiled code in Cython and wants to use the C interface would have to create their own minimal implementation. Target user for this project would be Python projects like scikit-learn that are mostly written in Cython was:I've been thinking it would be useful to have a minimal Cython package, call it "cyarrow", containing some pxd files and a small amount of compiled pyx code (using a C compiler only) that enables projects written in Cython to interact with Arrow datasets in minimal ways (for example, iterating over their values, interacting with dictionary-encoded/categorical arrays) that don't amount to reimplementation of the "hard stuff" where they would want to utilize pyarrow or the C++ library instead. Otherwise, every Python project that has compiled code in Cython and wants to use the C interface would have to create their own minimal implementation. 
> [Python] An independent Cython package for Cython-based projects that want to > program against the C data interface > -- > > Key: ARROW-8945 > URL: https://issues.apache.org/jira/browse/ARROW-8945 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > > I've been thinking it would be useful to have a minimal Cython package, call > it "cyarrow", containing some pxd files and a small amount of compiled pyx > code (using a C compiler only) that enables projects written in Cython to > interact with Arrow datasets in minimal ways (for example, iterating over > their values, interacting with dictionary-encoded/categorical arrays) that > don't amount to reimplementation of the "hard stuff" where they would want to > utilize pyarrow or the C++ library instead. Otherwise, every Python project > that has compiled code in Cython and wants to use the C interface would have > to create their own minimal implementation. > Target user for this project would be Python projects like scikit-learn that > are mostly written in Cython -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8945) [Python] An independent Cython package for Cython-based projects that want to program against the C data interface
[ https://issues.apache.org/jira/browse/ARROW-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8945: Summary: [Python] An independent Cython package for Cython-based projects that want to program against the C data interface (was: [Python] An independent Cython package for projects that want to program against the C data interface) > [Python] An independent Cython package for Cython-based projects that want to > program against the C data interface > -- > > Key: ARROW-8945 > URL: https://issues.apache.org/jira/browse/ARROW-8945 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > > I've been thinking it would be useful to have a minimal Cython package, call > it "cyarrow", containing some pxd files and a small amount of compiled pyx > code (using a C compiler only) that enables projects written in Cython to > interact with Arrow datasets in minimal ways (for example, iterating over > their values, interacting with dictionary-encoded/categorical arrays) that > don't amount to reimplementation of the "hard stuff" where they would want to > utilize pyarrow or the C++ library instead. Otherwise, every Python project > that has compiled code in Cython and wants to use the C interface would have > to create their own minimal implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
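The core of what a minimal "cyarrow"-style package would ship is a declaration of the C data interface structs. As a sketch, here is the `ArrowSchema` struct from the published CDataInterface spec expressed with stdlib `ctypes` (the field layout follows the spec; using `c_void_p` for the release callback is a simplification, since the real member is a typed function pointer):

```python
# ctypes declaration of ArrowSchema from the Arrow C data interface.
import ctypes

class ArrowSchema(ctypes.Structure):
    pass

# Fields assigned after the class exists so the struct can reference itself.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),   # release callback, left untyped here
    ("private_data", ctypes.c_void_p),
]

# "i" is the C data interface format string for int32.
schema = ArrowSchema(format=b"i", name=b"ints", n_children=0)
```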
[jira] [Commented] (ARROW-8214) [C++] Flatbuffers based serialization protocol for Expressions
[ https://issues.apache.org/jira/browse/ARROW-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116339#comment-17116339 ] Wes McKinney commented on ARROW-8214: - Yes, I would definitely like to see that happen. Using Flatbuffers is desirable to avoid the need to link libprotobuf.a > [C++] Flatbuffers based serialization protocol for Expressions > -- > > Key: ARROW-8214 > URL: https://issues.apache.org/jira/browse/ARROW-8214 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Priority: Major > Labels: dataset > > It might provide a more scalable solution for serialization. > cc [~bkietz] [~fsaintjacques] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8772) [C++] Expand SumKernel benchmark to more types
[ https://issues.apache.org/jira/browse/ARROW-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8772. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7267 [https://github.com/apache/arrow/pull/7267] > [C++] Expand SumKernel benchmark to more types > -- > > Key: ARROW-8772 > URL: https://issues.apache.org/jira/browse/ARROW-8772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Expand SumKernel benchmark to cover more types: Float, Double, Int8, Int16, > Int32, Int64. > Currently it only has an Int64 entry; broader coverage is useful for further > optimization work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
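The shape of a benchmark parameterized over element types can be sketched with only the stdlib (hedged: this is not the Google Benchmark code the C++ patch uses; `array` typecodes merely stand in for Arrow's Int8/Int16/Int32/Int64/Float/Double types):

```python
# Sum benchmark parameterized over element type, stdlib only.
import array
import timeit

# Typecodes standing in for Int8, Int16, Int32, Int64, Float, Double.
TYPECODES = ("b", "h", "l", "q", "f", "d")

def bench_sum(typecode, n=10_000, repeats=3):
    # Small values so they fit even in the signed 8-bit "b" type.
    if typecode in ("f", "d"):
        data = array.array(typecode, (float(i % 100) for i in range(n)))
    else:
        data = array.array(typecode, (i % 100 for i in range(n)))
    # Best of `repeats` runs, 10 summations each.
    return min(timeit.repeat(lambda: sum(data), number=10, repeat=repeats))

timings = {tc: bench_sum(tc) for tc in TYPECODES}
```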
[jira] [Updated] (ARROW-8860) [C++] IPC/Feather decompression broken for nested arrays
[ https://issues.apache.org/jira/browse/ARROW-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8860: Priority: Critical (was: Major) > [C++] IPC/Feather decompression broken for nested arrays > > > Key: ARROW-8860 > URL: https://issues.apache.org/jira/browse/ARROW-8860 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Priority: Critical > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > When writing a table with a Struct typed column, this is read back with > garbage values when using compression (which is the default): > {code:python} > >>> table = pa.table({'col': pa.StructArray.from_arrays([[0, 1, 2], [1, 2, 3]], names=["f1", "f2"])}) > # roundtrip through feather > >>> feather.write_feather(table, "test_struct.feather") > >>> table2 = feather.read_table("test_struct.feather") > >>> table2.column("col") > > [ > -- is_valid: all not null > -- child 0 type: int64 > [ > 24, > 1261641627085906436, > 1369095386551025664 > ] > -- child 1 type: int64 > [ > 24, > 1405756815161762308, > 281479842103296 > ] > ] > {code} > When not using compression, it is read back correctly: > {code:python} > >>> feather.write_feather(table, "test_struct.feather", compression="uncompressed") > >>> table2 = feather.read_table("test_struct.feather") > >>> table2.column("col") > > [ > -- is_valid: all not null > -- child 0 type: int64 > [ > 0, > 1, > 2 > ] > -- child 1 type: int64 > [ > 1, > 2, > 3 > ] > ] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8873) [Plasma][C++] Usage model for Object IDs. Object IDs don't disappear after delete
[ https://issues.apache.org/jira/browse/ARROW-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8873: Summary: [Plasma][C++] Usage model for Object IDs. Object IDs don't disappear after delete (was: Usage model for Object IDs. Object IDs don't disappear after delete) > [Plasma][C++] Usage model for Object IDs. Object IDs don't disappear after > delete > - > > Key: ARROW-8873 > URL: https://issues.apache.org/jira/browse/ARROW-8873 > Project: Apache Arrow > Issue Type: Test > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Abe Mammen >Priority: Major > > I have an environment that uses Arrow + Plasma to send requests between > Python clients and a C++ server that responds with search results etc. > I use a sequence number based approach for Object ID creation so it's > understood on both sides. All that works well. So each request from the > client creates a unique Object ID, creates and seals it etc. On the other > end, a get against that Object ID retrieves the request payload, releases and > deletes the Object ID. A similar response scheme for Object IDs is used from > the server side to the client to get search results etc where it creates its > own unique Object ID understood by the client. The server side creates and > seals and the Python client side does a get and deletes the Object ID (there > is no release method in Python it appears). I have experimented with deleting > the plasma buffer. > The end result is that as transactions build up, the server side memory use > goes way up and I can see that a good # of the objects aren't deleted from > the Plasma store until the server exits. I have nulled out the search result > part too so that is not what is accumulating. I have not done a memory > profile but wanted to get some feedback on what might be wrong here. > Is there a better way to use Object IDs for example? And what might be > causing the huge memory usage. 
In this example, I had ~4M transactions > between clients and the server which hit a memory usage of about 10+ GB which > is in the ballpark of the size of all the payloads. Besides doing > release-deletes on Object IDs, is there a better way to purge and remove > these objects? > Any help is appreciated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
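The lifecycle the report walks through (create and seal, get, release, delete) can be modeled with a toy store. One plausible explanation for the growth, sketched below and hedged as a model rather than Plasma's actual implementation, is that a delete cannot free memory while a client still holds an unreleased reference:

```python
# Toy model of a Plasma-style object lifecycle: deletion only frees memory
# once every client reference has been released.
class ToyObjectStore:
    def __init__(self):
        self._objects = {}   # object_id -> payload
        self._refs = {}      # object_id -> outstanding client references

    def create_and_seal(self, object_id, payload):
        self._objects[object_id] = payload
        self._refs[object_id] = 0

    def get(self, object_id):
        self._refs[object_id] += 1   # client now holds a reference
        return self._objects[object_id]

    def release(self, object_id):
        self._refs[object_id] -= 1

    def delete(self, object_id):
        # Freeing is deferred while references remain outstanding.
        if self._refs.get(object_id, 0) == 0:
            self._objects.pop(object_id, None)
            self._refs.pop(object_id, None)
            return True
        return False

store = ToyObjectStore()
store.create_and_seal("req-1", b"payload")
_ = store.get("req-1")
leaked = not store.delete("req-1")   # still referenced: not freed
store.release("req-1")
freed = store.delete("req-1")        # now it can go
```

Under this model, a client that gets without releasing pins every payload in the store, which matches the reported pattern of memory persisting until the process exits.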
[jira] [Updated] (ARROW-8801) [Python] Memory leak on read from parquet file with UTC timestamps using pandas
[ https://issues.apache.org/jira/browse/ARROW-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8801: Priority: Blocker (was: Major) > [Python] Memory leak on read from parquet file with UTC timestamps using > pandas > --- > > Key: ARROW-8801 > URL: https://issues.apache.org/jira/browse/ARROW-8801 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0, 0.17.0 > Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, python 3.7.5, > mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, > ubuntu 20.04 (linux). >Reporter: Rauli Ruohonen >Priority: Blocker > Fix For: 1.0.0 > > > Given dump.py script > > {code:java} > import pandas as pd > import numpy as np > x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', > utc=True) > pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', > compression=None) > {code} > and load.py script > > {code:java} > import sys > import pandas as pd > def foo(engine): > for _ in range(2**9): > pd.read_parquet('data.parquet', engine=engine) > print('Done') > input() > foo(sys.argv[1]) > {code} > running first "python dump.py" and then "python load.py pyarrow", on my > machine python memory usage stays at 4+ GB while it waits for input. If using > "python load.py fastparquet" instead, it is about 100 MB, so it should be a > pyarrow issue instead of a pandas issue. The leak disappears if "utc=True" is > removed from dump.py, in which case the timestamp is timezone-unaware. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
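Leaks of this shape can also be confirmed without watching process memory by hand: `tracemalloc` snapshots before and after a repeated call show growth proportional to the iteration count. A stdlib sketch, where `leaky()` is a stand-in for the `read_parquet` call in the report:

```python
# Detect per-call memory retention with tracemalloc.
import tracemalloc

_cache = []
def leaky():
    _cache.append(bytearray(10_000))   # simulates memory retained per call

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
for _ in range(100):
    leaky()
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()
growth = after - before   # grows with the loop count; a non-leaky call stays flat
```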
[jira] [Closed] (ARROW-8580) Pyarrow exceptions are not helpful
[ https://issues.apache.org/jira/browse/ARROW-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-8580. --- Resolution: Cannot Reproduce If you can provide a reproducible example of such an unhelpful error message, we will certainly fix it > Pyarrow exceptions are not helpful > -- > > Key: ARROW-8580 > URL: https://issues.apache.org/jira/browse/ARROW-8580 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Soroush Radpour >Priority: Major > > I'm trying to understand an exception in the code using pyarrow, and it is > not very helpful. > File "pyarrow/_parquet.pyx", line 1036, in pyarrow._parquet.ParquetReader.open > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: IOError: b'Service Unavailable'. Detail: Python exception: > RuntimeError > > It would be great if each of the three exceptions was unwrapped with full > stack trace and error messages that came with it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
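The unwrapping the reporter asks for is achievable in plain Python: exceptions raised with `raise ... from ...` keep their causes on `__cause__` (and implicitly on `__context__`), so a wrapper can walk the chain and surface every layer. A sketch, with hypothetical error messages modeled on the report:

```python
# Walk an exception chain and collect every layer's message.
def chain_messages(exc):
    """Return the message of exc and of every exception it was raised from."""
    messages = []
    while exc is not None:
        messages.append(f"{type(exc).__name__}: {exc}")
        exc = exc.__cause__ or exc.__context__
    return messages

try:
    try:
        raise RuntimeError("token refresh failed")   # hypothetical inner error
    except RuntimeError as inner:
        raise OSError("IOError: b'Service Unavailable'") from inner
except OSError as e:
    layers = chain_messages(e)
```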
[jira] [Updated] (ARROW-8671) [C++] Use IPC body compression metadata approved in ARROW-300
[ https://issues.apache.org/jira/browse/ARROW-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8671: Priority: Critical (was: Major) > [C++] Use IPC body compression metadata approved in ARROW-300 > -- > > Key: ARROW-8671 > URL: https://issues.apache.org/jira/browse/ARROW-8671 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Wes McKinney >Priority: Critical > Fix For: 1.0.0 > > > This will adapt the existing code to use the new metadata, while maintaining > backward compatibility code to recognize the "experimental" metadata written > in 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
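The framing approved in ARROW-300 prefixes each constituent buffer with its 64-bit little-endian uncompressed length, with -1 marking a buffer left uncompressed. A sketch of that per-buffer framing, using stdlib `zlib` as a stand-in for the LZ4_FRAME/ZSTD codecs the format actually specifies:

```python
# Per-buffer compression framing: int64 LE uncompressed-length prefix,
# -1 meaning the buffer body is not compressed. zlib stands in for LZ4/ZSTD.
import struct
import zlib

def frame_buffer(raw):
    return struct.pack("<q", len(raw)) + zlib.compress(raw)

def unframe_buffer(framed):
    (n,) = struct.unpack_from("<q", framed, 0)
    if n == -1:                       # body stored uncompressed
        return framed[8:]
    data = zlib.decompress(framed[8:])
    assert len(data) == n             # length prefix must match the payload
    return data

payload = b"arrow" * 100
roundtrip = unframe_buffer(frame_buffer(payload))
```

Applying this uniformly to every buffer, including the child buffers of nested arrays, is exactly the invariant whose violation produced the nested-array bug tracked in ARROW-8860.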