[arrow] branch master updated (3cc12ab -> 4e51f98)
This is an automated email from the ASF dual-hosted git repository.

shiro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 3cc12ab  ARROW-6172: [Java] Provide benchmarks to set IntVector with different methods
     add 4e51f98  ARROW-6240: [Ruby] Arrow::Decimal128Array#get_value returns BigDecimal

No new revisions were added by this update.

Summary of changes:
 .../lib/arrow/{tensor.rb => decimal128-array.rb} |  6 +++---
 ruby/red-arrow/lib/arrow/loader.rb               |  6 +-
 .../test/test-decimal128-array-builder.rb        | 22 +++---
 ruby/red-arrow/test/test-decimal128-array.rb     |  8
 4 files changed, 23 insertions(+), 19 deletions(-)
 copy ruby/red-arrow/lib/arrow/{tensor.rb => decimal128-array.rb} (91%)
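The change above makes Arrow::Decimal128Array#get_value return an exact BigDecimal rather than a lossy floating-point number. A minimal C++ sketch of the underlying idea, assuming a decimal value is stored as a scaled 128-bit integer (all names here are illustrative, not Arrow's API):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative only: a Decimal128-style value modeled as a scaled integer
// (using the GCC/Clang __int128 extension) plus a scale. Exposing it as a
// "BigDecimal" means formatting the exact decimal digits, never a float.
std::string decimal128_to_string(__int128 value, int32_t scale) {
  bool negative = value < 0;
  unsigned __int128 v = negative ? -static_cast<unsigned __int128>(value)
                                 : static_cast<unsigned __int128>(value);
  std::string digits;
  do {
    digits.insert(digits.begin(), static_cast<char>('0' + int(v % 10)));
    v /= 10;
  } while (v != 0);
  // Pad so there is at least one digit before the decimal point.
  while (static_cast<int32_t>(digits.size()) <= scale) {
    digits.insert(digits.begin(), '0');
  }
  if (scale > 0) digits.insert(digits.end() - scale, '.');
  return (negative ? "-" : "") + digits;
}
```

The point of the formatting loop is that no conversion through a binary float ever happens, so values like 0.005 survive exactly.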
[arrow-site] branch master updated: ARROW-6246: [Website] Add link to R documentation site
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

The following commit(s) were added to refs/heads/master by this push:
     new 41d02ac  ARROW-6246: [Website] Add link to R documentation site
41d02ac is described below

commit 41d02ac5e96fafd3dc7663d5214cdc7cd0dedb26
Author: Neal Richardson
AuthorDate: Thu Aug 15 06:59:43 2019 -0700

    ARROW-6246: [Website] Add link to R documentation site
---
 _includes/header.html | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/_includes/header.html b/_includes/header.html
index 7c02533..4174bae 100644
--- a/_includes/header.html
+++ b/_includes/header.html
@@ -52,9 +52,10 @@
 Project Docs
 Python
 C++
-Java API
-C GLib API
-Javascript API
+Java
+C GLib
+JavaScript
+R
[GitHub] [arrow-site] wesm merged pull request #11: ARROW-6246: [Website] Add link to R documentation site
wesm merged pull request #11: ARROW-6246: [Website] Add link to R documentation site
URL: https://github.com/apache/arrow-site/pull/11

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
[GitHub] [arrow-site] wesm commented on issue #11: ARROW-6246: [Website] Add link to R documentation site
wesm commented on issue #11: ARROW-6246: [Website] Add link to R documentation site
URL: https://github.com/apache/arrow-site/pull/11#issuecomment-521651223

LGTM, thanks
[arrow] branch master updated: ARROW-6180: [C++][Parquet] Add RandomAccessFile::GetStream that returns InputStream that reads a file segment independent of the file's state, fix concurrent buffered Pa
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 2c808a2  ARROW-6180: [C++][Parquet] Add RandomAccessFile::GetStream that returns InputStream that reads a file segment independent of the file's state, fix concurrent buffered Parquet column reads
2c808a2 is described below

commit 2c808a2cbd62300a36d682ebd7bd25ad8b6cd500
Author: Wes McKinney
AuthorDate: Thu Aug 15 11:45:24 2019 -0500

    ARROW-6180: [C++][Parquet] Add RandomAccessFile::GetStream that returns
    InputStream that reads a file segment independent of the file's state,
    fix concurrent buffered Parquet column reads

    This enables different functions to read portions of a RandomAccessFile as
    an InputStream without interfering with each other. This also addresses
    PARQUET-1636 and adds a unit test for buffered column chunk reads. In the
    refactor to use the Arrow IO interfaces, I broke this by allowing the raw
    RandomAccessFile to be passed into multiple BufferedInputStream at once, so
    the file position was being manipulated by different column readers. We
    didn't catch the problem because we didn't have any unit tests, so this
    patch addresses that deficiency.

    Closes #5085 from wesm/ARROW-6180 and squashes the following commits:

    e4ad370d5  Code review comments
    2645bec64  Add unit test that exhibits PARQUET-1636
    76dc71c4f  stub
    3eb0136d1  Finish basic unit tests
    4fd3d610a  Start implementation

    Authored-by: Wes McKinney
    Signed-off-by: Wes McKinney
---
 cpp/src/arrow/io/interfaces.cc  | 66
 cpp/src/arrow/io/interfaces.h   | 10 +
 cpp/src/arrow/io/memory-test.cc | 67
 cpp/src/arrow/testing/random.h  | 33 +++---
 cpp/src/parquet/properties.cc   |  7 ++-
 cpp/src/parquet/properties.h    |  2 +-
 cpp/src/parquet/reader-test.cc  | 96 +
 7 files changed, 262 insertions(+), 19 deletions(-)

diff --git a/cpp/src/arrow/io/interfaces.cc b/cpp/src/arrow/io/interfaces.cc
index 06acb99..8c4f480 100644
--- a/cpp/src/arrow/io/interfaces.cc
+++ b/cpp/src/arrow/io/interfaces.cc
@@ -17,11 +17,15 @@
 #include "arrow/io/interfaces.h"

+#include <algorithm>
 #include
 #include
 #include
+#include <utility>

+#include "arrow/buffer.h"
 #include "arrow/status.h"
+#include "arrow/util/logging.h"
 #include "arrow/util/string_view.h"

 namespace arrow {
@@ -70,5 +74,67 @@ Status Writable::Write(const std::string& data) {

 Status Writable::Flush() { return Status::OK(); }

+class FileSegmentReader : public InputStream {
+ public:
+  FileSegmentReader(std::shared_ptr<RandomAccessFile> file, int64_t file_offset,
+                    int64_t nbytes)
+      : file_(std::move(file)),
+        closed_(false),
+        position_(0),
+        file_offset_(file_offset),
+        nbytes_(nbytes) {
+    FileInterface::set_mode(FileMode::READ);
+  }
+
+  Status CheckOpen() const {
+    if (closed_) {
+      return Status::IOError("Stream is closed");
+    }
+    return Status::OK();
+  }
+
+  Status Close() override {
+    closed_ = true;
+    return Status::OK();
+  }
+
+  Status Tell(int64_t* position) const override {
+    RETURN_NOT_OK(CheckOpen());
+    *position = position_;
+    return Status::OK();
+  }
+
+  bool closed() const override { return closed_; }
+
+  Status Read(int64_t nbytes, int64_t* bytes_read, void* out) override {
+    RETURN_NOT_OK(CheckOpen());
+    int64_t bytes_to_read = std::min(nbytes, nbytes_ - position_);
+    RETURN_NOT_OK(
+        file_->ReadAt(file_offset_ + position_, bytes_to_read, bytes_read, out));
+    position_ += *bytes_read;
+    return Status::OK();
+  }
+
+  Status Read(int64_t nbytes, std::shared_ptr<Buffer>* out) override {
+    RETURN_NOT_OK(CheckOpen());
+    int64_t bytes_to_read = std::min(nbytes, nbytes_ - position_);
+    RETURN_NOT_OK(file_->ReadAt(file_offset_ + position_, bytes_to_read, out));
+    position_ += (*out)->size();
+    return Status::OK();
+  }
+
+ private:
+  std::shared_ptr<RandomAccessFile> file_;
+  bool closed_;
+  int64_t position_;
+  int64_t file_offset_;
+  int64_t nbytes_;
+};
+
+std::shared_ptr<InputStream> RandomAccessFile::GetStream(
+    std::shared_ptr<RandomAccessFile> file, int64_t file_offset, int64_t nbytes) {
+  return std::make_shared<FileSegmentReader>(std::move(file), file_offset, nbytes);
+}
+
 } // namespace io
 } // namespace arrow

diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h
index 678366b..95022e3 100644
--- a/cpp/src/arrow/io/interfaces.h
+++ b/cpp/src/arrow/io/interfaces.h
@@ -144,6 +144,16 @@
 class ARROW_EXPORT RandomAccessFile : public InputStream, public Seekable {
   /// Necessary because we hold a std::unique_ptr
   ~RandomAccessFile() override;

+  /// \brief Create an isolated InputStream that reads a segment of a
+  /// RandomAccessFile. Multiple such stream can be
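The essential property of the FileSegmentReader above is that each stream keeps a private read position and uses positioned reads (ReadAt) against the shared source, so concurrent readers cannot clobber one another through a shared file cursor. A toy, self-contained sketch of that property (types and names are illustrative, not Arrow's):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// A std::string stands in for the shared RandomAccessFile; substr stands in
// for a positioned ReadAt. Each SegmentReader owns its own position_, so two
// readers over the same source never disturb each other's state.
class SegmentReader {
 public:
  SegmentReader(const std::string& file, int64_t offset, int64_t nbytes)
      : file_(file), offset_(offset), nbytes_(nbytes), position_(0) {}

  // Read up to `nbytes` from this segment; advances only our own position.
  std::string Read(int64_t nbytes) {
    int64_t n = std::min(nbytes, nbytes_ - position_);
    std::string out = file_.substr(static_cast<size_t>(offset_ + position_),
                                   static_cast<size_t>(n));
    position_ += n;
    return out;
  }

 private:
  const std::string& file_;  // the shared random-access source
  int64_t offset_;
  int64_t nbytes_;
  int64_t position_;
};

// Interleaved reads from two segments of the same "file" stay independent,
// which is the property the concurrent Parquet column readers needed.
bool interleaved_reads_are_isolated() {
  std::string file = "0123456789";
  SegmentReader a(file, 0, 4);
  SegmentReader b(file, 4, 4);
  return a.Read(2) == "01" && b.Read(2) == "45" &&
         a.Read(2) == "23" && b.Read(2) == "67";
}
```

Before the fix, the equivalent of `a` and `b` shared one seek position, so the interleaving above would return bytes from the wrong offsets.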
[arrow] branch master updated: ARROW-6259: [C++] Add -Wno-extra-semi-stmt when compiling with clang 8 to work around Flatbuffers bug, suppress other new LLVM 8 warnings
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new fb8cb89  ARROW-6259: [C++] Add -Wno-extra-semi-stmt when compiling with clang 8 to work around Flatbuffers bug, suppress other new LLVM 8 warnings
fb8cb89 is described below

commit fb8cb8968fa28c3b3e943cb86dbe5c57d97ea422
Author: Wes McKinney
AuthorDate: Thu Aug 15 19:09:24 2019 -0500

    ARROW-6259: [C++] Add -Wno-extra-semi-stmt when compiling with clang 8 to
    work around Flatbuffers bug, suppress other new LLVM 8 warnings

    LLVM 8 introduces -Wextra-semi-stmt, and Flatbuffers generates code with
    superfluous semicolons (upstream bug report
    https://github.com/google/flatbuffers/issues/5482). This is breaking our
    macOS builds for the last few hours because conda-forge upgraded their
    compiler toolchain from Apple clang 4.0.1 to clang 8.0.0 this afternoon.

    Closes #5096 from wesm/ARROW-6259 and squashes the following commits:

    96cbba9e8  Suppress -Wshadow-field and -Wc++2a-compat also
    686339caf  Add -Wno-extra-semi-stmt when compiling with clang 8 to work around Flatbuffers bug

    Authored-by: Wes McKinney
    Signed-off-by: Wes McKinney
---
 cpp/cmake_modules/SetupCxxFlags.cmake | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake
index 9eba9e8..09d5bf2 100644
--- a/cpp/cmake_modules/SetupCxxFlags.cmake
+++ b/cpp/cmake_modules/SetupCxxFlags.cmake
@@ -168,6 +168,15 @@ if("${BUILD_WARNING_LEVEL}" STREQUAL "CHECKIN")
   if("${COMPILER_VERSION}" VERSION_GREATER "3.9")
     set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-zero-as-null-pointer-constant")
   endif()
+
+  if("${COMPILER_VERSION}" VERSION_GREATER "7.0")
+    # ARROW-6259: Flatbuffers generates code with superfluous semicolons, so
+    # we suppress this warning for now. See upstream bug report
+    # https://github.com/google/flatbuffers/issues/5482
+    set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-extra-semi-stmt \
+-Wno-shadow-field -Wno-c++2a-compat")
+  endif()
+
   set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-unknown-warning-option")
 elseif("${COMPILER_FAMILY}" STREQUAL "gcc")
   set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wall \
[arrow] branch master updated: ARROW-6204: [GLib] Add garrow_array_is_in_chunked_array()
This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 9a6c82e  ARROW-6204: [GLib] Add garrow_array_is_in_chunked_array()
9a6c82e is described below

commit 9a6c82e9799cfb213f8103dfacaf36f5a30f4be8
Author: Yosuke Shiro
AuthorDate: Fri Aug 16 06:34:52 2019 +0900

    ARROW-6204: [GLib] Add garrow_array_is_in_chunked_array()

    This is a follow-up of
    https://github.com/apache/arrow/pull/5047#issuecomment-520103706.

    Closes #5086 from shiro615/glib-isin-chunked-array and squashes the following commits:

    6724dfdc4  Simplify
    6d5105a73  Fix documents
    798b6ed85  Fix test cases for Arrow::Array#is_in_chunked_array
    ad98fd972  Add garrow_array_is_in_chunked_array()

    Authored-by: Yosuke Shiro
    Signed-off-by: Sutou Kouhei
---
 c_glib/arrow-glib/compute.cpp | 39 +-
 c_glib/arrow-glib/compute.h   |  6 +++
 c_glib/test/test-is-in.rb     | 92 ---
 3 files changed, 114 insertions(+), 23 deletions(-)

diff --git a/c_glib/arrow-glib/compute.cpp b/c_glib/arrow-glib/compute.cpp
index b489913..fb33e72 100644
--- a/c_glib/arrow-glib/compute.cpp
+++ b/c_glib/arrow-glib/compute.cpp
@@ -25,6 +25,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1440,7 +1441,43 @@ garrow_array_is_in(GArrowArray *left,
                                      arrow_left_datum,
                                      arrow_right_datum,
                                      &arrow_datum);
-  if (garrow_error_check(error, status, "[array][isin]")) {
+  if (garrow_error_check(error, status, "[array][is-in]")) {
     auto arrow_array = arrow_datum.make_array();
     return GARROW_BOOLEAN_ARRAY(garrow_array_new_raw(&arrow_array));
   } else {
     return NULL;
   }
 }
+
+/**
+ * garrow_array_is_in_chunked_array:
+ * @left: A left hand side #GArrowArray.
+ * @right: A right hand side #GArrowChunkedArray.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: (nullable) (transfer full): The #GArrowBooleanArray
+ *   showing whether each element in the left array is contained
+ *   in right chunked array.
+ *
+ * Since: 0.15.0
+ */
+GArrowBooleanArray *
+garrow_array_is_in_chunked_array(GArrowArray *left,
+                                 GArrowChunkedArray *right,
+                                 GError **error)
+{
+  auto arrow_left = garrow_array_get_raw(left);
+  auto arrow_left_datum = arrow::compute::Datum(arrow_left);
+  auto arrow_right = garrow_chunked_array_get_raw(right);
+  auto arrow_right_datum = arrow::compute::Datum(arrow_right);
+  auto memory_pool = arrow::default_memory_pool();
+  arrow::compute::FunctionContext context(memory_pool);
+  arrow::compute::Datum arrow_datum;
+  auto status = arrow::compute::IsIn(&context,
+                                     arrow_left_datum,
+                                     arrow_right_datum,
+                                     &arrow_datum);
+  if (garrow_error_check(error, status, "[array][is-in-chunked-array]")) {
+    auto arrow_array = arrow_datum.make_array();
+    return GARROW_BOOLEAN_ARRAY(garrow_array_new_raw(&arrow_array));
+  } else {
+    return NULL;
+  }
+}

diff --git a/c_glib/arrow-glib/compute.h b/c_glib/arrow-glib/compute.h
index 3a0b3a8..79e43e8 100644
--- a/c_glib/arrow-glib/compute.h
+++ b/c_glib/arrow-glib/compute.h
@@ -20,6 +20,7 @@
 #pragma once

 #include
+#include

 G_BEGIN_DECLS
@@ -258,5 +259,10 @@
 GArrowBooleanArray *
 garrow_array_is_in(GArrowArray *left,
                    GArrowArray *right,
                    GError **error);
+GARROW_AVAILABLE_IN_0_15
+GArrowBooleanArray *
+garrow_array_is_in_chunked_array(GArrowArray *left,
+                                 GArrowChunkedArray *right,
+                                 GError **error);

 G_END_DECLS

diff --git a/c_glib/test/test-is-in.rb b/c_glib/test/test-is-in.rb
index 1af6ac0..5b1b360 100644
--- a/c_glib/test/test-is-in.rb
+++ b/c_glib/test/test-is-in.rb
@@ -18,31 +18,79 @@ class TestIsIn < Test::Unit::TestCase
   include Helper::Buildable

-  def test_no_null
-    left_array = build_int16_array([1, 0, 1, 2])
-    right_array = build_int16_array([2, 0])
-    assert_equal(build_boolean_array([false, true, false, true]),
-                 left_array.is_in(right_array))
-  end
+  sub_test_case("Array") do
+    def test_no_null
+      left = build_int16_array([1, 0, 1, 2])
+      right = build_int16_array([2, 0])
+      assert_equal(build_boolean_array([false, true, false, true]),
+                   left.is_in(right))
+    end

-  def test_null_in_left_array
-    left_array = build_int16_array([1, 0, nil, 2])
-    right_array = build_int16_array([2, 0, 3])
-    assert_equal(build_boolean_array([false, true, nil, true]),
-                 left_array.is_in(right_array))
-  end
+    def test_null_in_left
+      left =
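The semantics the new function exposes can be sketched in a few lines of plain C++ (this illustrates only the is_in operation itself, not the GLib C API): for each element of the left array, report whether it occurs anywhere on the right, where the chunked-array variant simply treats the right-hand side as several vectors concatenated.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Sketch of is_in over a chunked right-hand side: build one membership set
// from all chunks, then test each left element against it.
std::vector<bool> is_in(const std::vector<int16_t>& left,
                        const std::vector<std::vector<int16_t>>& right_chunks) {
  std::unordered_set<int16_t> members;
  for (const auto& chunk : right_chunks) {
    members.insert(chunk.begin(), chunk.end());
  }
  std::vector<bool> out;
  out.reserve(left.size());
  for (int16_t value : left) {
    out.push_back(members.count(value) > 0);
  }
  return out;
}
```

With left [1, 0, 1, 2] and right chunks [2] and [0], this yields [false, true, false, true], matching the test data in the diff above.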
[arrow] branch master updated: ARROW-6170: [R] Faster docker-compose build
This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new ea91067  ARROW-6170: [R] Faster docker-compose build
ea91067 is described below

commit ea9106798993c9b54127c1a6f1b13a6aa394f9de
Author: Antoine Pitrou
AuthorDate: Fri Aug 16 07:08:24 2019 +0900

    ARROW-6170: [R] Faster docker-compose build

    Use parallel package compilation and installation.

    Closes #5039 from pitrou/ARROW-6170-faster-build-r and squashes the following commits:

    5ef5f06df  Hopefully appease lint thing
    c40eca821  ARROW-6170: Faster docker-compose build

    Authored-by: Antoine Pitrou
    Signed-off-by: Sutou Kouhei
---
 .dockerignore |  3 +++
 r/Dockerfile  | 11 ++++++++---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/.dockerignore b/.dockerignore
index 16bdebb..64e3890 100644
--- a/.dockerignore
+++ b/.dockerignore
@@ -38,6 +38,9 @@
 cpp/.idea
 cpp/build
 cpp/*-build
 cpp/*_build
+cpp/build-debug
+cpp/build-release
+cpp/build-test
 cpp/Testing
 cpp/thirdparty
 !cpp/thirdparty/jemalloc

diff --git a/r/Dockerfile b/r/Dockerfile
index a43ac20..01262bf 100644
--- a/r/Dockerfile
+++ b/r/Dockerfile
@@ -60,9 +60,14 @@
 ENV ARROW_R_DEV=TRUE
 ENV PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/build/cpp/src/arrow:/opt/conda/lib/pkgconfig
 ENV LD_LIBRARY_PATH=/opt/conda/lib/:/build/cpp/src/arrow:/arrow/r/src

-RUN Rscript -e "install.packages('devtools', repos = 'http://cran.rstudio.com')" && \
-    Rscript -e "devtools::install_github('romainfrancois/decor')" && \
-    Rscript -e "install.packages(c( \
+# Ensure parallel R package installation
+RUN printf "options(Ncpus = parallel::detectCores())\n" >> /etc/R/Rprofile.site
+# Also ensure parallel compilation of each individual package
+RUN printf "MAKEFLAGS=-j8\n" >> /usr/lib/R/etc/Makeconf
+
+RUN Rscript -e "install.packages('devtools', repos = 'http://cran.rstudio.com')"
+RUN Rscript -e "devtools::install_github('romainfrancois/decor')"
+RUN Rscript -e "install.packages(c( \
       'Rcpp', 'dplyr', 'stringr', 'glue', 'vctrs', \
       'purrr', \
       'assertthat', \
[arrow] branch master updated: ARROW-6186: [Packaging][deb] Add missing headers to libplasma-dev for Ubuntu 16.04
This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new be95f47  ARROW-6186: [Packaging][deb] Add missing headers to libplasma-dev for Ubuntu 16.04
be95f47 is described below

commit be95f4725d72205058a0e732a49163ee82305868
Author: Sutou Kouhei
AuthorDate: Fri Aug 16 06:30:21 2019 +0900

    ARROW-6186: [Packaging][deb] Add missing headers to libplasma-dev for Ubuntu 16.04

    Closes #5050 from kou/packages-linux-ubuntu-xenial-add-missing-plasma-headers
    and squashes the following commits:

    bd4cba03e  Add missing headers to libplasma-dev for Ubuntu 16.04

    Authored-by: Sutou Kouhei
    Signed-off-by: Sutou Kouhei
---
 dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install | 1 +
 1 file changed, 1 insertion(+)

diff --git a/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install b/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install
index d3538d2..fc5904e 100644
--- a/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install
+++ b/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install
@@ -1,3 +1,4 @@
+usr/include/plasma/
 usr/lib/*/libplasma.a
 usr/lib/*/libplasma.so
 usr/lib/*/pkgconfig/plasma.pc
[arrow] branch master updated: ARROW-6130: [Release] Use 0.15.0 as the next release
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 65b2286  ARROW-6130: [Release] Use 0.15.0 as the next release
65b2286 is described below

commit 65b2286e34f857d90245990978e56c5c7ecbb7fb
Author: Sutou Kouhei
AuthorDate: Thu Aug 15 20:39:26 2019 -0500

    ARROW-6130: [Release] Use 0.15.0 as the next release

    See discussion on the mailing list:
    [Discuss] Do a 0.15.0 release before 1.0.0?
    https://lists.apache.org/thread.html/98b59e461c8937d33660214028dcd78a47f52fbb762217d996194941@%3Cdev.arrow.apache.org%3E

    Closes #5007 from kou/release-use-0.15.0-as-the-next-release and squashes the following commits:

    6833dd7e4  Change version to 0.15.0-SNAPSHOT by hand
    66362f8bf  Remove duplicated section
    dac30c581  Update .deb package names for 0.15.0
    b39c0540d  Update versions for 0.15.0-SNAPSHOT

    Authored-by: Sutou Kouhei
    Signed-off-by: Wes McKinney
---
 c_glib/configure.ac                                |   2 +-
 c_glib/meson.build                                 |   2 +-
 cpp/CMakeLists.txt                                 |   2 +-
 csharp/Directory.Build.props                       |   2 +-
 dev/release/rat_exclude_files.txt                  |  50 ++---
 .../linux-packages/debian.ubuntu-xenial/control    |  78 +++
 .../libarrow-cuda-glib15.install}                  |   0
 .../libarrow-cuda15.install}                       |   0
 .../libarrow-dataset15.install}                    |   0
 .../libarrow-glib15.install}                       |   0
 .../libarrow-python15.install}                     |   0
 .../libarrow15.install}                            |   0
 .../libgandiva-glib15.install}                     |   0
 .../libgandiva15.install}                          |   0
 .../libparquet-glib15.install}                     |   0
 .../libparquet15.install}                          |   0
 .../libplasma-glib15.install}                      |   0
 .../libplasma15.install}                           |   0
 dev/tasks/linux-packages/debian/control            |  84 +++
 .../libarrow-cuda-glib15.install}                  |   0
 .../libarrow-cuda15.install}                       |   0
 .../libarrow-dataset15.install}                    |   0
 ...-flight14.install => libarrow-flight15.install} |   0
 .../libarrow-glib15.install}                       |   0
 .../libarrow-python15.install}                     |   0
 .../libarrow15.install}                            |   0
 .../libgandiva-glib15.install}                     |   0
 .../libgandiva15.install}                          |   0
 .../libparquet-glib15.install}                     |   0
 .../libparquet15.install}                          |   0
 .../libplasma-glib15.install}                      |   0
 .../libplasma15.install}                           |   0
 dev/tasks/tasks.yml                                | 248 ++---
 java/adapter/avro/pom.xml                          |   2 +-
 java/adapter/jdbc/pom.xml                          |   2 +-
 java/adapter/orc/pom.xml                           |   2 +-
 java/algorithm/pom.xml                             |   2 +-
 java/flight/pom.xml                                |   2 +-
 java/format/pom.xml                                |   2 +-
 java/gandiva/pom.xml                               |   2 +-
 java/memory/pom.xml                                |   2 +-
 java/performance/pom.xml                           |   2 +-
 java/plasma/pom.xml                                |   2 +-
 java/pom.xml                                       |   2 +-
 java/tools/pom.xml                                 |   2 +-
 java/vector/pom.xml                                |   2 +-
 js/package.json                                    |   2 +-
 matlab/CMakeLists.txt                              |   2 +-
 python/setup.py                                    |   2 +-
 ruby/red-arrow-cuda/lib/arrow-cuda/version.rb      |   2 +-
 ruby/red-arrow/lib/arrow/version.rb                |   2 +-
 ruby/red-gandiva/lib/gandiva/version.rb            |   2 +-
 ruby/red-parquet/lib/parquet/version.rb            |   2 +-
 ruby/red-plasma/lib/plasma/version.rb              |   2 +-
 rust/arrow/Cargo.toml                              |   2 +-
 rust/datafusion/Cargo.toml                         |   6 +-
 rust/datafusion/README.md                          |   2 +-
 rust/parquet/Cargo.toml                            |   4 +-
 rust/parquet/README.md                             |   4 +-
 59 files changed, 264 insertions(+), 264 deletions(-)

diff --git a/c_glib/configure.ac b/c_glib/configure.ac
index 66f88c0..e1eafd8 100644
--- a/c_glib/configure.ac
+++ b/c_glib/configure.ac
@@ -17,7 +17,7 @@
 AC_PREREQ(2.65)

-m4_define([arrow_glib_version], 1.0.0-SNAPSHOT)
+m4_define([arrow_glib_version],
[arrow] branch master updated: ARROW-6249: [Java] Remove useless class ByteArrayWrapper
This is an automated email from the ASF dual-hosted git repository.

emkornfield pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new db6d5dd  ARROW-6249: [Java] Remove useless class ByteArrayWrapper
db6d5dd is described below

commit db6d5dd55492f91ee402c7cda9a2678556c8200e
Author: tianchen
AuthorDate: Thu Aug 15 19:25:33 2019 -0700

    ARROW-6249: [Java] Remove useless class ByteArrayWrapper

    Related to [ARROW-6249](https://issues.apache.org/jira/browse/ARROW-6249).
    This class was introduced into the encoding part to compare byte[] values
    for equality. Since we now compare value/vector equality with the visitor
    API added by ARROW-6022 instead of comparing getObject, this class is no
    longer used.

    Closes #5093 from tianchen92/ARROW-6249 and squashes the following commits:

    ae7e61844  ARROW-6249: Remove useless class ByteArrayWrapper

    Authored-by: tianchen
    Signed-off-by: Micah Kornfield
---
 .../arrow/vector/dictionary/ByteArrayWrapper.java | 52 --
 1 file changed, 52 deletions(-)

diff --git a/java/vector/src/main/java/org/apache/arrow/vector/dictionary/ByteArrayWrapper.java b/java/vector/src/main/java/org/apache/arrow/vector/dictionary/ByteArrayWrapper.java
deleted file mode 100644
index bcfac39..000
--- a/java/vector/src/main/java/org/apache/arrow/vector/dictionary/ByteArrayWrapper.java
+++ /dev/null
@@ -1,52 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *    http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.arrow.vector.dictionary;
-
-import java.util.Arrays;
-
-/**
- * Wrapper class for byte array.
- */
-public class ByteArrayWrapper {
-  private final byte[] data;
-
-  /**
-   * Constructs a new instance.
-   */
-  public ByteArrayWrapper(byte[] data) {
-    if (data == null) {
-      throw new NullPointerException();
-    }
-
-    this.data = data;
-  }
-
-  @Override
-  public boolean equals(Object other) {
-    if (!(other instanceof ByteArrayWrapper)) {
-      return false;
-    }
-
-    return Arrays.equals(data, ((ByteArrayWrapper) other).data);
-  }
-
-  @Override
-  public int hashCode() {
-    return Arrays.hashCode(data);
-  }
-}
[arrow] branch master updated: ARROW-6212: [Java] Support vector rank operation
This is an automated email from the ASF dual-hosted git repository.

emkornfield pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 71b32b9  ARROW-6212: [Java] Support vector rank operation
71b32b9 is described below

commit 71b32b9b87fa9825d2112644c7ce15d6f71b9174
Author: liyafan82
AuthorDate: Thu Aug 15 19:43:36 2019 -0700

    ARROW-6212: [Java] Support vector rank operation

    Given an unsorted vector, we want to get the index of the ith smallest
    element in the vector. This function is supported by the rank operation.
    We provide an implementation that gets the index with the desired rank,
    without sorting the vector (the vector is left intact), and the
    implementation takes O(n) time, where n is the vector length.

    Closes #5066 from liyafan82/fly_0812_rank and squashes the following commits:

    623b08531  Support vector rank operation

    Authored-by: liyafan82
    Signed-off-by: Micah Kornfield
---
 .../apache/arrow/algorithm/rank/VectorRank.java  |  89 +
 .../apache/arrow/algorithm/sort/IndexSorter.java |  16 ++-
 .../arrow/algorithm/rank/TestVectorRank.java     | 146 +
 3 files changed, 249 insertions(+), 2 deletions(-)

diff --git a/java/algorithm/src/main/java/org/apache/arrow/algorithm/rank/VectorRank.java b/java/algorithm/src/main/java/org/apache/arrow/algorithm/rank/VectorRank.java
new file mode 100644
index 000..43c9a5b
--- /dev/null
+++ b/java/algorithm/src/main/java/org/apache/arrow/algorithm/rank/VectorRank.java
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.algorithm.rank;
+
+import java.util.stream.IntStream;
+
+import org.apache.arrow.algorithm.sort.IndexSorter;
+import org.apache.arrow.algorithm.sort.VectorValueComparator;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.util.Preconditions;
+import org.apache.arrow.vector.IntVector;
+import org.apache.arrow.vector.ValueVector;
+
+/**
+ * Utility for calculating ranks of vector elements.
+ * @param <V> the vector type
+ */
+public class VectorRank<V extends ValueVector> {
+
+  private VectorValueComparator<V> comparator;
+
+  /**
+   * Vector indices.
+   */
+  private IntVector indices;
+
+  private final BufferAllocator allocator;
+
+  /**
+   * Constructs a vector rank utility.
+   * @param allocator the allocator to use.
+   */
+  public VectorRank(BufferAllocator allocator) {
+    this.allocator = allocator;
+  }
+
+  /**
+   * Given a rank r, gets the index of the element that is the rth smallest in the vector.
+   * The operation is performed without changing the vector, and takes O(n) time,
+   * where n is the length of the vector.
+   * @param vector the vector from which to get the element index.
+   * @param comparator the criteria for vector element comparison.
+   * @param rank the rank to determine.
+   * @return the element index with the given rank.
+   */
+  public int indexAtRank(V vector, VectorValueComparator<V> comparator, int rank) {
+    Preconditions.checkArgument(rank >= 0 && rank < vector.getValueCount());
+    try {
+      indices = new IntVector("index vector", allocator);
+      indices.allocateNew(vector.getValueCount());
+      IntStream.range(0, vector.getValueCount()).forEach(i -> indices.set(i, i));
+
+      comparator.attachVector(vector);
+      this.comparator = comparator;
+
+      int pos = getRank(0, vector.getValueCount() - 1, rank);
+      return indices.get(pos);
+    } finally {
+      indices.close();
+    }
+  }
+
+  private int getRank(int low, int high, int rank) {
+    int mid = IndexSorter.partition(low, high, indices, comparator);
+    if (mid < rank) {
+      return getRank(mid + 1, high, rank);
+    } else if (mid > rank) {
+      return getRank(low, mid - 1, rank);
+    } else {
+      // mid == rank
+      return mid;
+    }
+  }
+}
diff --git a/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/IndexSorter.java b/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/IndexSorter.java
index d85eb6f..0f03e5c 100644
---
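The getRank recursion above is a quickselect over an index array: partition the indices around a pivot, then recurse into whichever side contains the target rank, so the data vector itself is never reordered. A self-contained C++ sketch of the same technique (illustrative names, not Arrow's Java code; assumes 0 <= rank < data.size()):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Lomuto partition over an index array: reorders idx[low..high] so the pivot
// lands at its final sorted position, and returns that position.
static std::size_t partition_indices(std::vector<std::size_t>& idx,
                                     const std::vector<int>& data,
                                     std::size_t low, std::size_t high) {
  int pivot = data[idx[high]];
  std::size_t i = low;
  for (std::size_t j = low; j < high; ++j) {
    if (data[idx[j]] < pivot) std::swap(idx[i++], idx[j]);
  }
  std::swap(idx[i], idx[high]);
  return i;  // the pivot's rank within [low, high]
}

// Returns the index (into `data`) of the rank-th smallest element
// (rank 0 = smallest); `data` is left intact, expected O(n) time.
std::size_t index_at_rank(const std::vector<int>& data, std::size_t rank) {
  std::vector<std::size_t> idx(data.size());
  for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
  std::size_t low = 0, high = data.size() - 1;
  while (true) {
    std::size_t mid = partition_indices(idx, data, low, high);
    if (mid == rank) return idx[mid];
    if (mid < rank) low = mid + 1; else high = mid - 1;
  }
}
```

Only the side of the partition containing the target rank is revisited, which is what makes the expected cost linear rather than the O(n log n) of a full sort.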
[arrow] branch master updated: ARROW-6199: [Java] Avro adapter avoid potential resource leak.
This is an automated email from the ASF dual-hosted git repository. emkornfield pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new dd4532a ARROW-6199: [Java] Avro adapter avoid potential resource leak. dd4532a is described below commit dd4532a0cdaccf8e7811086bc5360b13ef9a6c36 Author: tianchen AuthorDate: Thu Aug 15 19:49:53 2019 -0700 ARROW-6199: [Java] Avro adapter avoid potential resource leak. Related to [ARROW-6199](https://issues.apache.org/jira/browse/ARROW-6199). Currently, avro consumer interface has no close API, which may cause resource leak like AvroBytesConsumer#cacheBuffer. To resolve this, make consumer extends AutoCloseable and create CompositeAvroConsumer to encompasses consume and close logic. Closes #5059 from tianchen92/ARROW-6199 and squashes the following commits: d60d94c48 fix 42f22da7c clear vectors in close 5b91da75f fix comments 3ffc07600 ARROW-6199: Avro adapter avoid potential resource leak. 
Authored-by: tianchen Signed-off-by: Micah Kornfield --- .../java/org/apache/arrow/AvroToArrowUtils.java| 22 +++ .../arrow/consumers/AvroBooleanConsumer.java | 5 ++ .../apache/arrow/consumers/AvroBytesConsumer.java | 5 ++ .../apache/arrow/consumers/AvroDoubleConsumer.java | 5 ++ .../apache/arrow/consumers/AvroFloatConsumer.java | 5 ++ .../apache/arrow/consumers/AvroIntConsumer.java| 5 ++ .../apache/arrow/consumers/AvroLongConsumer.java | 5 ++ .../apache/arrow/consumers/AvroNullConsumer.java | 5 ++ .../apache/arrow/consumers/AvroStringConsumer.java | 5 ++ .../apache/arrow/consumers/AvroUnionsConsumer.java | 16 +++-- .../arrow/consumers/CompositeAvroConsumer.java | 69 ++ .../java/org/apache/arrow/consumers/Consumer.java | 7 ++- .../arrow/consumers/NullableTypeConsumer.java | 5 ++ 13 files changed, 141 insertions(+), 18 deletions(-) diff --git a/java/adapter/avro/src/main/java/org/apache/arrow/AvroToArrowUtils.java b/java/adapter/avro/src/main/java/org/apache/arrow/AvroToArrowUtils.java index 25611a5..77f34df 100644 --- a/java/adapter/avro/src/main/java/org/apache/arrow/AvroToArrowUtils.java +++ b/java/adapter/avro/src/main/java/org/apache/arrow/AvroToArrowUtils.java @@ -20,7 +20,6 @@ package org.apache.arrow; import static org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE; import static org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE; -import java.io.EOFException; import java.io.IOException; import java.util.ArrayList; import java.util.HashMap; @@ -37,6 +36,7 @@ import org.apache.arrow.consumers.AvroLongConsumer; import org.apache.arrow.consumers.AvroNullConsumer; import org.apache.arrow.consumers.AvroStringConsumer; import org.apache.arrow.consumers.AvroUnionsConsumer; +import org.apache.arrow.consumers.CompositeAvroConsumer; import org.apache.arrow.consumers.Consumer; import org.apache.arrow.consumers.NullableTypeConsumer; import org.apache.arrow.memory.BufferAllocator; @@ -246,19 +246,15 @@ public class AvroToArrowUtils { VectorSchemaRoot 
root = new VectorSchemaRoot(fields, vectors, 0); -int valueCount = 0; -while (true) { - try { -for (Consumer consumer : consumers) { - consumer.consume(decoder); -} -valueCount++; -//reach end will throw EOFException. - } catch (EOFException eofException) { -root.setRowCount(valueCount); -break; - } +CompositeAvroConsumer compositeConsumer = null; +try { + compositeConsumer = new CompositeAvroConsumer(consumers); + compositeConsumer.consume(decoder, root); +} catch (Exception e) { + compositeConsumer.close(); + throw new RuntimeException("Error occurs while consume process.", e); } + return root; } } diff --git a/java/adapter/avro/src/main/java/org/apache/arrow/consumers/AvroBooleanConsumer.java b/java/adapter/avro/src/main/java/org/apache/arrow/consumers/AvroBooleanConsumer.java index b2fe704..c2876f1 100644 --- a/java/adapter/avro/src/main/java/org/apache/arrow/consumers/AvroBooleanConsumer.java +++ b/java/adapter/avro/src/main/java/org/apache/arrow/consumers/AvroBooleanConsumer.java @@ -63,4 +63,9 @@ public class AvroBooleanConsumer implements Consumer { return this.vector; } + @Override + public void close() throws Exception { +writer.close(); + } + } diff --git a/java/adapter/avro/src/main/java/org/apache/arrow/consumers/AvroBytesConsumer.java b/java/adapter/avro/src/main/java/org/apache/arrow/consumers/AvroBytesConsumer.java index 2c649f9..c0cfaec 100644 --- a/java/adapter/avro/src/main/java/org/apache/arrow/consumers/AvroBytesConsumer.java +++ b/java/adapter/avro/src/main/java/org/apache/arrow/consumers/AvroBytesConsumer.java @@ -79,4 +79,9 @@ public class
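The new control flow above — one composite object that owns both the consume loop and the cleanup of every per-field consumer — is a general resource-management pattern. A minimal Python sketch of the idea (hypothetical names and a toy consumer; the actual Java classes are `Consumer` and `CompositeAvroConsumer`):

```python
class ListConsumer:
    """A toy per-field consumer: appends each decoded value to a list."""

    def __init__(self):
        self.values = []
        self.closed = False

    def consume(self, decoder):
        self.values.append(decoder)

    def close(self):
        self.closed = True


class CompositeConsumer:
    """Bundles per-field consumers so consume and close live in one place."""

    def __init__(self, consumers):
        self.consumers = consumers

    def consume(self, decoder):
        # Delegate one value to each child consumer.
        for consumer in self.consumers:
            consumer.consume(decoder)

    def close(self):
        # Close every child even if one close() raises, in the spirit of
        # Java's AutoCloseable: remember the first error, keep closing.
        first_error = None
        for consumer in self.consumers:
            try:
                consumer.close()
            except Exception as exc:
                if first_error is None:
                    first_error = exc
        if first_error is not None:
            raise first_error
```

With this shape, a caller wraps `consume` in a try block and closes the composite once on failure, rather than tracking each consumer individually — the same structure as the rewritten AvroToArrowUtils code above.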
[arrow] branch master updated (91e33dc -> 09bb8b8)
This is an automated email from the ASF dual-hosted git repository. emkornfield pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git. from 91e33dc ARROW-6038: [C++] Faster type equality add 09bb8b8 ARROW-6219: [Java] Add API for JDBC adapter that can convert less than the full result set at a time No new revisions were added by this update. Summary of changes: .../arrow/adapter/jdbc/ArrowVectorIterator.java| 159 +++ .../org/apache/arrow/adapter/jdbc/JdbcToArrow.java | 65 - .../arrow/adapter/jdbc/JdbcToArrowConfig.java | 26 +- .../adapter/jdbc/JdbcToArrowConfigBuilder.java | 10 +- .../arrow/adapter/jdbc/JdbcToArrowUtils.java | 14 +- .../arrow/adapter/jdbc/consumer/ArrayConsumer.java | 7 +- .../adapter/jdbc/consumer/BigIntConsumer.java | 9 +- .../adapter/jdbc/consumer/BinaryConsumer.java | 9 +- .../arrow/adapter/jdbc/consumer/BitConsumer.java | 9 +- .../arrow/adapter/jdbc/consumer/BlobConsumer.java | 9 +- .../arrow/adapter/jdbc/consumer/ClobConsumer.java | 9 +- .../jdbc/consumer/CompositeJdbcConsumer.java | 22 +- .../arrow/adapter/jdbc/consumer/DateConsumer.java | 9 +- .../adapter/jdbc/consumer/DecimalConsumer.java | 9 +- .../adapter/jdbc/consumer/DoubleConsumer.java | 9 +- .../arrow/adapter/jdbc/consumer/FloatConsumer.java | 9 +- .../arrow/adapter/jdbc/consumer/IntConsumer.java | 9 +- .../arrow/adapter/jdbc/consumer/JdbcConsumer.java | 10 +- .../adapter/jdbc/consumer/SmallIntConsumer.java| 9 +- .../arrow/adapter/jdbc/consumer/TimeConsumer.java | 9 +- .../adapter/jdbc/consumer/TimestampConsumer.java | 9 +- .../adapter/jdbc/consumer/TinyIntConsumer.java | 9 +- .../adapter/jdbc/consumer/VarCharConsumer.java | 9 +- .../arrow/adapter/jdbc/JdbcToArrowConfigTest.java | 6 +- .../arrow/adapter/jdbc/h2/JdbcToArrowTest.java | 34 +-- .../jdbc/h2/JdbcToArrowVectorIteratorTest.java | 315 + .../test/resources/h2/test1_all_datatypes_h2.yml | 2 +- .../jdbc/src/test/resources/h2/test1_int_h2.yml| 2 +- 28 files changed, 730 insertions(+), 77 
deletions(-) create mode 100644 java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/ArrowVectorIterator.java create mode 100644 java/adapter/jdbc/src/test/java/org/apache/arrow/adapter/jdbc/h2/JdbcToArrowVectorIteratorTest.java
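ARROW-6219's `ArrowVectorIterator` converts a JDBC result set one bounded batch at a time instead of materializing everything at once. Reduced to a language-neutral Python sketch (illustrative only; the real API lives in `JdbcToArrow` and `JdbcToArrowConfig`):

```python
def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows; only one batch is in memory."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly short, batch.
        yield batch
```

For example, `list(iter_batches(range(5), 2))` produces three bounded batches — `[0, 1]`, `[2, 3]`, `[4]` — rather than one five-row allocation, which is the memory-footprint win the new API targets.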
[arrow] branch master updated: ARROW-5952: [Python] fix conversion of chunked dictionary array with 0 chunks
This is an automated email from the ASF dual-hosted git repository. wesm pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 5479d30 ARROW-5952: [Python] fix conversion of chunked dictionary array with 0 chunks 5479d30 is described below commit 5479d3047a23410de00f50687764a4f4300baba5 Author: Joris Van den Bossche AuthorDate: Thu Aug 15 21:47:38 2019 -0500 ARROW-5952: [Python] fix conversion of chunked dictionary array with 0 chunks https://issues.apache.org/jira/browse/ARROW-5952 Closes #5081 from jorisvandenbossche/ARROW-5952-dictionary-zero-chunks and squashes the following commits: 2f11fb94d Nits 742db0e34 create empty dictionary array of correct type feb06d310 ARROW-5952: fix conversion of chunked dictionary array with 0 chunks Lead-authored-by: Joris Van den Bossche Co-authored-by: Wes McKinney Signed-off-by: Wes McKinney --- cpp/src/arrow/python/arrow_to_pandas.cc | 47 - python/pyarrow/tests/test_pandas.py | 13 + 2 files changed, 47 insertions(+), 13 deletions(-) diff --git a/cpp/src/arrow/python/arrow_to_pandas.cc b/cpp/src/arrow/python/arrow_to_pandas.cc index f97782d..39857d7 100644 --- a/cpp/src/arrow/python/arrow_to_pandas.cc +++ b/cpp/src/arrow/python/arrow_to_pandas.cc @@ -487,7 +487,7 @@ inline Status ConvertNulls(const PandasOptions& options, const ChunkedArray& dat inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& data, PyObject** out_values) { PyAcquireGIL lock; - if (data.num_chunks() <= 0) { + if (data.num_chunks() == 0) { return Status::OK(); } // ChunkedArray has at least one chunk @@ -1042,6 +1042,14 @@ class DatetimeTZBlock : public DatetimeBlock { std::string timezone_; }; +Status MakeZeroLengthArray(const std::shared_ptr& type, + std::shared_ptr* out) { + std::unique_ptr builder; + RETURN_NOT_OK(MakeBuilder(default_memory_pool(), type, )); + RETURN_NOT_OK(builder->Resize(0)); + return 
builder->Finish(out); +} + class CategoricalBlock : public PandasBlock { public: explicit CategoricalBlock(const PandasOptions& options, MemoryPool* pool, @@ -1063,6 +1071,10 @@ class CategoricalBlock : public PandasBlock { using T = typename TRAITS::T; constexpr int npy_type = TRAITS::npy_type; +if (data->num_chunks() == 0) { + RETURN_NOT_OK(AllocateNDArray(npy_type, 1)); + return Status::OK(); +} // Sniff the first chunk const std::shared_ptr arr_first = data->chunk(0); const auto& dict_arr_first = checked_cast(*arr_first); @@ -1132,15 +1144,17 @@ class CategoricalBlock : public PandasBlock { converted_data = out.chunked_array(); } else { // check if all dictionaries are equal - const std::shared_ptr arr_first = data->chunk(0); - const auto& dict_arr_first = checked_cast(*arr_first); + if (data->num_chunks() > 1) { +const std::shared_ptr arr_first = data->chunk(0); +const auto& dict_arr_first = checked_cast(*arr_first); - for (int c = 1; c < data->num_chunks(); c++) { -const std::shared_ptr arr = data->chunk(c); -const auto& dict_arr = checked_cast(*arr); +for (int c = 1; c < data->num_chunks(); c++) { + const std::shared_ptr arr = data->chunk(c); + const auto& dict_arr = checked_cast(*arr); -if (!(dict_arr_first.dictionary()->Equals(dict_arr.dictionary( { - return Status::NotImplemented("Variable dictionary type not supported"); + if (!(dict_arr_first.dictionary()->Equals(dict_arr.dictionary( { +return Status::NotImplemented("Variable dictionary type not supported"); + } } } converted_data = data; @@ -1168,13 +1182,20 @@ class CategoricalBlock : public PandasBlock { } // TODO(wesm): variable dictionaries -auto arr = converted_data->chunk(0); -const auto& dict_arr = checked_cast(*arr); +std::shared_ptr dict; +if (data->num_chunks() == 0) { + // no dictionary values => create empty array + RETURN_NOT_OK(MakeZeroLengthArray(dict_type.value_type(), )); +} else { + auto arr = converted_data->chunk(0); + const auto& dict_arr = checked_cast(*arr); + dict = 
dict_arr.dictionary(); +} placement_data_[rel_placement] = abs_placement; -PyObject* dict; -RETURN_NOT_OK(ConvertArrayToPandas(options_, dict_arr.dictionary(), nullptr, )); -dictionary_.reset(dict); +PyObject* pydict; +RETURN_NOT_OK(ConvertArrayToPandas(options_, dict, nullptr, )); +dictionary_.reset(pydict); ordered_ = dict_type.ordered(); return Status::OK(); diff --git a/python/pyarrow/tests/test_pandas.py b/python/pyarrow/tests/test_pandas.py index 12a6bc3..437fdad
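The bug class fixed here is worth internalizing: code that "sniffs" chunk 0 of a chunked array breaks when the array has zero chunks. A hypothetical Python sketch of the guard the fix adds (`unique_categories` and the chunk tuples are invented for illustration; the real code operates on Arrow `ChunkedArray` and `DictionaryArray` objects):

```python
def unique_categories(chunks):
    """Collect the shared dictionary ("categories") from a list of chunks.

    Each chunk is an (indices, dictionary) pair. With zero chunks there is
    nothing to sniff, so return an empty dictionary instead of indexing
    chunks[0], which would raise.
    """
    if not chunks:
        return []  # zero chunks: empty categories rather than an IndexError
    first_dict = chunks[0][1]
    for _, dictionary in chunks[1:]:
        # All chunks must share one dictionary, mirroring the C++ check.
        if dictionary != first_dict:
            raise NotImplementedError("Variable dictionary type not supported")
    return first_dict
```

The pre-fix C++ code took the `chunks[0]` branch unconditionally, which is analogous to the crash ARROW-5952 fixes for zero-chunk dictionary arrays.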
[arrow] branch master updated: ARROW-6262: [Developer] Show JIRA issue before merging
This is an automated email from the ASF dual-hosted git repository. wesm pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 884ed65 ARROW-6262: [Developer] Show JIRA issue before merging 884ed65 is described below commit 884ed654e26114798fca486e3742caa97a544b7b Author: Sutou Kouhei AuthorDate: Thu Aug 15 21:46:31 2019 -0500 ARROW-6262: [Developer] Show JIRA issue before merging It's useful to confirm whether the associated JIRA issue is right or not. We failed to notice a wrongly associated JIRA issue until after merging pull request https://github.com/apache/arrow/pull/5050. Closes #5097 from kou/dev-merge-show-jira-issue-before-merge and squashes the following commits: 6c9ad5be9 Show JIRA issue before merging Authored-by: Sutou Kouhei Signed-off-by: Wes McKinney --- dev/merge_arrow_pr.py | 47 +++ 1 file changed, 23 insertions(+), 24 deletions(-) diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py index dfe9e33..7588fef 100755 --- a/dev/merge_arrow_pr.py +++ b/dev/merge_arrow_pr.py @@ -187,12 +187,6 @@ class JiraIssue(object): self.cmd.fail("JIRA issue %s already has status '%s'" % (self.jira_id, cur_status)) -console_output = format_resolved_issue_status(self.jira_id, cur_status, - fields.summary, - fields.assignee, - fields.components) -print(console_output) - resolve = [x for x in self.jira_con.transitions(self.jira_id) if x['name'] == "Resolve Issue"][0] self.jira_con.transition_issue(self.jira_id, resolve["id"], @@ -201,27 +195,31 @@ class JiraIssue(object): print("Successfully resolved %s!" % (self.jira_id)) +self.issue = self.jira_con.issue(self.jira_id) +self.show() -def format_resolved_issue_status(jira_id, status, summary, assignee, - components): -if assignee is None: -assignee = "NOT ASSIGNED!!!" -else: -assignee = assignee.displayName +def show(self): +fields = self.issue.fields -if len(components) == 0: -components = 'NO COMPONENTS!!!' 
-else: -components = ', '.join((x.name for x in components)) +assignee = fields.assignee +if assignee is None: +assignee = "NOT ASSIGNED!!!" +else: +assignee = assignee.displayName + +components = fields.components +if len(components) == 0: +components = 'NO COMPONENTS!!!' +else: +components = ', '.join((x.name for x in components)) -return """=== JIRA {} === -Summary\t\t{} -Assignee\t{} -Components\t{} -Status\t\t{} -URL\t\t{}/{}""".format(jira_id, summary, assignee, components, status, - '/'.join((JIRA_API_BASE, 'browse')), - jira_id) +print("=== JIRA {} ===".format(self.jira_id)) +print("Summary\t\t{}".format(fields.summary)) +print("Assignee\t{}".format(assignee)) +print("Components\t{}".format(components)) +print("Status\t\t{}".format(fields.status.name)) +print("URL\t\t{}/{}".format('/'.join((JIRA_API_BASE, 'browse')), +self.jira_id)) class GitHubAPI(object): @@ -293,6 +291,7 @@ class PullRequest(object): print("\n=== Pull Request #%s ===" % self.number) print("title\t%s\nsource\t%s\ntarget\t%s\nurl\t%s" % (self.title, self.description, self.target_ref, self.url)) +self.jira_issue.show() @property def is_merged(self):
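The refactored `show()` replaces string-building with direct prints but keeps the same loud fallbacks for missing fields. A standalone sketch of that formatting logic (`format_issue` is a hypothetical helper; the real script reads these fields from the `jira` client library):

```python
def format_issue(jira_id, summary, assignee, components, status,
                 base_url="https://issues.apache.org/jira"):
    """Render the issue block printed by show(), flagging missing fields loudly."""
    # Mirror the script's sentinels for unset assignee/components.
    assignee = "NOT ASSIGNED!!!" if assignee is None else assignee
    components = ", ".join(components) if components else "NO COMPONENTS!!!"
    lines = [
        "=== JIRA {} ===".format(jira_id),
        "Summary\t\t{}".format(summary),
        "Assignee\t{}".format(assignee),
        "Components\t{}".format(components),
        "Status\t\t{}".format(status),
        "URL\t\t{}/browse/{}".format(base_url, jira_id),
    ]
    return "\n".join(lines)
```

Because the same block is printed both before and after resolving, a reviewer can verify the association is correct before the merge rather than discovering a mismatch afterward.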
[arrow] branch master updated: ARROW-6185: [Java] Provide hash table based dictionary builder
This is an automated email from the ASF dual-hosted git repository. emkornfield pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 4b971ee ARROW-6185: [Java] Provide hash table based dictionary builder 4b971ee is described below commit 4b971ee0948bc12ef9955f743882bd1ce3452231 Author: liyafan82 AuthorDate: Thu Aug 15 20:19:45 2019 -0700 ARROW-6185: [Java] Provide hash table based dictionary builder This is related to ARROW-5862. We provide another type of dictionary builder, based on a hash table. Compared with a search-based dictionary encoder, a hash-table-based encoder processes each new element in O(1) time but requires extra memory space. Closes #5054 from liyafan82/fly_0809_hashbuild and squashes the following commits: 77e24531e Provide hash table based dictionary builder Authored-by: liyafan82 Signed-off-by: Micah Kornfield --- .../HashTableBasedDictionaryBuilder.java | 174 ++ .../TestHashTableBasedDictionaryEncoder.java | 203 + 2 files changed, 377 insertions(+) diff --git a/java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/HashTableBasedDictionaryBuilder.java b/java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/HashTableBasedDictionaryBuilder.java new file mode 100644 index 000..eff0f05 --- /dev/null +++ b/java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/HashTableBasedDictionaryBuilder.java @@ -0,0 +1,174 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.algorithm.dictionary; + +import java.util.HashMap; + +import org.apache.arrow.memory.util.ArrowBufPointer; +import org.apache.arrow.memory.util.hash.ArrowBufHasher; +import org.apache.arrow.memory.util.hash.SimpleHasher; +import org.apache.arrow.vector.ElementAddressableVector; + +/** + * A dictionary builder is intended for the scenario frequently encountered in practice: + * the dictionary is not known a priori, so it is generated dynamically. + * In particular, when a new value arrives, it is tested to check if it is already + * in the dictionary. If so, it is simply neglected, otherwise, it is added to the dictionary. + * + * + * This class builds the dictionary based on a hash table. + * Each add operation can be finished in O(1) time, + * where n is the current dictionary size. + * + * + * The dictionary builder is intended to build a single dictionary. + * So it cannot be used for different dictionaries. + * + * Below gives the sample code for using the dictionary builder + * {@code + * HashTableBasedDictionaryBuilder dictionaryBuilder = ... + * ... + * dictionaryBuild.addValue(newValue); + * ... + * } + * + * + * With the above code, the dictionary vector will be populated, + * and it can be retrieved by the {@link HashTableBasedDictionaryBuilder#getDictionary()} method. + * After that, dictionary encoding can proceed with the populated dictionary encoder. + * + * + * @param the dictionary vector type. + */ +public class HashTableBasedDictionaryBuilder { + + /** + * The dictionary to be built. 
+ */ + private final V dictionary; + + /** + * If null should be encoded. + */ + private final boolean encodeNull; + + /** + * The hash map for distinct dictionary entries. + * The key is the pointer to the dictionary element, whereas the value is the index in the dictionary. + */ + private HashMap hashMap = new HashMap<>(); + + /** + * The hasher used for calculating the hash code. + */ + private final ArrowBufHasher hasher; + + /** + * Next pointer to try to add to the hash table. + */ + private ArrowBufPointer nextPointer; + + /** + * Constructs a hash table based dictionary builder. + * + * @param dictionary the dictionary to populate. + */ + public HashTableBasedDictionaryBuilder(V dictionary) { +this(dictionary, false); + } + + /** + * Constructs a hash table based dictionary builder. + * + * @param dictionary the dictionary to populate. + * @param encodeNull if null values should be added to the
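Stripped of Arrow's vector machinery, the hash-table builder is just a map from value to dictionary index plus an append-only list of distinct values. A Python sketch of that core (hypothetical class; the Java version hashes `ArrowBufPointer`s with an `ArrowBufHasher` rather than Python objects):

```python
class HashTableDictionaryBuilder:
    """Accumulate distinct values; each add_value is O(1) via a hash map."""

    def __init__(self):
        self._index = {}       # value -> its position in the dictionary
        self._dictionary = []  # distinct values in insertion order

    def add_value(self, value):
        # Known value: neglect it and return its existing index.
        if value in self._index:
            return self._index[value]
        # New value: append to the dictionary and remember its index.
        idx = len(self._dictionary)
        self._index[value] = idx
        self._dictionary.append(value)
        return idx

    def get_dictionary(self):
        return list(self._dictionary)
```

Because `add_value` already returns the dictionary index, encoding a stream amounts to mapping each input value through the builder while the dictionary grows on the side.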
[arrow] branch master updated: ARROW-6038: [C++] Faster type equality
This is an automated email from the ASF dual-hosted git repository. wesm pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 91e33dc ARROW-6038: [C++] Faster type equality 91e33dc is described below commit 91e33dcb6aa3c05eaf9d9d9f09579bb29e3fe175 Author: Antoine Pitrou AuthorDate: Thu Aug 15 21:29:00 2019 -0500 ARROW-6038: [C++] Faster type equality When checking for type equality, compute and cache a fingerprint of the type so as to avoid costly nested type walking and multiple comparisons. Before: ``` Benchmark Time CPU Iterations TypeEqualsSimple 13 ns 13 ns 55242976 150.558M items/s TypeEqualsComplex 430 ns430 ns1637275 4.43634M items/s TypeEqualsWithMetadata 595 ns595 ns1199216 3.20778M items/s SchemaEquals 1465 ns 1465 ns 479512 1.30226M items/s SchemaEqualsWithMetadata922 ns922 ns 7637522.0683M items/s ``` After: ``` Benchmark Time CPU Iterations TypeEqualsSimple 11 ns 11 ns 65531752 178.723M items/s TypeEqualsComplex20 ns 20 ns 33939830 95.1497M items/s TypeEqualsWithMetadata 31 ns 31 ns 22979555 62.4052M items/s SchemaEquals 40 ns 40 ns 17786532 48.1683M items/s SchemaEqualsWithMetadata 46 ns 46 ns 15173158 41.3242M items/s ``` Closes #4983 from pitrou/ARROW-6038-faster-type-equality and squashes the following commits: 2fdaf4adb ARROW-6038: Faster type equality Authored-by: Antoine Pitrou Signed-off-by: Wes McKinney --- cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/compare.cc | 24 +- cpp/src/arrow/extension_type-test.cc | 11 + cpp/src/arrow/type-benchmark.cc | 170 + cpp/src/arrow/type-test.cc| 268 +++ cpp/src/arrow/type.cc | 354 +- cpp/src/arrow/type.h | 155 ++- cpp/src/arrow/util/key-value-metadata-test.cc | 18 ++ cpp/src/arrow/util/key_value_metadata.cc | 11 + cpp/src/arrow/util/key_value_metadata.h | 2 + integration/integration_test.py | 61 ++--- 11 files changed, 961 insertions(+), 114 deletions(-) diff --git 
a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 0085238..4839fb8 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -381,6 +381,7 @@ add_arrow_test(tensor-test) add_arrow_test(sparse_tensor-test) add_arrow_benchmark(builder-benchmark) +add_arrow_benchmark(type-benchmark) add_subdirectory(array) add_subdirectory(csv) diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 05a1d1f..222d4f9 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -1163,21 +1163,35 @@ bool SparseTensorEquals(const SparseTensor& left, const SparseTensor& right) { } bool TypeEquals(const DataType& left, const DataType& right, bool check_metadata) { - bool are_equal; // The arrays are the same object if ( == ) { -are_equal = true; +return true; } else if (left.id() != right.id()) { -are_equal = false; +return false; } else { +// First try to compute fingerprints +if (check_metadata) { + const auto& left_metadata_fp = left.metadata_fingerprint(); + const auto& right_metadata_fp = right.metadata_fingerprint(); + if (left_metadata_fp != right_metadata_fp) { +return false; + } +} + +const auto& left_fp = left.fingerprint(); +const auto& right_fp = right.fingerprint(); +if (!left_fp.empty() && !right_fp.empty()) { + return left_fp == right_fp; +} + +// TODO remove check_metadata here? internal::TypeEqualsVisitor visitor(right, check_metadata); auto error = VisitTypeInline(left, ); if (!error.ok()) { DCHECK(false) << "Types are not comparable: " << error.ToString(); } -are_equal = visitor.result(); +return visitor.result(); } - return are_equal; } bool ScalarEquals(const Scalar& left, const Scalar& right) { diff --git a/cpp/src/arrow/extension_type-test.cc b/cpp/src/arrow/extension_type-test.cc index 2f680af..06fd6a9 100644 --- a/cpp/src/arrow/extension_type-test.cc +++
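The optimization above is a classic memoized-fingerprint scheme: encode a type, children included, into a string once, cache it, and compare strings thereafter. A simplified Python analogue (the encoding and class are invented for illustration; the C++ implementation additionally falls back to the visitor-based comparison when a type has no fingerprint):

```python
class DataType:
    """A nested type whose equality is decided by a lazily cached fingerprint."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self._fingerprint = None  # computed on first use, then reused

    def fingerprint(self):
        if self._fingerprint is None:
            # Children's fingerprints are themselves cached, so the deep
            # walk over the type tree happens at most once per type object.
            self._fingerprint = "{}[{}]".format(
                self.name, ",".join(c.fingerprint() for c in self.children))
        return self._fingerprint

    def equals(self, other):
        if self is other:
            return True
        # A flat string comparison replaces a recursive structural walk,
        # which is where the benchmark speedups for nested types come from.
        return self.fingerprint() == other.fingerprint()
```

This only pays off when equality is checked repeatedly on the same type objects, which is exactly the schema-comparison workload the benchmarks measure.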
[arrow] branch master updated: ARROW-5862: [Java] Provide dictionary builder
This is an automated email from the ASF dual-hosted git repository. emkornfield pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 1f5ebd0 ARROW-5862: [Java] Provide dictionary builder 1f5ebd0 is described below commit 1f5ebd0fae2c49831d9c52c64bd5e1b81e1b860a Author: liyafan82 AuthorDate: Thu Aug 15 19:54:18 2019 -0700 ARROW-5862: [Java] Provide dictionary builder The dictionary builder serves a scenario frequently encountered in practice when dictionary encoding is involved: the dictionary values are not known a priori, so they are determined dynamically, as new data arrive continually. In particular, when a new value arrives, it is tested to check if it is already in the dictionary. If so, it is simply neglected; otherwise, it is added to the dictionary. When all values have been evaluated, the dictionary can be considered complete, and encoding can start afterward. The code snippet using a dictionary builder should be like this: DictionaryBuilder dictionaryBuilder = ... dictionaryBuilder.startBuild(); ... dictionaryBuilder.addValue(newValue); ... 
dictionaryBuilder.endBuild(); Closes #4813 from liyafan82/fly_0705_build and squashes the following commits: 2007b87c7 Provide dictionary builder Authored-by: liyafan82 Signed-off-by: Micah Kornfield --- .../SearchTreeBasedDictionaryBuilder.java | 162 +++ .../TestSearchTreeBasedDictionaryBuilder.java | 222 + 2 files changed, 384 insertions(+) diff --git a/java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/SearchTreeBasedDictionaryBuilder.java b/java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/SearchTreeBasedDictionaryBuilder.java new file mode 100644 index 000..a6f5642 --- /dev/null +++ b/java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/SearchTreeBasedDictionaryBuilder.java @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.algorithm.dictionary; + +import java.util.TreeSet; + +import org.apache.arrow.algorithm.sort.VectorValueComparator; +import org.apache.arrow.vector.ValueVector; + +/** + * A dictionary builder is intended for the scenario frequently encountered in practice: + * the dictionary is not known a priori, so it is generated dynamically. 
+ * In particular, when a new value arrives, it is tested to check if it is already + * in the dictionary. If so, it is simply neglected, otherwise, it is added to the dictionary. + * + * + * This class builds the dictionary based on a binary search tree. + * Each add operation can be finished in O(log(n)) time, + * where n is the current dictionary size. + * + * + * The dictionary builder is intended to build a single dictionary. + * So it cannot be used for different dictionaries. + * + * Below gives the sample code for using the dictionary builder + * {@code + * SearchTreeBasedDictionaryBuilder dictionaryBuilder = ... + * ... + * dictionaryBuild.addValue(newValue); + * ... + * } + * + * + * With the above code, the dictionary vector will be populated, + * and it can be retrieved by the {@link SearchTreeBasedDictionaryBuilder#getDictionary()} method. + * After that, dictionary encoding can proceed with the populated dictionary. + * + * @param the dictionary vector type. + */ +public class SearchTreeBasedDictionaryBuilder { + + /** + * The dictionary to be built. + */ + private final V dictionary; + + /** + * The criteria for sorting in the search tree. + */ + protected final VectorValueComparator comparator; + + /** + * If null should be encoded. + */ + private final boolean encodeNull; + + /** + * The search tree for storing the value index. + */ + private TreeSet searchTree; + + /** + * Construct a search tree-based dictionary builder. + * @param dictionary the dictionary vector. + * @param comparator the criteria for value equality. + */ + public SearchTreeBasedDictionaryBuilder(V dictionary,
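For contrast with the hash-table variant from ARROW-6185, the search-tree builder keeps distinct values ordered and tests membership by binary search. A rough Python sketch (Python's stdlib has no balanced tree, so this uses `bisect` over a sorted list: O(log n) lookup, though list insertion is O(n), unlike the Java `TreeSet` the commit uses):

```python
import bisect


class SearchTreeDictionaryBuilder:
    """Accumulate distinct values in sorted order via binary search."""

    def __init__(self):
        self._sorted = []  # sorted distinct values, standing in for TreeSet

    def add_value(self, value):
        pos = bisect.bisect_left(self._sorted, value)
        if pos < len(self._sorted) and self._sorted[pos] == value:
            return False  # already present: the new value is neglected
        self._sorted.insert(pos, value)
        return True

    def get_dictionary(self):
        # Unlike the hash-table builder, the dictionary comes out sorted,
        # which is what makes search-based encoding possible afterward.
        return list(self._sorted)
```

The trade-off against the hash-table builder is the usual one: slower inserts in exchange for an ordered dictionary and no extra hash-map memory.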
[arrow] branch master updated (4b971ee -> 3420d30)
This is an automated email from the ASF dual-hosted git repository. ravindra pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git. from 4b971ee ARROW-6185: [Java] Provide hash table based dictionary builder add 3420d30 ARROW-6208: [Java] Correct byte order before comparing in ByteFunctionHelpers No new revisions were added by this update. Summary of changes: .../arrow/memory/util/ByteFunctionHelpers.java | 4 ++-- .../arrow/memory/util/TestByteFunctionHelpers.java | 22 ++ 2 files changed, 24 insertions(+), 2 deletions(-)